| What is hwpoison? |
| |
| Upcoming Intel CPUs have support for recovering from some memory errors |
| (``MCA recovery''). This requires the OS to declare a page "poisoned", |
| kill the processes associated with it and avoid using it in the future. |
| |
| This patchkit implements the necessary infrastructure in the VM. |
| |
| To quote the overview comment: |
| |
| * High level machine check handler. Handles pages reported by the |
| * hardware as being corrupted usually due to a 2bit ECC memory or cache |
| * failure. |
| * |
| * This focusses on pages detected as corrupted in the background. |
| * When the current CPU tries to consume corruption the currently |
| * running process can just be killed directly instead. This implies |
| * that if the error cannot be handled for some reason it's safe to |
| * just ignore it because no corruption has been consumed yet. Instead |
| * when that happens another machine check will happen. |
| * |
| * Handles page cache pages in various states. The tricky part |
| * here is that we can access any page asynchronous to other VM |
| * users, because memory failures could happen anytime and anywhere, |
| * possibly violating some of their assumptions. This is why this code |
| * has to be extremely careful. Generally it tries to use normal locking |
| * rules, as in get the standard locks, even if that means the |
| * error handling takes potentially a long time. |
| * |
| * Some of the operations here are somewhat inefficient and have non |
| * linear algorithmic complexity, because the data structures have not |
| * been optimized for this case. This is in particular the case |
| * for the mapping from a vma to a process. Since this case is expected |
| * to be rare we hope we can get away with this. |
| |
| The code consists of a the high level handler in mm/memory-failure.c, |
| a new page poison bit and various checks in the VM to handle poisoned |
| pages. |
| |
| The main target right now is KVM guests, but it works for all kinds |
| of applications. KVM support requires a recent qemu-kvm release. |
| |
| For the KVM use there was need for a new signal type so that |
| KVM can inject the machine check into the guest with the proper |
| address. This in theory allows other applications to handle |
| memory failures too. The expection is that near all applications |
| won't do that, but some very specialized ones might. |
| |
| --- |
| |
| There are two (actually three) modi memory failure recovery can be in: |
| |
| vm.memory_failure_recovery sysctl set to zero: |
| All memory failures cause a panic. Do not attempt recovery. |
| (on x86 this can be also affected by the tolerant level of the |
| MCE subsystem) |
| |
| early kill |
| (can be controlled globally and per process) |
| Send SIGBUS to the application as soon as the error is detected |
| This allows applications who can process memory errors in a gentle |
| way (e.g. drop affected object) |
| This is the mode used by KVM qemu. |
| |
| late kill |
| Send SIGBUS when the application runs into the corrupted page. |
| This is best for memory error unaware applications and default |
| Note some pages are always handled as late kill. |
| |
| --- |
| |
| User control: |
| |
| vm.memory_failure_recovery |
| See sysctl.txt |
| |
| vm.memory_failure_early_kill |
| Enable early kill mode globally |
| |
| PR_MCE_KILL |
| Set early/late kill mode/revert to system default |
| arg1: PR_MCE_KILL_CLEAR: Revert to system default |
| arg1: PR_MCE_KILL_SET: arg2 defines thread specific mode |
| PR_MCE_KILL_EARLY: Early kill |
| PR_MCE_KILL_LATE: Late kill |
| PR_MCE_KILL_DEFAULT: Use system global default |
| PR_MCE_KILL_GET |
| return current mode |
| |
| |
| --- |
| |
| Testing: |
| |
| madvise(MADV_POISON, ....) |
| (as root) |
| Poison a page in the process for testing |
| |
| |
| hwpoison-inject module through debugfs |
| |
| /sys/debug/hwpoison/ |
| |
| corrupt-pfn |
| |
| Inject hwpoison fault at PFN echoed into this file. |
| |
| unpoison-pfn |
| |
| Software-unpoison page at PFN echoed into this file. This |
| way a page can be reused again. |
| This only works for Linux injected failures, not for real |
| memory failures. |
| |
| Note these injection interfaces are not stable and might change between |
| kernel versions |
| |
| corrupt-filter-dev-major |
| corrupt-filter-dev-minor |
| |
| Only handle memory failures to pages associated with the file system defined |
| by block device major/minor. -1U is the wildcard value. |
| This should be only used for testing with artificial injection. |
| |
| Architecture specific MCE injector |
| |
| x86 has mce-inject, mce-test |
| |
| Some portable hwpoison test programs in mce-test, see blow. |
| |
| --- |
| |
| References: |
| |
| http://halobates.de/mce-lc09-2.pdf |
| Overview presentation from LinuxCon 09 |
| |
| git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git |
| Test suite (hwpoison specific portable tests in tsrc) |
| |
| git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git |
| x86 specific injector |
| |
| |
| --- |
| |
| Limitations: |
| |
| - Not all page types are supported and never will. Most kernel internal |
| objects cannot be recovered, only LRU pages for now. |
| - Right now hugepage support is missing. |
| |
| --- |
| Andi Kleen, Oct 2009 |
| |