HWPOISON

August 26, 2009

This article was contributed by Jon Ashburn

One downside to the ever-increasing memory size available on computers is an increase in memory failures. As memory density increases, error rates also rise. To offset this increased error rate, recent processors have included support for "poisoned" memory, an adaptive method for flagging and recovering from memory errors. The HWPOISON patch, recently developed by Andi Kleen and Fengguang Wu, provides Linux kernel support for memory poisoning. Thus, when HWPOISON is coupled with the appropriate fault-tolerant processors, Linux users can enjoy systems that are more tolerant of memory errors in spite of increased memory densities.

Memory errors are classified as either soft (transient) or hard (permanent). In soft errors, cosmic rays or other random events can toggle the state of a bit in an SRAM or DRAM memory cell. In hard errors, memory cells become physically degraded. Hardware can detect, and automatically correct, some of these errors via Error Correcting Codes (ECC). While single bit data errors can be corrected via ECC, multi-bit data errors cannot. For these uncorrectable errors, the hardware typically generates a trap which, in turn, causes a kernel panic.
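To make the single-bit versus multi-bit distinction concrete, here is a toy sketch of a SECDED (single error correct, double error detect) code. It uses an extended Hamming(8,4) code over four data bits, which is an assumption made for brevity; memory ECC typically protects 64-bit words with eight check bits, but the behavior illustrated is the same. The function names are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (a list of bits)."""
    code = [0] * 8                      # index 0: overall parity; 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    for p in (1, 2, 4):                 # each parity bit covers positions with bit p set
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    code[0] = sum(code[1:]) % 2         # makes the parity of the whole word even
    return code


def decode(code):
    """Return (status, nibble); nibble is None when the error is uncorrectable."""
    code = list(code)
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i               # XOR of the positions holding a 1
    odd_parity = sum(code) % 2 == 1
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        code[syndrome] ^= 1             # single flip; syndrome 0 is the parity bit itself
        status = "corrected"
    else:
        return "uncorrectable", None    # nonzero syndrome with even parity: two flips
    return status, code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
```

Flipping any one bit of a codeword yields "corrected" with the original data, while flipping two bits yields "uncorrectable". The second case is the one that would traditionally lead to a trap and a kernel panic.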

The blanket action of crashing the machine for all uncorrected soft and hard memory errors is sometimes an overreaction. If the detected memory error never actually corrupts executing software, then ignoring or isolating the error is the most desirable action. Memory "poisoning", with its delayed handling of errors, allows for a more graceful recovery from and isolation of uncorrected memory errors rather than just crashing the system. However, memory poisoning requires both hardware and kernel support.

The HWPOISON patch is very timely: Intel's recent preview of its Xeon processor (codenamed Nehalem-EX) promises support for memory poisoning. Intel has included its Machine Check Abort (MCA) Recovery architecture in Nehalem-EX. Originally developed for ia64 processors, Intel's MCA Recovery architecture supports memory poisoning and various other hardware failure recovery mechanisms. While HWPOISON adopted Intel's usage of the term "poisoning", this should not be confused with the unrelated Linux kernel concept of poisoning: writing a pattern to memory to catch use of uninitialized memory.

While the specifics of how hardware and the kernel might implement memory poisoning vary, the general concept is as follows. First, hardware detects an uncorrectable error from memory transfers into the system cache or on the system bus. Alternatively, memory may be occasionally "scrubbed." That is, a background process may initiate an ECC check on one or more memory pages. In either case, the hardware doesn't immediately cause a machine check but rather flags the data unit as poisoned until read (or consumed). Later, when erroneous data is read by executing software, a machine check is initiated. If the erroneous data is never read, no machine check is necessary. For example, a modified cache line written back to main memory may have a data word error that is marked as poisoned. Once the poisoned data is actually used (loaded into a processor register, etc.), a machine check occurs, but not before. Thus, any poisoning machine check event may happen long after the corresponding data error event.
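The deferred behavior described above can be modeled in a few lines. This is only a conceptual sketch of the hardware side: the class and method names are hypothetical, and real hardware tracks poison in cache lines and memory words rather than in a Python set.

```python
class MachineCheck(Exception):
    """Stands in for the machine check raised when poisoned data is consumed."""


class PoisonedMemory:
    def __init__(self, size):
        self.words = [0] * size
        self.poisoned = set()           # addresses flagged as holding bad data

    def detect_uncorrectable(self, addr):
        # Detection (by a scrubber, or on a cache writeback) only flags the word.
        self.poisoned.add(addr)

    def read(self, addr):
        # The error surfaces only when the bad data is actually consumed.
        if addr in self.poisoned:
            raise MachineCheck(addr)
        return self.words[addr]
```

In this model, calling detect_uncorrectable(2) has no visible effect until some later read(2), which may happen much later or never.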

HWPOISON is a poisoned data handler invoked by the low-level Linux machine check code. Where possible, HWPOISON attempts to gracefully recover from memory errors, and contain faulty hardware to prevent future errors. At first glance, an obvious solution for the poison handler would focus on the specific process and memory address(es) associated with the data error. However, this is infeasible for two reasons. First, the offending instruction and process cannot be determined due to delays between the data error consumption and execution of the poison handler. These delays include asynchronous hardware reporting of the machine check event, and delayed execution of the handler via a workqueue. Thus, a different process may be executing by the time the HWPOISON handler is ready to act. Second, bad-memory containment must be done at a level where the kernel actually manages memory. Thus, HWPOISON focuses on memory containment at the page granularity rather than the finer granularity supported by Intel's MCA Recovery hardware.

HWPOISON finds the page containing the poisoned data and attempts to isolate this page from further use. Potentially corrupted processes can then be located by finding all processes that have the corrupted page mapped. HWPOISON performs a variety of different actions. Its exact behavior depends upon the type of corrupted page and various kernel configuration parameters.
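As a rough illustration of working at page granularity, the sketch below maps a physical address to its page frame number and then looks up which processes have that frame mapped. It assumes 4 KiB pages and uses a plain dictionary in place of the kernel's reverse-mapping machinery; all names here are hypothetical.

```python
PAGE_SHIFT = 12   # assumption: 4 KiB pages


def page_frame(phys_addr):
    """Return the page frame number containing a physical address."""
    return phys_addr >> PAGE_SHIFT


def processes_mapping(pfn, mappings):
    """Return the sorted PIDs whose set of mapped frames includes pfn.

    mappings: dict of pid -> set of page frame numbers (a stand-in for rmap).
    """
    return sorted(pid for pid, pfns in mappings.items() if pfn in pfns)
```

Given an error at physical address 0x1234abc, page_frame() yields frame 0x1234, and every process returned by processes_mapping() for that frame is a candidate for corruption.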

To enable the HWPOISON handler, the kernel configuration parameter MEMORY_FAILURE must be set. Otherwise, hardware poisoning will cause a system panic. Additionally, the architecture must support data poisoning. As of this writing, HWPOISON is enabled for all architectures to make testing on any machine possible via a user-mode fault injector, which is detailed below.

The handler must allow for multiple poisoning events occurring in a short time window. HWPOISON uses a bit in the flags field of a struct page to mark and lock a page as poisoned. Since page flags are currently in short supply, this choice was not made without consternation and debate by kernel hackers. See this LWN article for further details about this issue. In any case, this bit allows previously poisoned pages to be ignored by the handler.
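The role of the flag bit can be sketched as a test-and-set operation: the first event on a page claims it, and later events on the same page see the bit already set and return. The bit position and function names below are hypothetical, and this single-threaded Python model is not atomic, unlike the bit operation a kernel would need.

```python
PG_HWPOISON = 1 << 0   # hypothetical bit position, for illustration only


class Page:
    def __init__(self):
        self.flags = 0


def mark_poisoned(page):
    """Set the poison bit and report whether it was already set."""
    was_set = bool(page.flags & PG_HWPOISON)
    page.flags |= PG_HWPOISON
    return was_set


def handle_poison_event(page, handled):
    """Handle one poisoning event for page; repeated events are dropped."""
    if mark_poisoned(page):
        return False          # an earlier event already claimed this page
    handled.append(page)      # stand-in for the real isolation work
    return True
```

Calling handle_poison_event() twice on the same page does the isolation work once; the second call returns False without touching the page again.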

The handler ignores the following types of pages: 1) pages that have been previously poisoned, 2) pages that are outside of kernel control (an invalid page frame number), 3) reserved kernel pages, and 4) pages with a usage count of zero, which implies either a free page or a higher-order kernel page. The poisoned bit in the flags field serves as a lock: rapid-fire poisoning machine checks on the same page are handled only once, because subsequent calls to the handler are ignored. Reserved kernel pages and zero-count pages are ignored at the peril of a system panic. These pages may contain critical kernel data and cannot be isolated, so HWPOISON has no useful options for recovery.

In addition to ignoring pages, possible HWPOISON actions include recovery, delay, and failure. Recovery means HWPOISON took action to isolate a page. Ignore, failure, and delay are all similar in that the page was not completely isolated, except for flagging the page as poisoned. With delay, handling can be safely postponed until a later time when the page might be referenced. By delaying, some transient errors may not recur or may be irrelevant. HWPOISON delays any action on kernel slab or buddy allocator pages or free pages. With failure, HWPOISON could in principle handle the page, but does not currently support doing so. HWPOISON takes an action of failure on unknown or huge pages. Huge pages fail since reverse mapping is not supported to identify the process which owns the page.
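The four outcomes can be summarized as a small decision function. This is a simplified model of the rules as described in this article, not the patch's actual code; the Page fields and the kind strings are hypothetical. Free pages appear under both "ignore" and "delay" in the description above, and this sketch applies the usage-count rule first.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    IGNORED = "ignored"
    DELAYED = "delayed"
    FAILED = "failed"
    RECOVERED = "recovered"


@dataclass
class Page:
    # All fields are hypothetical stand-ins for kernel page state.
    valid_pfn: bool = True
    poisoned: bool = False
    reserved: bool = False
    refcount: int = 1
    kind: str = "page_cache"   # or "slab", "buddy", "huge", "unknown", "anon"


def classify(page):
    """Map a page to the action this article describes for it."""
    if page.poisoned or not page.valid_pfn or page.reserved or page.refcount == 0:
        return Action.IGNORED      # nothing useful the handler can do
    if page.kind in ("slab", "buddy"):
        return Action.DELAYED      # postponed until the page is referenced
    if page.kind in ("huge", "unknown"):
        return Action.FAILED       # handling is not supported
    return Action.RECOVERED        # the handler attempts to isolate the page
```

For example, classify(Page(kind="huge")) gives Action.FAILED, while an ordinary page cache page with a nonzero usage count gives Action.RECOVERED.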

Clean pages in either the swap or page cache can be easily recovered by invalidating the cache entry for these pages. Since these pages have a duplicate backing copy on disk, the in-memory cache copy can be invalidated. Unlike clean pages, dirty pages in these caches have differences between the memory and disk copies. Thus, poisoned dirty pages may hold corrupted data that exists nowhere else. Even so, dirty pages in the page cache are recovered by invalidation of the cache. Additionally, a page error is set for the dirty page cache page so subsequent user system calls on the file associated with the page will return an I/O error. Dirty pages in the swap cache are handled in a delayed fashion. The dirty flag is cleared for the page and the page swap cache entry is maintained. On a later page fault the associated application will be killed.
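The cache cases above reduce to a small table keyed on the cache type and the dirty state. The sketch below simply encodes the steps as listed in this article; the function name and step strings are hypothetical and do not correspond to kernel identifiers.

```python
def cache_page_steps(cache, dirty):
    """Return the recovery steps described for a poisoned cache page.

    cache: "page" for the page cache, "swap" for the swap cache.
    dirty: True if the in-memory copy differs from the copy on disk.
    """
    if cache not in ("page", "swap"):
        raise ValueError(f"unknown cache type: {cache!r}")
    if not dirty:
        # A clean page has an intact backing copy on disk.
        return ["invalidate cache entry"]
    if cache == "page":
        return ["invalidate cache entry",
                "set page error so later system calls on the file fail"]
    # Dirty swap cache page: handling is deferred to a later page fault.
    return ["clear dirty flag",
            "keep swap cache entry",
            "kill application on later page fault"]
```

Only the dirty cases carry a risk of data loss; the clean cases cost at most a re-read from disk.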

To recover from poisoned, user-mapped pages, HWPOISON first finds all user processes which mapped the corrupted page. For clean pages with backing store, HWPOISON need not take recovery action since the process does not need to be killed. Dirty pages are unmapped from all associated processes, which are subsequently killed. Two VM sysctl parameters are supported by HWPOISON with respect to killing user processes: vm.memory_failure_early_kill and vm.memory_failure_recovery. Setting the vm.memory_failure_early_kill parameter causes an immediate SIGBUS to be sent to the user process(es). The kill is done using a catchable SIGBUS with BUS_MCEERR_AO. Thus, processes can decide how they want to handle the data poisoning. With vm.memory_failure_early_kill set to 0, the killing is delayed: the page is merely unmapped by HWPOISON. When this unmapped page is actually referenced at a later time, a SIGBUS will be sent. The vm.memory_failure_recovery parameter controls whether recovery is attempted at all; setting it to 0 makes the kernel panic on a memory failure.
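On a kernel built with this support, the current setting can be inspected through /proc/sys/vm. The following is a minimal sketch, assuming the sysctls are exposed under that directory with the names given above; the helper names are hypothetical, and a missing file is treated as "feature not available".

```python
from pathlib import Path


def read_vm_sysctl(name, root="/proc/sys/vm"):
    """Return the integer value of a vm sysctl, or None if it is not readable."""
    try:
        return int((Path(root) / name).read_text().strip())
    except (FileNotFoundError, PermissionError):
        return None


def kill_policy(root="/proc/sys/vm"):
    """Summarize how processes mapping a poisoned page would be signalled."""
    early = read_vm_sysctl("memory_failure_early_kill", root)
    if early is None:
        return "unavailable"
    # Nonzero: SIGBUS is sent as soon as the corruption is detected.
    return "early" if early else "deferred"
```

Calling kill_policy() with no arguments reports the running system's setting; the root parameter exists only so the helper can be pointed at a test directory.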

An HWPOISON patch git repository is available at

    git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6.git hwpoison

Since faulty hardware that supports data poisoning is not easy to come by, a fault injection test harness mm/hwpoison-inject.c has also been developed. This simple harness uses debugfs to allow failures at an arbitrary page to be injected.
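Using the injector amounts to writing a page frame number to a debugfs file. The sketch below assumes debugfs is mounted at /sys/kernel/debug and that the injector exposes a file named hwpoison/corrupt-pfn, as it does in mainline kernels; the version of the patch discussed here may differ. The function name is hypothetical. Injecting a fault requires root and really does poison the page, so any process using it may be killed.

```python
def inject_poison(pfn, path="/sys/kernel/debug/hwpoison/corrupt-pfn"):
    """Ask the injector to treat page frame pfn as poisoned.

    The default path is an assumption based on mainline kernels.
    """
    if not isinstance(pfn, int) or pfn < 0:
        raise ValueError("pfn must be a non-negative integer")
    with open(path, "w") as f:
        f.write(f"{pfn}\n")    # the page frame number, in decimal
```

A test run would pick a page frame known to belong to a disposable test process and then check how that process is signalled.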

While HWPOISON was developed for x86-based machines, interest has been expressed by supporters of other Linux server architectures, such as ia64 and sparc (discussed here). Thus, the patch may proliferate on future Linux server distributions, allowing users of future Linux servers to enjoy increased fault tolerance. Now that Intel is supporting MCA Recovery on x86 machines, some desktop users may also enjoy its benefits in the near future.

HWPOISON

Posted Aug 27, 2009 22:35 UTC (Thu) by giraffedata (guest, #1954)

"First, hardware detects an uncorrectable error from memory transfers into the system cache or on the system bus."
...
"Later, when erroneous data is read by executing software, a machine check is initiated."
...
"First, the offending instruction and process cannot be determined due to delays between the data error consumption and execution of the poison handler. These delays include asynchronous hardware reporting of the machine check event,"

How can a machine check for accessing erroneous memory contents be asynchronous? An instruction to load some data from memory didn't get the data because it's been destroyed. How can the CPU continue executing and generate a machine check at some arbitrarily later time?

HWPOISON

Posted Aug 28, 2009 4:31 UTC (Fri) by roelofs (guest, #2599)

Er, maybe I'm missing the thrust of your question, but I thought it was sort of straightforward: the hardware detects the problem as soon as memory is read--imagine a bad bit in a single byte out of a page or a cacheline's worth read--but the specific bad subset of that memory (the byte) may not be used until much later, or not at all.

Or are you asking about something much more subtle?

Greg

HWPOISON

Posted Aug 28, 2009 7:10 UTC (Fri) by giraffedata (guest, #1954)

"the hardware detects the problem as soon as memory is read--imagine a bad bit in a single byte out of a page or a cacheline's worth read--but the specific bad subset of that memory (the byte) may not be used until much later, or not at all."

Yes, that's the scenario in the sentences I excerpted from the article.

And they go on to say that the poison handler runs some time after the time that the specific bad subset is used. It refers to the specific bad subset being used as "data error consumption" and the instruction that uses it as the "offending instruction" and says you can't simply locate the offending instruction and thereby the memory location and the process that are affected by the bad memory, because of the delay.

Maybe the article is confusing multiple scenarios. I can definitely see a design where the machine check happens, and the OS deals with it, before the data error is consumed. But that's not the case the article describes.

HWPOISON

Posted Aug 31, 2009 6:36 UTC (Mon) by jzbiciak (guest, #5246)

There are a couple things at play here:

1) The MCA can occur on any "word", where "word" is defined by the width of the ECC code applied at the corresponding level of memory. It could be a 64-bit word on a 64-bit + 8-bit DRAM bus, or it could be on the order of a 64-byte cache line. (I think Athlon's on-chip ECC works on whole cache lines, but I admit to not knowing for sure. I know a particular DSP core's L2 cache ECC works in terms of 256-bit data phases on the chips that support that feature.)

2) The CPU need not have referenced the particular word that triggered the fault. A CPU read, or better yet, a data prefetch (either triggered explicitly by an instruction or implicitly by a prefetch engine) may have triggered the memory reference that triggered the MCA. If the faulting word is due to a prefetch, or is late in the cache line that was read due to a demand fetch, that data may arrive at the CPU quite long after the instruction that triggered that line fill.

3) Whether or not the CPU referenced the particular word that triggered the fault, the existing MCA may consider such faults catastrophic at the task level, and so does not bother to precisely track which instruction(s) may have consumed the bogus data. (See Chapter 15 in this reference where it says: "The implementation of the machine-check architecture does not ordinarily permit the processor to be restarted reliably after generating a machine-check exception.") All that's necessary is to keep track of which task(s) to kill, which is mainly a function of keeping track of the physical address that had a fault.

4) In some systems, the MC exception could be asserted by the chipset, not the CPU. The chipset may actually detect the fault and alert the CPU via an exception pin, but nothing really aligns that exception to the data's arrival. Note that this property would be system dependent; not all systems would necessarily be this imprecise.

HWPOISON

Posted Aug 31, 2009 6:41 UTC (Mon) by jzbiciak (guest, #5246)

Oh, and I forgot to mention, some machine check exceptions/aborts could have been triggered due to background scrubbing. Background scrubbing is entirely asynchronous to process execution.

HWPOISON

Posted Aug 31, 2009 16:02 UTC (Mon) by dlang (guest, #313)

if background scrubbing triggers a read error with HWPOISON, that defeats the purpose of doing the HWPOISON in the first place, you may as well just die when you first detect the error.

the key of HWPOISON is that not all memory locations contain irreplaceable data. in some cases the memory may not be allocated (so when the program goes to use it, whatever contents are there are going to be erased anyway), in other cases the data exists elsewhere (clean disk buffer pages that can be re-read from disk, etc)

so instead of erroring out when memory corruption is detected, it only throws an error if something tries to make use of the corrupt data, and even then it throws an error that the OS can catch and deal with (since only the OS knows if the data can be replaced by something read from somewhere else)

HWPOISON

Posted Aug 31, 2009 18:50 UTC (Mon) by jzbiciak (guest, #5246)

Background scrubbing works by reading memory locations, checking the ECC, and correcting correctable errors proactively before they become uncorrectable. If background scrubbing detects something uncorrectable, it can (and it seems like it ought to) signal a machine check.

Take a look here:

http://patchwork.kernel.org/patch/16897/

There is a notion of an "action optional" machine check. It's still a machine check, and it can be triggered by scrubbing. Quoting:

"Action Optional means that the CPU detected some form of corruption in the background and tells the OS about using a machine check exception. The OS can then take appropriate action, like killing the process with the corrupted data or logging the event properly to disk."

This code snippet on the linked page illustrates some of the "action optional" machine check exceptions:

+
+	/* known AO MCACODs: handle by calling high level handler */
+	MASK(MCI_UC_SAR|0xfff0, MCI_UC_S|0xc0, AO,
+	     "Action optional: memory scrubbing error", SER),
+	MASK(MCI_UC_SAR|MCACOD, MCI_UC_S|0x17a, AO,
+	     "Action optional: last level cache writeback error", SER),
+

HWPOISON

Posted Aug 31, 2009 21:06 UTC (Mon) by dlang (guest, #313)

yes, that is how things traditionally worked.

however, the win here is to not generate a machine check when corrupted memory is detected, but instead wait to see if it matters.

if a memory location is corrupted, but then written before it's read from, the fact that the memory location was corrupt doesn't matter, nothing ever tried to use the corrupted data.

this can be done in hardware, transparent to the OS. it will make systems less likely to crash at the cost of a little more record keeping in the hardware.

if a memory location is corrupted, but it happens to be in a page that is a clean cache, the OS can respond to the error by throwing away the cached page and retrieving a copy from disk.

since in modern systems a _large_ percentage of memory ends up being occupied by caches, making it so that errors in that memory just cause a momentary slowdown (read to the disk) instead of a system crash is also a significant win.

and finally, if both of the above fail (so the memory contents are irreplaceable) the OS can detect what program it was running on that CPU at the time the read took place, and kill just that program (and log that the program was killed due to hardware memory errors, not an application bug) rather than killing the entire system.

none of these protections guarantee that the system won't crash when cosmic rays hit the ram, but each of these steps makes it less likely to crash.

given common use cases, I wouldn't be surprised to find that these sorts of strategies make systems an order or two of magnitude less likely to crash as a result of memory errors (although the gains in application reliability will not be as large due to the fact that some of the gain is in killing applications instead of the entire system).

HWPOISON

Posted Aug 31, 2009 23:34 UTC (Mon) by jzbiciak (guest, #5246)

You're missing the point. MCE is the mechanism by which the hardware reports the bad page to the operating system. "Action Optional" means the OS can do just as you suggest: try to keep everything running as smoothly as possible and only bring down the affected tasks, if any.

You seem to be assuming "machine check" means "machine halt." It's just the name of the exception vector.

HWPOISON

Posted Aug 31, 2009 23:40 UTC (Mon) by jzbiciak (guest, #5246)

I'll quote Andi Kleen's post (that I linked above) since I think it's abundantly clear:

"Newer Intel CPUs support a new class of machine checks called recoverable action optional. Action Optional means that the CPU detected some form of corruption in the background and tells the OS about using a machine check exception. The OS can then take appropriate action, like killing the process with the corrupted data or logging the event properly to disk."

Read that again: Background scrubbing gives a machine check. The machine check is action optional and it can do just as you suggest. It's still a machine check.

HWPOISON

Posted Sep 1, 2009 17:59 UTC (Tue) by dlang (guest, #313)

as I understand it, HWPOISON changes this.

instead of the background scrub triggering a machine check at that point in time it instead just marks the memory as corrupt (poisoned). the poisoned flag gets cleared if the memory is written to.

if something ever tries to read the poisoned memory, a machine check happens at that point in time.

HWPOISON

Posted Sep 1, 2009 20:24 UTC (Tue) by jzbiciak (guest, #5246)

That's not how I read this. See section 15.6, "Recovery of Uncorrected Recoverable Errors" and especially 15.6.3, "UCR Error Classification".

The first two error types are the "an error was detected, but the CPU hasn't consumed the errant data yet" error types. If you want to pick nits, the first one (UCNA) is not reported as a Machine Check Exception; rather it is reported as a Corrected Machine Check Error Interrupt (described in Section 15.5). My bad for being sloppy; it is a Machine Check Error, but it isn't a Machine Check Exception. The second recoverable error type (SRAO) is a Machine Check Exception, however.

In any case, both are machine checks.

Now flip with me to page 15-34 and look at what SRAO errors are architecturally defined, there in section 15.9.3.1:

"The following two SRAO errors are architecturally defined.
UCR Errors detected by memory controller scrubbing; and
UCR Errors detected during L3 cache (L3) explicit writebacks."

So there we have it. Recoverable, Action Optional Machine Checks due to scrubbing. Can it be any clearer? In case you think this feature is old and was supplanted by something more recent, I urge you to flip back to 15-23 and read along here at the intro to Section 15.6:

"Recovery of uncorrected recoverable machine check errors is an enhancement in machine-check architecture. The first processor that supports this feature is 45nm Intel 64 processor with CPUID signature DisplayFamily_DisplayModel encoding of 06H_2EH. This allow system software to perform recovery action on certain class of uncorrected errors and continue"

If I'm not mistaken, that's the processor family this article was referring to. (This document is dated June 2009, so it's not like it's ancient.)

Do you have different documentation that suggests otherwise?

HWPOISON

Posted Sep 1, 2009 21:28 UTC (Tue) by dlang (guest, #313)

ok, I think the question comes down to this

is HWPOISON a hardware level feature or an OS level feature?

if it's a hardware level feature (which is what I understood from the original article) then it wouldn't necessarily cause a machine check error ever.

if this is instead a difference in how the OS responds to a memory error I just completely misunderstood what's happening.

HWPOISON

Posted Sep 1, 2009 23:59 UTC (Tue) by jzbiciak (guest, #5246)

The overall "HWPOISON feature" is both a hardware feature and a software feature. There is a hardware component (the newly improved Machine Check Architecture in the CPU), and then there's the OS handler that makes use of it.

A machine check error (whether delivered as an exception or an interrupt--the new MCA does both depending on the error type) is a message from the hardware to the software. In the most recent Intel architectures, they support a notion of "recoverable machine check," wherein the hardware tells the OS that no CPU state was corrupted when it noticed the problem. If you look at that PDF I linked, there are a number of status bits (including AR--Action Required) that indicate the severity of the error. There's a separate table in Intel's PDF that suggests the possible OS responses to a particular error.

Once the hardware delivers the message to the OS (via a machine check), the OS is then free to deal with the machine check however it pleases. For "Action Optional" machine checks that can happen asynchronously to program execution (such as due to scrubbing), the OS can queue up a handler to go deal with the affected page, either by poisoning it or unmapping it or what-have-you. That's the stuff Andi Kleen and co.'s patch does.

HWPOISON

Posted Sep 1, 2009 20:26 UTC (Tue) by jzbiciak (guest, #5246)

I guess what you're missing is who marks the memory as poisoned. The CPU sends a machine check to the OS. The OS marks the memory as poisoned, or otherwise discards the contents of the page if it was clean. The HWPOISON patch provides the OS handler and hooks to poison the page (or do whatever needs doing) when the machine check arrives.

HWPOISON

Posted Sep 1, 2009 21:31 UTC (Tue) by dlang (guest, #313)

Ok, I was reading this as something new being implemented at the hardware layer by Intel

HWPOISON

Posted Sep 1, 2009 23:47 UTC (Tue) by jzbiciak (guest, #5246)

It's both. The hardware now supports a concept of recoverable machine check, and the software uses it.

HWPOISON

Posted Aug 31, 2009 7:28 UTC (Mon) by kleptog (subscriber, #1183)

"The vm.memory_failure_recovery parameter delays the killing: the page is merely unmapped by HWPOISON. When this unmapped page is actually referenced at a later time then a SIGBUS will be sent."

Perhaps this is handled properly, but by just unmapping, aren't you running the risk that some later memory allocation by that process might get the same virtual address and thus instead of a SIGBUS the process keeps running with corrupted memory? ISTM you want to map a known bad page there instead.

HWPOISON

Posted Sep 8, 2009 11:14 UTC (Tue) by robbe (guest, #16131)

Why is this ALL-CAPS TECHNOLOGY? Was something in the engineers' infrastructure missing the fifth bits (due to faulty memory perhaps)?

In a more serious vein, I found the article less clear and harder to read than the usual material on the kernel page.

ECC is able to recover from multib(i|y)te errors

Posted Dec 4, 2009 9:00 UTC (Fri) by Milan (guest, #26716)

"While single bit data errors can be corrected via ECC, multi-bit data errors cannot."

It depends on how long the data and the ECC code are. A longer ECC code provides the capability to correct (and detect) more bits.

ECC is able to recover from multib(i|y)te errors

Posted Dec 4, 2009 12:51 UTC (Fri) by dlang (guest, #313)

in theory yes, but you are forgetting that we are talking about a very standardized use of ECC, namely what is implemented in RAM and memory controllers. that implementation only corrects single-bit errors

HWPOISON

Posted Sep 29, 2017 20:24 UTC (Fri) by mcoulter (guest, #118826)

There are several instances of a link to an Intel document:
http://download.intel.com/design/processor/manuals/253668...

This link is broken. I found a different 253668.pdf but it does not have:
section 15.6, "Recovery of Uncorrected Recoverable Errors"
or 15.6.3, "UCR Error Classification"

Can anyone provide a current link to this manual?

HWPOISON

Posted Sep 29, 2017 22:14 UTC (Fri) by mcoulter (guest, #118826)

Found it from

Intel 64 and IA-32 Architectures Software Developer's Manuals
https://software.intel.com/en-us/articles/intel-sdm

in the 4 volume set

Combined Volume Set of Intel® 64 and IA-32 Architectures Software Developer’s Manuals

https://software.intel.com/sites/default/files/managed/39...

It is in volume 3:
Intel® 64 and IA-32 architectures software developer's manual combined volumes 3A, 3B, 3C, and 3D: System programming guide

https://software.intel.com/sites/default/files/managed/a4...