
KHB: Failure-oblivious computing

June 26, 2006

This article was contributed by Valerie Aurora

[Editor's note: this is the second in the Kernel Hacker's Bookshelf series by Valerie Henson; if you missed it, the first article is over here.]

Computer programs have bugs. As programmers, we know that this is inevitable, given the trade-off of time and money against creating a perfect system. Systems with nearly-zero bug counts exist (e.g., the Shuttle software, with only 17 bugs in 420,000 lines of code over the last 11 releases), but they require vast amounts of work to achieve this level of correctness, work that is completely unjustifiable for most programs (such as desktop operating systems). But we're programmers; it's our job to replace time and money with smart ideas.

What would happen if, when a program had a memory error, it detected that error, ignored it, and drove happily on, oblivious to the failure? You would expect that this would result in horrible errors and obscure crashes. But what if it worked - or even made things better? For example, failing to check the size of a memory copy operation can result in a buffer overflow attack. Could we do something clever that would both paper over the memory error and keep the application running, more or less on track?

A Solution

Martin Rinard and a few of his colleagues got to wondering about this question and decided to test it - and found that the answer was yes, you can automatically handle memory bugs in a better, safer way than either ignoring the bug or terminating the program. I first heard of their technique, Failure-Oblivious Computing, at their talk at OSDI 2004. The talk was quite lively; if there was a "Most Laughs per Minute" award, Martin Rinard would have won it.

The explanation of how failure-oblivious computing is implemented might seem utterly crazy, but stick with me. Remember, the amazing thing about failure-oblivious computing is that when you implement it, it works! (At least for quite a few useful applications.) The basic idea is to detect memory errors - out-of-bound reads, out-of-bound writes - and instead of killing the program, handle otherwise fatal errors by turning them into relatively benign bugs. Detecting the memory errors requires a "safe-C compiler" - a C compiler that adds run-time memory access checks.

Safe-C compilers (and languages that always check memory accesses) have been around for a long time. When they detect a memory error, the process gets a segmentation fault, and usually exits shortly thereafter. In failure-oblivious computing, the application never even knows the memory error happened. In the case of an out-of-bounds write, the write is silently thrown away and execution continues. Handling out-of-bounds reads is slightly harder. In this case, a made-up value is manufactured and returned.
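A minimal sketch of what those inserted checks might do (the helper names are made up for illustration, not the actual safe-C compiler machinery — a real implementation tracks the bounds of every allocation automatically):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical run-time checks a safe-C compiler might insert around
 * every array access, made failure-oblivious: out-of-bounds writes are
 * silently dropped; out-of-bounds reads return a manufactured value
 * (here simply 0 for brevity). */

static void checked_write(char *buf, size_t len, size_t i, char v)
{
    if (i < len)                /* in bounds: perform the store */
        buf[i] = v;
    /* out of bounds: discard the write and keep going */
}

static char checked_read(const char *buf, size_t len, size_t i)
{
    if (i < len)                /* in bounds: perform the load */
        return buf[i];
    return 0;                   /* out of bounds: manufacture a value */
}
```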

How do you pick which value to return? Two observations lie behind the answer. First, 0 and 1 are the most common values in computation. Second, sometimes the program is looking for a particular value before returning, such as searching for a particular ASCII character in a string, or iterating through a loop 100 times. The result is a series of return values that looks something like this:

    0, 1, 2, 0, 1, 3, 0, 1, 4,...

So you throw away invalid writes, and make up stuff to return for invalid reads. Crazy, right? But crazy like a fox.
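One way a sequence like the one above could be produced, as a sketch (the function name is invented; the paper's actual generator may differ in detail): small values dominate, but an ever-increasing value is interleaved so that code scanning for any particular number eventually finds it and terminates.

```c
#include <assert.h>

/* Sketch of a generator for manufactured read values: mostly 0s and 1s
 * (the most common values in computation), interleaved with a rising
 * value (2, 3, 4, ...) so that searches and loops terminate. */
static int next_manufactured_value(void)
{
    static int phase = 0;
    static int rising = 2;

    switch (phase++ % 3) {
    case 0:  return 0;
    case 1:  return 1;
    default: return rising++;   /* 2, then 3, then 4, ... */
    }
}
```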

Why does it work?

Failure-oblivious computing is targeted at a particular class of applications: ones with short error-propagation distances - in other words, applications that have relatively short execution paths which return without affecting much global state. This includes a rather useful class of applications, such as web servers, mail servers, and mail readers. It does not include applications like scientific modeling software, in which one wrong value can fatally corrupt the final answer. Programs which handle incoming requests and return to a waiting state, or which have many independent threads of execution, are good candidates for failure-oblivious computing.

Another reason failure-oblivious computing works is that memory errors are transformed into input errors. Since programs already have to deal with invalid or malicious input, the result is often an anticipated error, one the program knows how to handle cleanly. For example, a buffer overflow attack on Sendmail uses a malformed, too-long, illegal email address to overwrite some other part of the program's memory. This technique silently discards the writes that go beyond the buffer, and Sendmail continues on to check the validity of the input - whether or not it's a correctly formed email address. Answer: no, so throw it away and go on to the next request. At this point, Sendmail is back in known territory and the error has stopped propagating.
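A toy model of that scenario (all names and the deliberately simplistic validity rule are made up, not Sendmail's code): the discarded out-of-bounds writes leave a truncated string, and the input validation that was going to run anyway then rejects it.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { ADDR_MAX = 64 };

/* Toy model: the failure-oblivious runtime drops bytes past the
 * buffer (modeled here with a bounded copy), and the program's own
 * validation (toy rule: exactly one '@') rejects the mangled result,
 * turning a memory error into an ordinary input error. */
static bool accept_address(const char *input)
{
    char addr[ADDR_MAX];

    strncpy(addr, input, ADDR_MAX - 1);   /* writes past the end dropped */
    addr[ADDR_MAX - 1] = '\0';

    const char *at = strchr(addr, '@');
    return at != NULL && strchr(at + 1, '@') == NULL;
}
```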

A limitation of this technique is the cost of memory bounds checking. Applications that need to access memory frequently will probably not be good candidates for this technique. However, applications that are limited by I/O time, or only need to complete before the human user notices a delay, won't be much impacted by the cost. Indeed, humans can't detect delays below about 100 milliseconds - an eternity in computational time.

Failure-oblivious computing in practice

Rinard and his co-authors evaluated failure-oblivious computing with versions of several commonly used open source applications with known buffer overflow attacks: Sendmail, Pine, Apache, and Midnight Commander. They ran three versions of each program: an unaltered version, one using just safe-C compilation, and one transformed into a failure-oblivious program. In each case, the failure-oblivious version performed acceptably (sometimes better), did not create any new bugs, and did not suffer any security breaches.

One example was the Pine mail reader. It had a bug in processing the "From" field for display in the message index. It needed to add a '\' character in front of certain characters, but allocated a too-small buffer to copy it into. Some "From" fields could overflow the buffer and cause the program to segfault and die. The safe-C version of the program dies as well, because all it can do is detect the buffer overflow. The failure-oblivious version threw away the writes beyond the end of the buffer, and then went on to behave exactly correctly! The length of the "From" field displayed in the index is shorter than the length of the buffer, so the fact that it was truncated too early is unobservable. When the user reads a particular message, a different code path correctly displays the "From" field. Now an email message that would cause Pine to die every time it was started could be correctly displayed and handled.

The performance of failure-oblivious Pine was 1.3 to 8 times slower on certain tasks, but the total elapsed time to respond to user input was still in the low milliseconds range. For interactive use, the slowdown is acceptable. In the case of the Apache server bug, the performance of the failure-oblivious server was actually better than either of the other two versions. The higher performance was due to the fact that the bug would kill an Apache thread each time it was encountered, incurring the overhead of creating a replacement thread. The failure-oblivious version did not have the overhead of constantly killing and restarting threads and could serve requests much faster.

Especially exciting is the use of failure-oblivious computing for widely used network servers, such as Apache and Sendmail. The paper has in-depth examinations of how buffer overflow bugs are prevented and indeed ignored by the failure-oblivious versions of these and other programs.

What failure-oblivious computing means for Linux

Linux has a huge variety of techniques for improving system security in the face of bugs. SELinux, various stack protection schemes, capabilities - all these techniques help cut down but don't eliminate security problems. Failure-oblivious computing would fill one niche, and in some cases may be the best solution, thanks to its ability to continue running after a normally-fatal memory error. Wouldn't it be nice if, when everyone else is suffering from some brand-new zero-day attack, your system were not only secure but still up and running?

More importantly, this paper teaches the value of experimentation with obviously crazy ideas. Even after seeing the talk and reading the paper and talking to the author, I still find it a little mind-boggling that failure-oblivious computing works. Even more fun is understanding why it works - a good reason to read the full paper yourself. I am certain that computers (and computer science) will continue to surprise us for many years to come.

[Do you have a favorite textbook or systems paper? Of course you do. Send your suggestions to:

    val dot henson at gmail dot com

Valerie Henson is a Linux kernel developer working for Intel. Her interests include file systems, networking, women in computing, and walking up and down large mountains. She is always looking for good systems programmers, so send her some email and introduce yourself.]



KHB: Failure-oblivious computing

Posted Jun 29, 2006 6:56 UTC (Thu) by nix (subscriber, #2304) [Link] (1 responses)

`A little mind-boggling'? I'll say. It sounds like a very good idea for a large class of applications, and for a large subset of things even within applications which might otherwise not seem to be candidates.

Thanks for drawing my attention to this, even if it *is* breaking my brain.

KHB: Failure-oblivious computing

Posted Jun 29, 2006 10:57 UTC (Thu) by kleptog (subscriber, #1183) [Link]

I agree, it seems just wrong somehow.

OTOH, I can see some useful applications. Imagine if in addition to hiding the error, it logged it somewhere. If the overhead isn't too great, you could just enable this on your server and scan for failure reports daily.

Most of the time people disable core dumps from broken programs because they take up too much space and they need processing to be useful. If just a small amount of specific logging were included, some simple tools could watch for repeated failures and even automatically generate bug reports. All without interrupting service. It almost seems the first step to self-fixing computers.
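As a sketch of that idea (names invented for illustration), the discarding check from the article only needs a counter and a log line on top, which the daily scan could then pick up:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

static unsigned long discarded_writes;   /* scanned by the daily report */

/* Same failure-oblivious write check, but each dropped write is
 * counted and logged with its call site instead of vanishing silently. */
static void checked_write_logged(char *buf, size_t len, size_t i,
                                 char v, const char *site)
{
    if (i < len) {
        buf[i] = v;              /* in bounds: normal store */
        return;
    }
    discarded_writes++;
    fprintf(stderr,
            "failure-oblivious: dropped write at %s (index %zu, size %zu)\n",
            site, i, len);
}
```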

My brain still hurts though, just thinking about it.

KHB: Failure-oblivious computing

Posted Jun 29, 2006 9:29 UTC (Thu) by job (guest, #670) [Link] (2 responses)

Isn't there a risk that this is effectively bug hiding, so no one will fix them?

KHB: Failure-oblivious computing

Posted Jun 30, 2006 9:22 UTC (Fri) by oak (guest, #2786) [Link]

Note that this is not a general solution for "fixing" buggy programs, but a way to increase reliability of programs in situations where:
  • Uptime / not crashing
  • (Performance / speed)
are more important than the program working correctly.

This might be the case where the handled data is either:

  • Redundant
  • Not written, just read and sent somewhere else (another machine or process)
  • You don't care about the data as much as the rest of the service ("good enough" data reliability is satisfactory)

Even in those kind of situations I would assume this feature to be enabled only after the software:

  • Development phase has ended and SW has been deployed in place(s) where it's hard to update (e.g. set-top boxes)
  • Has been pretty thoroughly tested in an environment where similar bugs cause the program to e.g. dump core

I would say that for this thing to be generally useful, the following should be possible:

  • Changing the program without re-compiling to terminate/dump core instead
  • This run-time configurability would still be fast enough
...as I'm pretty sure administrators will still want to be able to debug the problems they will encounter.

The more you value the data the program handles, the less you want it to continue after there's some problem in handling the data. Compare for example a program that manipulates / writes the same data / files constantly (e.g. database server) to a program that acts as a filter for a data that's different each time (e.g. mail server) or doesn't write it at all (e.g. www-server).

KHB: Failure-oblivious computing

Posted Jun 30, 2006 9:40 UTC (Fri) by oak (guest, #2786) [Link]

Btw, an example of a common open source library that by default doesn't
terminate an application that has an error is Glib (used by Gtk
and many other projects). By default it just logs Glib warnings
and critical errors to the console and lets the program hobble along.

In a GUI environment end-users don't see these messages as they don't
(usually) start programs from the console. These errors can be turned
into abort() (i.e. program termination) with an environment variable.

The Glib default behavior allows a program to corrupt its internal data
structures e.g. through double-frees. However, the Gnome apps have seemed
to work fairly OK even though the developers haven't always had time to
fix all of those errors, so I guess they had fixed the most problematic
ones before release. ;-)

KHB: Failure-oblivious computing

Posted Jun 29, 2006 11:03 UTC (Thu) by walterh (guest, #19113) [Link] (11 responses)

This seems to me to be a very dangerous idea. After all, we are talking about security critical applications here. The poster before me has already mentioned that papering over bugs will remove the incentive for the developers to fix them. But the problem is actually much more severe: Who says that, say, clipping a buffer that is being overrun by an attacker is a safe choice? This could just as well open new and hard to discover security holes.

Only to make an example, consider ACLs: Take a buffer that is supposed to contain a list of users who are to be denied access to a resource. If the buffer overflows and the program is terminated, then not much harm is done. If the buffer overflows and is clipped, then some people can access the resource that shouldn't.

Or, take the example mentioned in the article, pine: If I sent you a virus infected attachment with filename
loooooooooooooooooo...ooongname.txt.exe
which will overflow the (hypothetical) built in virus scanner in pine, so that it only sees the name
loooooooooooooooooo...ooongname.txt
then the file may go through unscanned. If the save-to-disk routine in pine uses a different buffer length, then the recipient could still save this as an executable.

Walter

KHB: Failure-oblivious computing

Posted Jun 29, 2006 17:19 UTC (Thu) by JoeBuck (subscriber, #2330) [Link] (5 responses)

But the current way that these things are being handled by hardening mechanisms is that the application is terminated (or a thread is terminated) on a bad memory access. That leaves sites vulnerable to denial-of-service attacks.

Certainly it's possible to augment the approach by adding logging, so the fact that an error occurred can be recorded.

KHB: Failure-oblivious computing

Posted Jun 29, 2006 17:48 UTC (Thu) by cventers (guest, #31465) [Link] (4 responses)

Are you responding to someone who is talking about the risk of
unauthorized access and suggesting that the current situation is worse
because it would be a denial of service attack?

This article is very interesting, but I can't say that the technique
described would be something I'd be at all comfortable using. One of the
biggest sources for bugs in software is unpredictable code paths and
input. This method takes the results of a broken assumption and breaks
more assumptions. While this may work OK in certain scenarios, I propose
something fundamentally better (and less expensive?)

What if an invalid memory access simply resulted in an /exception/ versus
a signal? Then I wrap my "parse this RFC2833 junk" function around a try
{} / catch block. If I got an invalid memory access trying to parse the
RFC2833, I write a detailed log entry (hell, if I were uber-clever, I
could use a special 'snapshot my memory' syscall to tell the kernel to
immediately mark my whole state as COW, so it can be dumped to disk. Now
you can do run-time core dumps and continue.). Then I simply deny service
to that operation. The bug is contained. If someone was trying to exploit
it, I've got their IP, and if I'm a programmer I've got a core. If I'm an
administrator, no users complained at all, or perhaps only one if it was
triggered accidentally.

The details, of course, lie in the implementation....

KHB: Failure-oblivious computing

Posted Jun 29, 2006 17:51 UTC (Thu) by cventers (guest, #31465) [Link]

To expand on this a bit, the author alluded to the better performance of
the failure-oblivious Apache due to its workers not dying. The fact that
Apache breaks itself down into workers in this way is an important
reliability feature of Apache -- one that is capable of somewhat avoiding
denial of service attacks.

KHB: Failure-oblivious computing

Posted Jun 29, 2006 17:53 UTC (Thu) by cventers (guest, #31465) [Link] (2 responses)

Silly me, this special syscall even already exists! How about stuffing
this in a SIGSEGV signal handler.

/* Dump some core. */
if (fork() == 0) raise(SIGABRT);

/* Magic to initiate an exception from whatever context we were running
in before our signal handler got called */

KHB: Failure-oblivious computing

Posted Jun 29, 2006 18:09 UTC (Thu) by cventers (guest, #31465) [Link] (1 responses)

I apologize to the readers and editors about the volume of my comments replying to myself, but I think we need a longjmp() that is safe to use as an exit from a signal handler. Then my above code could be:

void handle (int sig)
{
  jmp_buf buf;

  if (sig == SIGSEGV) {
     if (fork() == 0) raise(SIGABRT);
     pop_jmp_buf_from_tls_stack(&buf);
     longjmp_after_sig(&buf, sig);
  }
}

...then strategically place:

// Prepare an exception context
if (save_context_in_tls_stack() == SIGSEGV) {
   // operation failed
}

do_something_possibly_unsafe();

// Forget about that context
pop_jmp_buf_from_tls_stack(NULL);
Comments?

KHB: Failure-oblivious computing

Posted Jun 29, 2006 21:34 UTC (Thu) by nix (subscriber, #2304) [Link]

Throwing from certain signal handlers is safe in the presence of -fasynchronous-unwind-tables, but I sort of doubt that SIGSEGV is necessarily one of them (what if, e.g., the unwinder segfaults, something which is not unknown?)

Joe Buck would know for sure, I'd guess. :)

KHB: Failure-oblivious computing

Posted Jun 29, 2006 18:52 UTC (Thu) by evgeny (subscriber, #774) [Link] (4 responses)

> Who says that, say, clipping a buffer that is being overrun by an attacker is a safe choice?

If it is not, the program is severely broken in other way(s) as well, and this could be exploited without the buffer overrun in the first place; so what's your point?

> If I sent you a virus infected attachment with filename
> loooooooooooooooooo...ooongname.txt.exe

Same here. A virus checker that relies on a potentially malicious sender giving a proper file extension is a braindamaged piece of s**t.

KHB: Failure-oblivious computing

Posted Jul 4, 2006 8:42 UTC (Tue) by walterh (guest, #19113) [Link] (3 responses)

>> Who says that, say, clipping a buffer that
>> is being overrun by an attacker is a safe choice?
>
> If it is not, the program is severely broken in other way(s) as well,
> and this could be exploited without the buffer overrun in the first place;
> so what's your point?

My point is that clipping buffers is worse than just terminating the program -- and I gave examples why this is so. Your assertion that you can safely clip buffers is clearly wrong, as it is not hard to think of otherwise perfectly safe programs that become exploitable if you just clip a buffer somewhere.

KHB: Failure-oblivious computing

Posted Jul 4, 2006 9:02 UTC (Tue) by evgeny (subscriber, #774) [Link] (2 responses)

> it is not hard to think of otherwise perfectly safe programs that become exploitable if you just clip a buffer somewhere.

Might be; then do think a bit harder to come up with a reasonable example proving it; the ones you suggested are absolutely irrelevant. Let's take the first one: say, a buffer is defined as char buf[5] and the attacker manages to pass it "Hello, world!", causing an overrun. Now, the failure-oblivious runtime notices this and clips the string to just "Hell" (plus the terminating zero). You say when "Hell" propagates further, it might cause a compromise. It could, of course, but my point is that the attacker could send this "Hell" in the first place, and get a successful exploit anyway, whether the buffer overruns are clipped or not. Such things are called a failure to sanitize user input. Do you follow?

KHB: Failure-oblivious computing

Posted Jul 4, 2006 9:29 UTC (Tue) by walterh (guest, #19113) [Link] (1 responses)

> Might be; then do think a bit harder to come up with a reasonable example
> proving it; the ones you suggested are absolutely irrelevant.

Just to show you how relevant these examples are, think about the ACL example again: Say a server has a block list which consists of a dynamic part supplied by the IDS and a static part supplied by the administrator. Joe Hacker has been put on the static part by the administrator. To get around the block, he injects forged packets into the IDS to grow the dynamic block list to the point where the buffer for the combined block list overruns. Now, as the buffer contains just IP addresses this would result in a program crash enabling a denial of service attack. But, when the buffer just gets clipped so that the static part of the block list is gone, Joe Hacker can get access.

Or, another example: A PHP app checks user+password combinations with the following SQL command:

SELECT * FROM users WHERE user='%s' AND password='%s';

Of course, it will escape any special characters like ' in the input. Now Joe Hacker supplies
looooo....ongname'
as username. The ' at the end gets escaped to \'. Assume that due to some buffer magic the last character is clipped. Now the string is
looooo....ongname\
Choosing
or 1=1 limit 1;
for the password the SQL command now becomes

SELECT * FROM users WHERE user='looooo....ongname\' AND password=' or 1=1 limit 1;';

which will let Joe hacker in.

So you can see that it is easy to come up with examples where the technique advertised in the article creates gaping security holes.

To be honest, I think it is very worrying how some people here defend a broken idea. If those are the ones writing critical applications, then there is plenty of reason to be afraid.

KHB: Failure-oblivious computing

Posted Jul 4, 2006 11:11 UTC (Tue) by evgeny (subscriber, #774) [Link]

> Say a server has a block list which is consists of a dynamic part supplied by the IDS and a static part supplied by the administrator.

Well, any seasoned C programmer will notice that this is a very strange coding practice:
1) Why concatenate the lists instead of checking first the static and then the dynamic part?
2) If the concatenation is indeed necessary for some reason, one would do it the other way around, thus simplifying the compile-time optimization and run-time memory access patterns.

These alone would suggest the program is badly written and perhaps shouldn't be trusted anyway.

However, a more seriously wrong assumption you made is that, in the absence of buffer clipping, _any_ buffer overrun results in a SEGFAULT. Alas, this is NOT the case. Otherwise, the issue of buffer overruns would be considered only in the context of DoS. Instead, we're talking about executing practically anything, including shell access (which is arguably much worse than failing to throw a bad guy out right at the entrance). And even if that doesn't happen, an equivalent of the buffer clipping could take place. E.g.,

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
char buf[5];
sprintf(buf, "Hello, world!\n"); /* overruns buf */
printf(buf);
exit(0);
}

works nicely here (i.e., outputting the entire "Hello, world!\n" string), instead of segfaulting. However, were there enough of code between the sprintf() and printf() lines, the result could be very different, including in theory the clipping right after the "Hell" substring, or any other type of garbling.

> Or, another example: A PHP app

All strings in PHP are dynamic... You should use a different language.

> checks user+password combinations with the following SQL command:
>
> SELECT * FROM users WHERE user='%s' AND password='%s';

Oh, just let's not return to this deadly beaten horse. You can read e.g. this nice article (http://lwn.net/Articles/185813/) and the very informative comments to it to understand that constructing an SQL statement directly from the user input is 99.999% guaranteed to be vulnerable, even in languages that don't have the string buffer overrun problem.

Your examples above show signs of programming ignorance. On the contrary, buffer overruns are periodically spotted in very mature pieces of code, written by very respectable programmers. They're more often typos (or off-by-one miscalculations) than real mistakes. I wish libc had functions to deal with strings in a safe manner. sigh...

> To be honest, I think it is very worrying how some people here defend a broken idea.

Nobody said this is a silver bullet. As with practically any other technique, it has a restricted area of applicability. (E.g., as mentioned in the original article, it certainly shouldn't be used in scientific computing.)

KHB: Failure-oblivious computing

Posted Jun 30, 2006 11:34 UTC (Fri) by jzbiciak (guest, #5246) [Link]

Did anyone else envision the broken program as Mr. Magoo when reading this? You know, a well meaning program that bungles along because it can't see that it's doing so?

KHB: Failure-oblivious computing

Posted Jul 1, 2006 20:22 UTC (Sat) by freealter (guest, #4335) [Link]

Just to say that this article is excellent, informative, entertaining, really fine.

KHB: Failure-oblivious computing

Posted Jul 2, 2006 21:59 UTC (Sun) by ebiederm (subscriber, #35028) [Link]

Ugh. This is an old pattern and has been around for years. Typically
it is seen as ignoring the return values from your functions.

It is certainly worth having in the bag of tricks, but as others have
mentioned it doesn't encourage bugs to be fixed, and therefore problems
happen.

The only real way to cope with something like this and keep running
is to handle the errors. SIGSEGV and other signals can be caught.
C++-style exceptions can also be caught, allowing a case-by-case
decision to be made to ignore what is happening.


Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds