Heikki Linnakangas was doing some benchmarking last week and discovered something surprising: in some circumstances, unlogged tables were actually slower than permanent tables. Upon examination, he found that the problem was caused by CLOG contention, due to hint bits not being set soon enough. This leads to a few questions:
1. What is CLOG?
2. What are hint bits?
3. How does setting hint bits prevent CLOG contention?
4. Why weren't hint bits being set sooner?
Let's take those in order.
PostgreSQL uses a data structure called CLOG to keep track of which transactions have committed and which have aborted. For example, if a transaction inserts a large number of tuples and then aborts, we don't immediately get rid of those tuples; VACUUM will do it the next time it processes the table, which might not happen for some time, if there is little other write activity on that table. In the meantime, anyone who looks at those tuples needs to realize that they were inserted by an aborted transaction and are therefore not visible. The canonical source for this information is the CLOG, which stores two bits for each transaction in what is effectively a giant array, indicating whether the transaction is committed, aborted, or apparently still in progress. (The fourth state is called "sub-committed", and is used transiently when committing transactions which contain subtransactions.)
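To make that concrete, here's a minimal SQL sketch (the table name and row count are made up for illustration): the rows inserted by the rolled-back transaction are invisible to every subsequent query, but they physically remain in the table until VACUUM gets rid of them.

```sql
CREATE TABLE demo (id integer);

BEGIN;
INSERT INTO demo SELECT generate_series(1, 100000);
ROLLBACK;

-- The aborted tuples are invisible to every query...
SELECT count(*) FROM demo;                        -- returns 0
-- ...but they still occupy space until VACUUM removes them.
SELECT pg_size_pretty(pg_relation_size('demo'));  -- several MB
VACUUM demo;
```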
Unfortunately, access to CLOG is relatively expensive. A limited number of CLOG pages can be buffered in memory, but in the worst case accessing CLOG requires a read from disk. Furthermore, because every process in the system potentially needs to know the commit status of transactions which may well all have their status on the same page, lock contention can become a serious problem. To mitigate this problem, once a transaction is known to be committed or aborted, we start setting "hint bits" on the tuples touched by that transaction. For example, in the above-mentioned case of a transaction that inserts some tuples and aborts, once the transaction has aborted, the next processes that access those tuples will look up the transaction ID, see that it has aborted, and set hint bits on the tuples indicating that they were inserted by an aborted transaction. The next process to come along and examine those same tuples will know immediately that they aren't visible, without needing to consult CLOG. Similarly, if the transaction had committed, subsequent processes which examined those same tuples could set hint bits indicating that the insertion transaction committed, which would again save future visitors to the page the trouble of consulting CLOG.
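If you're curious, you can actually watch hint bits get set by using the pageinspect contrib module to decode tuple headers. This is just an illustrative sketch against a hypothetical table t; the relevant t_infomask flags are HEAP_XMIN_COMMITTED (0x0100), meaning the inserting transaction is known committed, and HEAP_XMIN_INVALID (0x0200), meaning it is known aborted.

```sql
CREATE EXTENSION pageinspect;  -- requires superuser

-- Decode the xmin hint bits for every tuple on block 0 of table t.
SELECT lp, t_xmin,
       (t_infomask & 256) <> 0 AS xmin_committed_hint,  -- 0x0100
       (t_infomask & 512) <> 0 AS xmin_aborted_hint     -- 0x0200
FROM heap_page_items(get_raw_page('t', 0));
```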
However, hint bits require some special handling when used in combination with another interesting PostgreSQL feature, asynchronous commit. For some applications, such as anything involving money, it's absolutely necessary that when a transaction commits, the commit is durably on disk and can't be lost even if a crash happens immediately afterward. But for other applications, such as monitoring data, it's often OK to lose a few seconds' worth of data in the event of a system crash; after all, had the crash happened slightly sooner, you would have lost that data anyway. In such cases, it's often possible to get a big performance boost by setting synchronous_commit=off. Instead of waiting for the commit record to be physically written to disk, we kick off the write in the background (making sure that it will complete within a few hundred milliseconds or so). If the system crashes during that time, the transaction will be lost, but otherwise it will be durably committed.
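At the SQL level this is just a setting; here's a sketch, with a hypothetical monitoring table, showing how to opt a single transaction out of synchronous commit:

```sql
CREATE TABLE monitoring_data (captured_at timestamptz, reading integer);

BEGIN;
SET LOCAL synchronous_commit = off;  -- applies to this transaction only
INSERT INTO monitoring_data VALUES (now(), 42);
COMMIT;  -- returns without waiting for the commit record to reach disk
```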
This design, although very clever and extremely useful in many real-world situations, is problematic for hint bits. Suppose we start setting hint bits for a transaction, saying that it was committed, and then the system crashes. When the system comes back up, the transaction is now considered aborted - but we've got hint bits out there claiming it's committed, so some parts of the transaction may appear committed while other parts appear aborted, or worse, some parts of the system may think the transaction is committed while other parts think it's aborted. That's unacceptable. So in the case of transactions that commit asynchronously, we have to wait until the commit record is durably on disk before we can start setting hint bits. That way, we know that we won't need to reverse course after the fact. In the meantime, that is to say, in the perhaps three-quarters of a second that we're waiting for the commit record to make it to disk, we'll have to forbear setting hint bits; each process that accesses those tuples will need to do its own CLOG lookup to check the transaction status.
That shouldn't be too expensive, right? Three quarters of a second is pretty short.
Well... as it turns out, not always. If those tuples are being updated sufficiently frequently by enough simultaneous processes, it gets pretty expensive. I found that in pgbench tests with scale factor 15, using unlogged tables, I could double performance at 32 concurrent clients by setting the hint bits immediately rather than waiting for the commit record to hit the disk. Fortunately, for unlogged and temporary tables, that is safe, since the entire contents of the table (including the bogus hint bits) will be lost anyway on a crash. My patch to do that is now committed, and (barring yet-to-be-discovered bugs) it will appear in PostgreSQL 9.2. If you're hitting this issue on PostgreSQL 9.1, a workaround is to reduce the value of wal_writer_delay, since that will reduce the amount of time it takes for the commit record to make it out to disk.
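For reference, here's roughly what that workaround looks like. wal_writer_delay can't be changed with a plain SET, so it has to go into postgresql.conf; the 10ms value below is just an illustration, and the right value depends on your workload:

```sql
-- In postgresql.conf:
--     wal_writer_delay = 10ms    # default is 200ms
-- Then reload the configuration and verify:
SELECT pg_reload_conf();
SHOW wal_writer_delay;
```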
This problem is particularly acute for unlogged tables because, beginning in PostgreSQL 9.1, any transactions that touch only temporary or unlogged tables always use asynchronous commit, so you can hit this case there even if you have configured synchronous_commit=on. However, if you've configured synchronous_commit=off, you can also encounter this problem with permanent tables. In many cases, the performance benefit you get from not waiting for the commit record to be flushed all the way out to disk will vastly outweigh the downsides of waiting slightly longer to set hint bits - but it would be nice to get the best of both worlds.
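In case you haven't used them, unlogged tables are just ordinary tables declared with an extra keyword; a quick sketch (the table name is hypothetical):

```sql
-- No WAL is written for this table's contents, so it is emptied after
-- a crash - and, in 9.1, transactions that touch only tables like this
-- one always commit asynchronously, whatever synchronous_commit says.
CREATE UNLOGGED TABLE metrics (captured_at timestamptz, reading integer);
```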
I'm going to guess that this would be too hard to co-ordinate internally, but it would be great if we could decide on a table-by-table basis how much data we'd be willing to risk in a crash. For many projects there are certain tables, orders for example, where I'd want synchronous commits, but within the same database there is CMS data where I'd be fine with an asynchronous commit after a couple of seconds. Equally, there may be statistical information where I'd be happy to risk 30+ seconds of data if it helped performance.
Would such flexibility be either worthwhile or possible in a future version of Postgres?
Chris, I've thought about that.
You can already decide the level of durability you want on a per-transaction, per-session, per-user, or per-database basis using SET LOCAL, SET, ALTER ROLE .. SET, or ALTER DATABASE .. SET to adjust synchronous_commit.
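For example (the role and database names here are hypothetical):

```sql
BEGIN;
SET LOCAL synchronous_commit = off;  -- this transaction only
COMMIT;

SET synchronous_commit = off;        -- the rest of this session

ALTER ROLE metrics_writer SET synchronous_commit = off;  -- per user
ALTER DATABASE metrics_db SET synchronous_commit = off;  -- per database
```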
Adding a per-table option would possibly make this more automated. Instead of making it the transaction's job to know what kind of durability it wants, the durability policy would be built into the database: if you've touched only async-commit-ok tables, then you automatically get async commit; otherwise, you get whatever synchronous_commit provides for.
But since this isn't really adding any fundamentally new capability, it's not clear whether it's worth doing. The bookkeeping would have some (probably trivial) overhead even for people who didn't use the feature, so we have to ask whether it would get enough use to justify that overhead, and the time it would take to implement it.
I had the same thought while reading your article. It seems to me that this information does belong in the database; for example, some unlogged table might have a trigger that in some cases writes stuff to a permanent table, turning an apparently innocuous write to an unlogged table into something that requires a sync commit. It doesn't seem reasonable to expect the application developer to cope with this sort of situation.
Hi Robert,
Thanks for all the detail. It's great that it can be set on a per transaction basis, but I personally think it would still be a really nice feature if we could set it on a per table basis.
As with foreign keys and referential integrity, this may be something that can be done application-side, but in my view it is always far better to enforce within the DB itself. So if you generally use asynchronous transactions but occasionally want to enforce that updates are handled synchronously, then ideally the DB should be able to do this for you.
That way if you have multiple applications or programs that access the same database then you can enforce the same rules across the board. Likewise a change in policy can be implemented once in one place rather than worrying about all the places throughout the code base that need to be kept in sync.
Completely understand if this is a low priority though, or if the performance trade off isn't worth the extra functionality.