Thursday, July 22, 2010

Best Patches of 9.1CF1

Although PostgreSQL 9.0 isn't out yet, we began the first CommitFest for PostgreSQL 9.1 development on July 15, 2010. Our goal is to review every patch submitted by that date before August 15. While we're only a week into the CommitFest, I already have some favorite patches. None of them has been committed yet, so they might still die, be withdrawn, or change substantially - but here are my top picks.

1. Simon Riggs wrote a very nice patch to reduce the lock level required for various DDL statements. We haven't yet come up with clearly workable ideas for allowing multiple DDL statements to execute on the same table at the same time, but what this patch will do is allow certain DDL commands to run in parallel with DML. Some versions of ALTER TABLE will lock out everything (as they all do, presently), some will lock out INSERT/UPDATE/DELETE/VACUUM statements but allow SELECT to run in parallel, and some will only lock out concurrent DDL and VACUUM operations (like ALTER TABLE ... SET WITHOUT CLUSTER). This should be really nice for anyone administering a busy database.
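To make that concrete, here's a rough sketch of the kind of thing that should become possible (the accounts table is invented, and the exact lock level each command ends up with depends on the final form of the patch):

-- session 1: a long-running report holds an ACCESS SHARE lock
SELECT count(*) FROM accounts;

-- session 2: today this takes ACCESS EXCLUSIVE and waits for session 1;
-- with the patch, it should only block concurrent DDL and VACUUM
ALTER TABLE accounts SET WITHOUT CLUSTER;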

2. My employer, EnterpriseDB, asked me to write a patch that reduces the size of most numeric values on disk. This was based on a design proposal from Tom Lane a few years ago, and turned out to be pretty simple to code up. Currently, our numeric data type always has a 4-byte header specifying the weight of the first digit and display scale. For the values people typically do store, that's overkill. This patch allows a 2-byte header to be used opportunistically, when we can cram everything in; but the old format can still be understood, so it doesn't break pg_upgrade. It'll be interesting to see whether people notice a measurable reduction in their on-disk footprint once this patch is in use. And maybe we could even get by with a 1-byte header sometimes... but that's a thought for another patch.
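You can already check how many bytes a given numeric value occupies with pg_column_size(); comparing the results before and after the patch should show typical values shrinking by a couple of bytes (the exact numbers will vary):

SELECT pg_column_size(12.34::numeric)            AS small_value,
       pg_column_size(1234567890.12345::numeric) AS larger_value;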

3. Kevin Grittner posted a patch to implement true serializability. I haven't studied the code in detail, and I'm not sure how soon we can hope to see this committed, but it's pretty cool. Our current serialization techniques are pretty good, but this should be a whole lot better for anyone whose application logic relies heavily on the absence of serialization anomalies.
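Here's the classic sort of write-skew anomaly that our current SERIALIZABLE mode (which is really snapshot isolation) permits and that Kevin's patch aims to prevent; the on_call table is just an invented example:

-- Two concurrent transactions, both declared SERIALIZABLE.
-- Each checks that two doctors are on duty, then takes one off duty.

-- session 1
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM on_call WHERE on_duty;            -- sees 2
UPDATE on_call SET on_duty = false WHERE doctor = 'alice';
COMMIT;

-- session 2, running at the same time
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM on_call WHERE on_duty;            -- also sees 2
UPDATE on_call SET on_duty = false WHERE doctor = 'bob';
COMMIT;

-- Under snapshot isolation both commit and nobody is left on duty;
-- under true serializability, one of the two would be rolled back.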

Wednesday, July 07, 2010

Distributed Serialization Anomalies

One of the more difficult responsibilities of a database is to provide you with the illusion that transactions on the system are executed sequentially, one after another, while in fact allowing as much parallelism as possible. PostgreSQL's MVCC implementation does this using "snapshots": each statement (or, if you choose the serializable isolation level, each transaction), upon first access to the database, records which transactions have committed as of that moment, and everything it does afterwards will see the effect of those transactions, but not any transactions committed later. (There are some exceptions to this rule when using READ COMMITTED mode with INSERT, UPDATE, or DELETE statements.)
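If you want to see what a snapshot actually looks like, txid_current_snapshot() will show you one (the numbers below are made up):

BEGIN ISOLATION LEVEL SERIALIZABLE;   -- take one snapshot for the whole transaction
SELECT txid_current_snapshot();       -- e.g. 1000:1005:1001,1003
-- format is xmin : xmax : transactions still in progress; every statement in
-- this transaction sees exactly the transactions that had committed when the
-- snapshot was taken
COMMIT;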

This produces, more or less, the illusion that SQL statements execute sequentially, with each one completing its work before the next one begins. This illusion is extremely important and valuable in many real-world applications. For example, if you transfer money from your checking account to your savings account, a banking application might insert two new rows into the "banking_transactions" table: one to show the debit from checking, and another to show the credit to savings. It wouldn't be good if some other query saw just one of these two new rows: it would look as if the money disappeared from checking but did not appear in savings, or perhaps as if it had appeared in savings without disappearing from checking. You'd be unhappy about the first scenario, and the bank would be unhappy about the second one. This type of scenario is called a serialization anomaly, and databases are responsible for preventing them. In this case, it's pretty easy to make sure this problem can't happen: just do both inserts within a single transaction, and then commit it.
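In SQL terms, the fix is as simple as it sounds (the column names here are invented for the example):

BEGIN;
INSERT INTO banking_transactions (account_id, amount, description)
    VALUES (12345, -500.00, 'transfer to savings');
INSERT INTO banking_transactions (account_id, amount, description)
    VALUES (67890,  500.00, 'transfer from checking');
COMMIT;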

Things get a little trickier when there's more than one database involved. Suppose that I'm moving money from my account (at one branch) to my friend Magnus's account (at a different branch of the same bank). As before, we must make two transaction entries: one showing the debit to my account, and the other showing the credit to his account. We can start transactions on both nodes and do the inserts, but it's not possible to commit both transactions at the very same instant: there could always be a crash after one transaction commits, but before the other one commits.

We can work around this problem to some extent using a protocol called two-phase commit: we'll issue a "PREPARE TRANSACTION" command in both transactions, which should be enough to guarantee that a subsequent "COMMIT PREPARED" command, even after an intervening crash, has no chance of failure. So, we start a transaction on each database, do an insert on each database, prepare both transactions, and then commit both transactions. If there's a crash (or loss of connectivity) after either transaction is prepared but before both transactions are committed, we can still get things back to a consistent state once things are back up again. How? We look to see if either transaction committed; if so, we commit the other one. If not, we see whether both transactions were successfully prepared; if so, we can commit or abort both; if not, we must abort both.
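Using PostgreSQL's built-in two-phase commit support, the dance looks roughly like this (max_prepared_transactions must be nonzero; the transaction identifier and column names are arbitrary):

-- on the database at my branch
BEGIN;
INSERT INTO banking_transactions (account_id, amount, description)
    VALUES (12345, -500.00, 'transfer to Magnus');
PREPARE TRANSACTION 'xfer_0042';

-- on the database at Magnus's branch
BEGIN;
INSERT INTO banking_transactions (account_id, amount, description)
    VALUES (67890, 500.00, 'transfer from Robert');
PREPARE TRANSACTION 'xfer_0042';

-- once both PREPAREs have succeeded, finish up on each database
COMMIT PREPARED 'xfer_0042';

-- after a crash, pg_prepared_xacts shows what's still pending, and we issue
-- COMMIT PREPARED or ROLLBACK PREPARED following the rules described above
SELECT gid, prepared, database FROM pg_prepared_xacts;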

This solves the problem of making sure that no money can be permanently lost (or created), but there will still be a period of time during which we can see inconsistent views of the system as a whole. Imagine that the bank auditor comes along and runs a report across all bank branches adding up the bank's assets and liabilities. It's possible that he'll query one of the two databases involved in our hypothetical funds transfer before the transaction commits on that node, but by the time he visits the other one, it's committed - therefore he'll see the transferred funds either in both accounts, or in neither one, depending on the order in which he hits the different branches. This is a distributed serialization anomaly.

Distributed serialization anomalies are much harder to avoid than regular serialization anomalies (which are a hard problem all by themselves). One method - which is used by Postgres-XC - is to have a single authority (which Postgres-XC calls a global transaction manager) which hands out snapshots and transaction IDs across all nodes in the cluster; regrettably, there is a potential for this to become a bottleneck, or a single point of failure (see Postgres-XC_Write-Scalable_Cluster.pptx, slides 10 and following).

Unfortunately, there may not be many good alternatives. There is a technology called commitment ordering which seems to have a long paper trail[1] in the academic literature, and which has been studied in relation to MVCC. The good news is that commitment ordering does not require a global coordinator of any kind; each node operates independently and does not even need to know the identities of the other nodes, or even how many exist. It requires no additional communication of any kind. The bad news is that it operates by aborting potentially problematic transactions, and it might end up aborting quite a lot of them. The rule is simply that the serialization order must match the commit order: if transaction A reads x and writes y, transaction B then reads y, and transaction A commits first, the system will abort B (because there could be a read-write dependency cycle between A and B involving another database).

Another alternative is to build up a commit-order dependency graph that spans all the databases involved in the transaction. That is, we imagine a graph with each unaborted transaction as a vertex. If A reads or updates a row and B subsequently updates it, we add an edge from A to B. If A updates a row and B subsequently reads the updated version (or a later version), we also add an edge from A to B. If, at any time, adding an edge to the graph would create a cycle, we abort one of the constituent transactions. Kevin Grittner and Emmanuel Cecchet pointed out a paper by Michael Cahill on this topic[2]; one of the advantages of this approach is that it is possible to prevent all serialization anomalies, which our current approach does not. Kevin and Dan Ports have proposed a patch for 9.1 which would implement true serializability for a single PostgreSQL database, but it's not clear that this would scale well to a distributed system.

[1] e.g. The Principle of Commitment Ordering, or Guaranteeing Serializability in a Heterogeneous Environment of Multiple Autonomous Resource-Managers, Yoav Raz, 1990 [PDF].
[2] Serializable Isolation for Snapshot Databases, Michael J. Cahill, Uwe Röhm, Alan D. Fekete, 2006 [PDF].

Monday, July 05, 2010

Concurrent Development

PostgreSQL 9.0 beta 3 will be wrapped in the next few days, and at the same time, we'll be branching the tree to begin 9.1 development. This is a new thing for us. In the past, we've waited until the previous release was shipped before opening the tree to new development. However, at the PGCon 2010 development meeting, we decided to try something different this time.

I believe that the primary motivation for this change was that, as we get closer to release, there are fewer and fewer issues to work on, and fewer and fewer people who can be involved in fixing them. So, waiting until release to branch the tree leaves a substantial portion of the developer community sitting idle. A second advantage is that it shortens the time between releases - our tentative plan is to use the same release schedule for 9.1 that we did for 9.0. The first CommitFest for 9.0 began on July 15, 2009, and the first CommitFest for 9.1 will begin on July 15, 2010; the last CommitFest for 9.0 began on January 15, 2010, and the last CommitFest for 9.1 will begin on January 15, 2011. Of course, the actual release date will almost certainly be different, but the plan is for feature freeze to happen about the same time next year that it did this year, so that we can continue to have releases about a year apart.

Of course, the danger of concurrent development is that the work people are doing for 9.1 may distract us from finishing 9.0. Hopefully that won't happen, because I think there is a lot to like about the new process.

Thursday, June 24, 2010

PostgreSQL as an In-Memory Only Database

There's been some recent, interesting discussion on the pgsql-performance mailing list on using PostgreSQL as an in-memory only database. In other words, you basically want to use it as a cache, similar to the way that you would use memcached or a NoSQL solution, but with a lot more features.

If you're interested in doing this, you'll need to configure the system so that you have a convenient, automatic way to erase the database cluster and reinitialize it (using initdb) after an operating system crash. Per discussion on the mailing list, for best performance, it seems best to set up the data directory on a tmpfs and configure the following parameters in postgresql.conf:

fsync=off
synchronous_commit=off
full_page_writes=off
bgwriter_lru_maxpages=0

With fsync=off, and most likely also with full_page_writes=off, your database will not be crash-safe - but you don't care, because you're planning to start from scratch after a crash anyway. If you're familiar with postgresql.conf parameters, setting synchronous_commit=off might seem redundant if you've already set fsync=off, but testing reveals that it still boosts performance. Turning off full_page_writes and bgwriter_lru_maxpages eliminates I/O that isn't needed for this use case.
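A quick way to double-check that the settings took effect once the server is running:

SELECT name, setting
  FROM pg_settings
 WHERE name IN ('fsync', 'synchronous_commit',
                'full_page_writes', 'bgwriter_lru_maxpages');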

On a related note, Gavin Roy gave a talk at PGCon comparing the performance of PostgreSQL with fsync=off with a number of NoSQL databases. The results were pretty good, but there might even be room for improvement with some additional tuning.

If you end up testing a configuration along these lines, please post a comment here or on the pgsql-performance mailing list with your experiences.

Wednesday, June 23, 2010

Working Toward PostgreSQL 9.0 Beta3

We are gradually creeping toward the release of PostgreSQL 9.0, but there's still a ways to go. We're continuing to get several bug reports per week about problems in 9.0 beta 2 - many thanks to all those who have tested and reported bugs. Here are some of the issues we've resolved in CVS since beta2:

- Fix a problem that could cause checkpoints to happen too infrequently when using streaming replication, with certain combinations of settings (Fujii Masao).
- Fix quoting problems in EXPLAIN (FORMAT YAML) output (Dean Rasheed).
- Fix a problem that could cause a "cache lookup failed for type %d" error when using PL/python (Tom Lane).
- Change pg_last_xlog_receive_location() and pg_last_xlog_replay_location() to return NULL instead of 0/0 when they do not apply (Heikki Linnakangas).
- Rename restartpoint_command to archive_cleanup_command, to clarify what it's for (Itagaki Takahiro).
- Allow replication connections to use a "replication" entry in .pgpass (Fujii Masao).
- Fix the newly added vacuumdb -Z option (Josh Berkus).
- Have pg_upgrade create its output files in a less surprising location (Bruce Momjian).
- Fix ALTER LARGE OBJECT and GRANT ... ON LARGE OBJECT to not break when an OID too large to be represented by a signed integer is used (Robert Haas).
- Avoid entering a tight loop when streaming replication hits a corrupt WAL record (Heikki Linnakangas).
- New contrib module for use as an archive_cleanup_command (Simon Riggs).
- Adjust GUC categories to match the docs (Itagaki Takahiro).
- Deprecate the use of => as an operator name, and remove or rename the new => operators in 9.0, so we can eventually use this for the purpose the SQL standards committee has in mind (Robert Haas).
- Revert apparently-useless code to add symbol table entries for PL/perl functions (Andrew Dunstan).
- Avoid calling malloc(0) in pg_upgrade (Bruce Momjian).
- Don't allow WAL to be streamed to the standby until it has been fsync'd on the master - otherwise, a master crash can effectively corrupt the standby database (Fujii Masao).
- Fix pg_upgrade problems on Win32 (Bruce Momjian).
- Various documentation improvements.

If you haven't yet beta-tested PostgreSQL 9.0, there's no time like the present! Ideally, load up your production data, run your production application against it, and let us know whether anything breaks. Or, pull the plug a few times and see if anything goes wrong; or try to break Streaming Replication or Hot Standby. This is shaping up to be a great release - but it's not done yet.

Sunday, June 13, 2010

Making PostgreSQL Faster

Although we've made great progress in speeding up PostgreSQL over the last few years, there's always more to be done. Performance, with PostgreSQL as with any other database, is largely determined by the availability of three resources: CPU, memory, and disk. What could we do to use each of these resources most efficiently?

PostgreSQL is already pretty efficient at using the CPU. For high-concurrency databases, I don't anticipate that things will get much better than they already are. For low-concurrency databases, we need parallel query - that is, the ability to use more than one CPU to process the same query.

Memory is a little bit more of a problem. We do a good job keeping our memory footprint small, but we don't manage it terribly well. work_mem limits the maximum size of a sort or hash, but takes no account of current conditions: if the system is swapping due to memory pressure, you get the same plan as if the system has 40GB of free memory. And all the memory allocated to shared_buffers remains allocated even when it isn't truly needed.
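For now, the workaround is manual: work_mem can be raised for an individual session or query when you know the system has memory to spare (the table and values below are purely illustrative):

SET work_mem = '256MB';   -- applies per sort or hash node, per backend
SELECT customer_id, sum(amount)
  FROM orders
 GROUP BY customer_id
 ORDER BY sum(amount) DESC;
RESET work_mem;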

I/O is perhaps the biggest problem. I don't think this problem is unique to PostgreSQL - I believe all databases probably share this pain point to some degree. Disks are slow. With respect to PostgreSQL specifically, there are a number of things we need to do to reduce the amount of I/O we perform, including index-only scans and further improvements to VACUUM. Partial vacuum (implemented in 8.4) is a pretty big deal, but there's more that needs to be done.

We also need to put more effort into shrinking our on-disk format and reducing WAL volume. The actual disk space is cheap, but the time needed to read and write a larger volume of data hurts performance.

Tuesday, June 08, 2010

Why Join Removal Is Cool

When people talk to me about the (limited implementation of) join removal that will be part of PostgreSQL 9.0, the conversation usually goes one of two ways. Some people ask how the feature works and then say something like "oh, I guess that could be useful every once in a while". Other people already know exactly how the feature works and usually say some variant of "this is an amazingly wonderful feature that I am looking forward to with great enthusiasm".

The difference between these two groups of people (I think) is not so much their level of technical knowledge or how closely they've been following pgsql-hackers, but their use case. If your database is primarily a data warehouse, my guess is that you won't have many occasions to benefit from join removal. Where this feature really comes in handy is in OLTP workloads with highly normalized data, in situations where users are generating queries against views (perhaps through some sort of reporting interface) and expecting to get results back immediately.

Let's take an example. Suppose you're writing a bug-tracking system. Each bug has a number of properties associated with it: who reported it, who's working on it, current status, priority, date opened, release for which it's slated to be fixed, date of last status change, date resolved, people who want to get an email when it's updated, comments, etc. If like me you're a big fan of database normalization, you'll not want to store all of these as text fields. So you might end up with a table like this:

CREATE TABLE bug (
    id serial,
    reporter_id integer not null references person (id),
    assigned_to_id integer references person (id),
    status_id integer not null references bug_status (id),
    priority_id integer not null references priority (id),
    target_release_id integer references release (id),
    open_date date not null,
    ...
    primary key (id)
);


You'll probably also end up with some supplementary tables for the items that can exist multiple times, like bug_comment and bug_watchers. Now, to make reporting easier, you'll probably want to define a view over the bug table that joins to all the other tables, so that it's easy to get the text values for the reporter, status, etc.

CREATE VIEW bug_view AS
SELECT
    b.id,
    b.reporter_id, r.name AS reporter,
    b.assigned_to_id, at.name AS assigned_to,
    b.status_id, s.name AS status,
    b.priority_id, p.name AS priority,
    b.target_release_id, tr.name AS target_release,
    b.open_date,
    ...
FROM
    bug b
    JOIN person r ON b.reporter_id = r.id
    JOIN bug_status s ON b.status_id = s.id
    JOIN priority p ON b.priority_id = p.id
    LEFT JOIN person at ON b.assigned_to_id = at.id
    LEFT JOIN release tr ON b.target_release_id = tr.id;

And now you can pretty easily write an engine that will let users select the columns they'd like to see from bug_view and the filter conditions they'd like to apply (only open bugs, only bugs slated to be resolved in release X, etc.) via a spiffy web interface. Note that the reporter, bug status, and priority fields can't be null, so we can use a plain old JOIN, but the bug might be assigned to no one or have no target release, so we use LEFT JOIN in those cases. (Otherwise, rows where those fields were NULL would not appear in the output.)

Over time, you'll tend to add more fields. Scalar fields like open_date don't cause much heartache, but as you add more fields that require joins, your view will tend to slow down. Some people might say that the answer is simply to denormalize - use natural keys in the bug table, and don't join. While that solution may be appropriate for some people, it is not without its downsides: database normalization was invented for a reason. The good news is that PostgreSQL is fast and has an excellent query planner, so even fairly complex queries run quite quickly. The bad news is that every query against the view is going to hit every table that's part of the view definition, so if you add enough of them, it's eventually going to be slow.

And, realistically, most of the time, users aren't going to want all the columns anyway. In a web application, 8-12 columns of output in an HTML table is typically about as much as you can squeeze in without starting to have a lot of line wrapping issues. This leads pretty naturally to the following question: if you don't need all of the columns, can you skip some of those joins and speed up the query?

Yes. In PostgreSQL 9.0, we can drop a join against a base table if (1) it's a left join, (2) there is a unique index on all or a subset of the join columns, and (3) none of the attributes from the nullable side of the join are used elsewhere in the query. So, in the above example, we could skip the joins to person at or release tr if the assigned_to or target_release columns, respectively, are not selected, assuming those tables have unique indexes on their id columns (if they don't, the join might change the number of rows in the output, so we must perform it).
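For example, the query below never touches assigned_to or target_release, so in 9.0 the planner should drop those two LEFT JOINs entirely (assuming unique indexes on person.id and release.id); you can verify it with EXPLAIN. The 'Open' filter is just an example value:

EXPLAIN
SELECT id, reporter, status, priority, open_date
  FROM bug_view
 WHERE status = 'Open';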

We can't skip joining to any of the other tables, because those are inner joins. That's an implementation restriction which I hope will be lifted in PostgreSQL 9.1, but some more logic is needed to make that safe. In the meantime, a useful workaround may be to write those joins as LEFT JOINs rather than INNER JOINs, in cases where either join type will produce the same results.