Thursday, June 24, 2010
Using PostgreSQL as an In-Memory Only Database
There's been some recent, interesting discussion on the pgsql-performance mailing list on using PostgreSQL as an in-memory only database. In other words, you basically want to use it as a cache, similar to the way that you would use memcached or a NoSQL solution, but with a lot more features.
If you're interested in doing this, you'll need to configure the system so that you have a convenient, automatic way to erase the database cluster and reinitialize it (using initdb) after an operating system crash. Per discussion on the mailing list, for best performance, it seems best to set up the data directory on a tmpfs and configure the following parameters in postgresql.conf:
fsync=off
synchronous_commit=off
full_page_writes=off
bgwriter_lru_maxpages=0
With fsync=off, and most likely also with full_page_writes=off, your database will not be crash-safe - but you don't care, because you're planning to start from scratch after a crash anyway. If you're familiar with postgresql.conf parameters, setting synchronous_commit=off might seem redundant when fsync=off is already set, but testing reveals that it still boosts performance. Turning off full_page_writes and setting bgwriter_lru_maxpages to zero eliminates I/O that isn't needed for this use case.
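If you want to double-check that the settings took effect once the server is up, a quick query against pg_settings will do it:
SELECT name, setting
FROM pg_settings
WHERE name IN ('fsync', 'synchronous_commit',
               'full_page_writes', 'bgwriter_lru_maxpages');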
On a related note, Gavin Roy gave a talk at PGCon comparing the performance of PostgreSQL with fsync=off with a number of NoSQL databases. The results were pretty good, but there might even be room for improvement with some additional tuning.
If you end up testing a configuration along these lines, please post a comment here or on the pgsql-performance mailing list with your experiences.
Wednesday, June 23, 2010
Working Toward PostgreSQL 9.0 Beta3
We are gradually creeping toward the release of PostgreSQL 9.0, but there's still a ways to go. We're continuing to get several bug reports per week about problems in 9.0 beta 2 - many thanks to all those who have tested and reported bugs. Here are some of the issues we've resolved in CVS since beta2:
- Fix a problem that could cause checkpoints to happen too infrequently when using streaming replication, with certain combinations of settings (Fujii Masao).
- Fix quoting problems in EXPLAIN (FORMAT YAML) output (Dean Rasheed).
- Fix a problem that could cause a "cache lookup failed for type %d" error when using PL/python (Tom Lane).
- Change pg_last_xlog_receive_location() and pg_last_xlog_replay_location() to return NULL instead of 0/0 when they do not apply (Heikki Linnakangas).
- Rename restartpoint_command to archive_cleanup_command, to clarify what it's for (Itagaki Takahiro).
- Allow replication connections to use a "replication" entry in .pgpass (Fujii Masao).
- Fix the newly added vacuumdb -Z option (Josh Berkus).
- Have pg_upgrade create its output files in a less surprising location (Bruce Momjian).
- Fix ALTER LARGE OBJECT and GRANT ... ON LARGE OBJECT to not break when an OID too large to be represented by a signed integer is used (Robert Haas).
- Avoid entering a tight loop when streaming replication hits a corrupt WAL record (Heikki Linnakangas).
- New contrib module for use as an archive_cleanup_command (Simon Riggs).
- Adjust GUC categories to match the docs (Itagaki Takahiro).
- Deprecate the use of => as an operator name, and remove or rename the new => operators in 9.0, so we can eventually use this for the purpose the SQL standards committee has in mind (Robert Haas).
- Revert apparently-useless code to add symbol table entries for PL/perl functions (Andrew Dunstan).
- Avoid calling malloc(0) in pg_upgrade (Bruce Momjian).
- Don't allow WAL to be streamed to the standby until it has been fsync'd on the master - otherwise, a master crash can effectively corrupt the standby database (Fujii Masao).
- Fix pg_upgrade problems on Win32 (Bruce Momjian).
- Various documentation improvements.
If you haven't yet beta-tested PostgreSQL 9.0, there's no time like the present! Ideally, load up your production data, run your production application against it, and let us know whether anything breaks. Or, pull the plug a few times and see if anything goes wrong; or try to break Streaming Replication or Hot Standby. This is shaping up to be a great release - but it's not done yet.
Sunday, June 13, 2010
Making PostgreSQL Faster
Although we've made great progress in speeding up PostgreSQL over the last few years, there's always more to be done. Performance, with PostgreSQL as with any other database, is largely determined by the availability of three resources: CPU, memory, and disk. What could we do to use each of these resources most efficiently?
PostgreSQL is already pretty efficient at using the CPU. For high-concurrency databases, I don't anticipate that things will get much better than they already are. For low-concurrency databases, we need parallel query - that is, the ability to use more than one CPU to process the same query.
Memory is a little bit more of a problem. We do a good job keeping our memory footprint small, but we don't manage it terribly well. work_mem limits the maximum size of a sort or hash, but takes no account of current conditions: if the system is swapping due to memory pressure, you get the same plan as if the system has 40GB of free memory. And all the memory allocated to shared_buffers remains allocated even when it isn't truly needed.
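To make that concrete, here's a sketch (big_table and some_column are hypothetical names) showing that the executor's memory decisions follow work_mem and nothing else - they don't change based on how much memory the machine actually has free at that moment:
-- With a small work_mem, EXPLAIN ANALYZE will typically report the sort
-- spilling to disk ("Sort Method: external merge"); with a large one, it
-- stays in memory ("quicksort") - regardless of actual memory pressure.
SET work_mem = '4MB';
EXPLAIN ANALYZE SELECT * FROM big_table ORDER BY some_column;
SET work_mem = '512MB';
EXPLAIN ANALYZE SELECT * FROM big_table ORDER BY some_column;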
I/O is perhaps the biggest problem. I don't think this problem is unique to PostgreSQL - I believe all databases probably share this pain point to some degree. Disks are slow. With respect to PostgreSQL specifically, there are a number of things we need to do to minimize our I/O bandwidth, including index-only scans and further improvements to VACUUM. Partial vacuum (implemented in 8.4) is a pretty big deal, but there's more that needs to be done.
We also need to put more effort into minimizing our on-disk format and WAL volume. The actual disk space is cheap, but the time needed to read and write a larger volume of data hurts performance.
Tuesday, June 08, 2010
Why Join Removal Is Cool
When people talk to me about the (limited implementation of) join removal that will be part of PostgreSQL 9.0, the conversation usually goes in two ways. Some people ask how the feature works and then say something like "oh, I guess that could be useful every once in a while". Other people already know exactly how the feature works and usually say some variant of "this is an amazingly wonderful feature that I am looking forward to with great enthusiasm".
The difference between these two groups of people (I think) is not so much their level of technical knowledge or how closely they've been following pgsql-hackers, but their use case. If your database is primarily a data warehouse, my guess is that you won't have many occasions to benefit from join removal. Where this feature really comes in handy is in OLTP workloads with highly normalized data, in situations where users are generating queries against views (perhaps through some sort of reporting interface) and expecting to get results back immediately.
Let's take an example. Suppose you're writing a bug-tracking system. Each bug has a number of properties associated with it: who reported it, who's working on it, current status, priority, date opened, release for which it's slated to be fixed, date of last status change, date resolved, people who want to get an email when it's updated, comments, etc. If like me you're a big fan of database normalization, you'll not want to store all of these as text fields. So you might end up with a table like this:
CREATE TABLE bug (
id serial,
reporter_id integer not null references person (id),
assigned_to_id integer references person (id),
status_id integer not null references bug_status (id),
priority_id integer not null references priority (id),
target_release_id integer references release (id),
open_date date not null,
...
primary key (id)
);
You'll probably also end up with some supplementary tables for the items that can exist multiple times, like bug_comment and bug_watchers. Now, to make reporting easier, you'll probably want to define a view over the bug table that joins to all the other tables, so that it's easy to get the text values for the reporter, status, etc.
CREATE VIEW bug_view AS
SELECT
b.id,
b.reporter_id, r.name AS reporter,
b.assigned_to_id, at.name AS assigned_to,
b.status_id, s.name AS status,
b.priority_id, p.name AS priority,
b.target_release_id, tr.name AS target_release,
b.open_date,
...
FROM
bug b
JOIN person r ON b.reporter_id = r.id
JOIN bug_status s ON b.status_id = s.id
JOIN priority p ON b.priority_id = p.id
LEFT JOIN person at ON b.assigned_to_id = at.id
LEFT JOIN release tr ON b.target_release_id = tr.id;
And now you can pretty easily write an engine that will let users select the columns they'd like to see from bug_view and the filter conditions they'd like to apply (only open bugs, only bugs slated to be resolved in release X, etc.) via a spiffy web interface. Note that the reporter, bug status, and priority fields can't be null, so we can use a plain old JOIN, but the bug might be assigned to no one or have no target release, so we use LEFT JOIN in those cases. (Otherwise, rows where those fields were NULL would not appear in the output.)
Over time, you'll tend to add more fields. Scalar fields like open_date don't cause much heartache, but as you add more fields that require joins, your view will tend to slow down. Some people might say that the answer is simply to denormalize - use natural keys in the bug table, and don't join. While that solution may be appropriate for some people, it is not without its downsides: database normalization was invented for a reason. The good news is that PostgreSQL is fast and has an excellent query planner, so even fairly complex queries run quite quickly. The bad news is that every query against the view is going to hit every table that's part of the view definition, so if you add enough of them, it's eventually going to be slow.
And, realistically, most of the time, users aren't going to want all the columns anyway. In a web application, 8-12 columns of output in an HTML table is typically about as much as you can squeeze in without starting to have a lot of line wrapping issues. This leads pretty naturally to the following question: if you don't need all of the columns, can you skip some of those joins and speed up the query?
Yes. In PostgreSQL 9.0, we can drop a join against a base table if (1) it's a left join, (2) there is a unique index on all or a subset of the join columns, and (3) none of the attributes from the nullable side of the join are used elsewhere in the query. So, in the above example, we could skip the joins to person at or release tr if the assigned_to or target_release columns, respectively, are not selected, assuming those tables have unique indexes on their id columns (if they don't, the join might change the number of rows in the output, so we must perform it).
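Here's a sketch of what that means in practice. Since this query touches neither assigned_to nor target_release, the 9.0 planner should be able to drop the joins to person at and release tr entirely:
EXPLAIN
SELECT id, reporter, status, priority, open_date
FROM bug_view
WHERE open_date > date '2010-01-01';
-- The resulting plan should scan bug, person r, bug_status, and priority,
-- but neither person at nor release tr.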
We can't skip joining to any of the other tables, because those are inner joins. That's an implementation restriction which I hope will be lifted in PostgreSQL 9.1, but some more logic is needed to make that safe. In the meantime, a useful workaround may be to write those joins as LEFT JOINs rather than INNER JOINs, in cases where either join type will produce the same results.
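For example, because bug.status_id is declared NOT NULL and references bug_status (id), these two queries are guaranteed to return the same rows - but only the second is written in a form the 9.0 join removal logic can act on (note that no column of bug_status is referenced):
SELECT b.id, b.open_date
FROM bug b JOIN bug_status s ON b.status_id = s.id;

SELECT b.id, b.open_date
FROM bug b LEFT JOIN bug_status s ON b.status_id = s.id;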
Monday, May 24, 2010
An Excellent Developer Meeting
I'm really pretty fired up about the results of our PGCon 2010 Developer Meeting. Of course, the list of what everyone plans to work on is pretty interesting, and if we got even a fraction of those features we'd have a pretty awesome release. But that's not really what got me fired up. What I'm excited about is some of the new and innovative thinking on replication and clustering - or, at any rate, it was new to me.
Two concepts in particular stand out for me. First, we discussed the ability to give replication solutions a crash-recoverable view into transaction commit order, a point which Jan Wieck has since expanded into a full-blown design proposal. Jan believes that this infrastructure will be useful not only for his own project, Slony, but also for other replication solutions such as Londiste which also operate by replaying transactions in commit order. As I understand it, one of the major advantages of this approach is that it eliminates the need for a global shared counter to track the order of writes (which becomes a bottleneck). Instead, they can be tagged with their order within the top-level transaction, and then the transactions as a whole can be ordered using the transaction commit ordering information.
Second, I was very interested in our discussion of a global transaction manager, for which I unfortunately do not have a good link for further reading. One possible way of avoiding cross-node serialization anomalies in a distributed database environment is to have a single node which knows about all in-flight transactions and hands out snapshots that are coherent across the entire cluster. Postgres-XC takes this approach, but there might be value in integrating something like this into core PostgreSQL. We might imagine allowing one PostgreSQL instance to be configured as a "snapshot provider" and another instance to subscribe to it. Right now, it's not clear that there's enough benefit to core PostgreSQL from accepting a patch along these lines, but there are several ways that might change as our distributed computing capabilities improve. For example, if we had a significant SQL/MED implementation, we'd need to think about how to do serialization correctly across multiple nodes; there might also be applications as we work to expand the capabilities of Hot Standby.
If your eyes are glazing over at this point, you're probably not alone. These features are fairly esoteric. Still, I think the fact that we're starting to seriously talk about these topics and consider integrating some of them into core shows that we're starting to understand better what the real needs are for replication and clustering. As our understanding of those needs continues to improve, I expect to see more capabilities in core PostgreSQL, but perhaps even more importantly, an even stronger set of tools around core PostgreSQL that will make it progressively easier to scale horizontally. I don't expect this to happen overnight, but I feel like we're moving in the right direction.
Thursday, May 20, 2010
Global Temporary and Unlogged Tables
From a technical standpoint, temporary tables in PostgreSQL have three properties that distinguish them from ordinary tables:
1. They're stored in a special schema, so that they are normally visible only to the creating backend.
2. They are managed by the local buffer manager rather than the shared buffer manager.
3. They are not WAL-logged.
It makes sense to think about removing these properties one by one, in the order listed above. Removing just the first property, without doing anything else, doesn't quite make sense, because a table which is managed by the local buffer manager can't be simultaneously accessed by multiple backends. We could work around this by having each backend access a separate set of files. This would give us a global temporary table - that is, a table which is visible to everyone, but each backend sees its own contents. (There is some debate about whether this is the right name, or what the right name for this concept might be - but that's what I'm calling it for now.)
Removing both of the first two properties also makes sense. It gives us an unlogged table - that is, a basically ordinary table for which no WAL is written. (Again, the naming is debatable.) Such tables are not crash-safe: an unexpected system crash could leave the table hopelessly corrupted. The only obvious workaround for this problem is to truncate the table on every system restart.
Why might someone want these new table types? Global temporary tables are appealing for users who need temporary tables with a relatively fixed structure, and don't want to recreate them in every new session. In addition to administrative convenience, this avoids the overhead of repeatedly creating and vacuuming the system catalog entries associated with the temporary tables, which may be a performance benefit for some users.
Unlogged tables are appealing for data that needs to be shared across backends, but which we're willing to lose in the case of a server restart. For example, consider a web application maintaining a table of active user sessions. If the server restarts, we may be willing to lose this data. Everyone will need to log in again, but considering that database crashes are rare, that may not be such a big deal. Unlogged tables also won't be replicated to standby servers, since replication relies on WAL. But, on the plus side, skipping WAL-logging should hopefully yield a significant performance benefit.
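To make that concrete, here's a purely hypothetical sketch - no such syntax exists yet - of what a sessions table along these lines might look like:
-- Hypothetical syntax: visible to all backends and skipping WAL entirely,
-- at the price of being truncated after any crash or server restart.
CREATE UNLOGGED TABLE user_session (
    session_key text PRIMARY KEY,
    user_id integer NOT NULL,
    last_activity timestamp with time zone NOT NULL
);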
I'm going to be working on implementing both of these table types for PostgreSQL 9.1. In each case, the hardest part seems to be making sure that we clean up properly after a crash or server restart.
Monday, May 10, 2010
Lots and Lots of PostgreSQL Feature Requests
I was surprised and pleased to see that my last blog post, concerning possible directions for future PostgreSQL development, got about five times as many page views as my previous five posts put together, and a total of 70 comments (to date). This may be partly because it got syndicated on LWN, where a few more comments were also posted. I've gone through the comments posted on the blog itself and on the LWN article and counted up the number of times each feature was mentioned. Of course, this is totally unscientific, but it matches up fairly well to the results of previous surveys and gives me an excuse to talk about a bunch of interesting features.
1. Materialized Views (12). See my previous post on Materialized Views in PostgreSQL. Long story short, we may quite possibly get a simple version of this in PostgreSQL 9.1, but I suspect a lot more work will be needed to meet some of the use cases people have in mind.
2. Multi-master replication (6). This is more or less the holy grail of database geeks; it's a pretty hard problem. As it turns out, there is a project in the works called Postgres-XC which does just this. I hope to learn more about this project next week at PGCon. My understanding is that it currently supports only a subset of the SQL query types supported by PostgreSQL, but that work is underway to remove these limitations. Currently, none of the Postgres-XC code can be considered for inclusion in core PostgreSQL because it uses a different license (LGPL), but it's still very interesting as an independent project.
3. Index-organized tables and/or index-only scans and/or automatic maintenance of CLUSTER order (6). I've grouped all of these features together because they're really driving toward the same underlying goal: reducing the I/O cost of an index scan. PostgreSQL will most likely not implement index-organized tables in the sense that Oracle has them, wherein, as I understand it, the table data is stored in the leaf pages of an index. However, we probably will implement index-only scans, which will allow us to gain some of the same performance benefits. Automatic maintenance of CLUSTER order would help, too, but I am not aware that anyone is currently working on that project.
4. MERGE (6). There is a (very ambitious) Google Summer of Code project to implement this feature. Stay tuned. If you're curious about what this feature actually does, I think Simon Riggs has written the best description of MERGE I've seen so far, together with some discussion of the implementation issues. A few weeks later, he followed up with some further notes on the design of the feature.
5. Partitioning syntax (5). Itagaki Takahiro proposed a patch to implement this for PostgreSQL 9.0, but we simply ran out of time. I am hopeful that this will come back to life and be improved and eventually committed for PostgreSQL 9.1. The patch as proposed would have allowed the use of dedicated syntax to specify partitions and would have automatically created appropriate CHECK constraints for each partition, but would not have automatically routed INSERTs to the proper partition, which I think is necessary to make this really useful. Of course, others may have different opinions. :-)
6. Parallel query execution (5). This is a very good idea and yet also a very hard problem; I am not aware that anyone has even proposed a possible design for this, let alone attempted to implement it. If we implement better SQL/MED support, it might be possible to get some of the benefits of this feature by spreading out data across multiple nodes but making it all appear local by creating remote tables. Or, it might be possible to leverage some of the I/O bandwidth of remote nodes by adding a feature to support non-local tablespaces (with some kind of specialized daemon process reading and writing remote pages on request). But neither of these are spot-on: what we really want is the ability to parallelize a query on a single node.
7. Procedures that can BEGIN, COMMIT, or ABORT transactions (4). This is another feature that would be great to have, but I am not aware that anyone is currently working on it.
8. Granular collation support (4). There are really two halves to this project. The SQL standard specifies a set of complex rules for determining which collation should be used for a particular comparison or ORDER BY operation. In PostgreSQL, you could imagine setting the collation for a particular session using the SET command; associating a collation with a particular column; or overriding the collation for a particular instance of ORDER BY or a particular use of the < operator. So, one half of this problem is simply being able to recognize which collation applies in a particular context and doing the sort or comparison under that collation. The other half of the problem is extending our indexing system to handle multiple collations - either the ability to create an index with a particular collation (which can then be used to satisfy queries that pertain to that collation) or even the ability to create a single index which can somehow answer queries pertaining to multiple collations.
9. Better drivers (4). JDBC was mentioned twice, and asynchronous drivers were mentioned twice. It was also suggested that we should have an "official" Python driver. As a project, we've generally been wary about endorsing other projects, perhaps to our detriment. But good things seem to be happening with the psycopg2 project, especially the recent license change.
10. Graph database features (3). At least one person observed that you can do some of what is wanted here using common table expressions, also known as WITH queries, which are supported beginning in PostgreSQL 8.4 (see the small example at the end of this post). But several people seem to feel that we should have more graph support; one poster mentioned algorithms such as all-pairs shortest paths. I don't have a clear idea of what is needed here, but I suspect that some of the things people are looking for here could be implemented as PostgreSQL extensions. It would be interesting to see some more detailed requirements.
11. Documentation (3). I was surprised to see several requests for documentation among the comments, since I have generally found the PostgreSQL documentation to be superb and one of the great strengths of the project. But one poster did hit on an issue which I think is entirely legitimate: if you Google something like "PostgreSQL documentation", you get a link to our main documentation. But if you Google "PostgreSQL ALTER TABLE", you get a link to the documentation for ALTER TABLE in PostgreSQL 8.1, whereas you might hope to get a link to the 8.4 version of the documentation for that command. And if you Google "PostgreSQL setting", well, let's just say you don't get a link to a page that tells you how to change PostgreSQL settings. If you actually go to the documentation page and navigate through it manually, it's quite easy to find what you're looking for, but there must be something about our site that makes Google fail to grok it properly.
Still another poster was looking for documentation in Chinese. Currently, I believe that we maintain documentation only in English due to the rather large translation effort that would be involved in keeping documentation up to date in multiple languages. In fact, we currently don't even ship Chinese translations of our error messages, due to the fact that our existing set of translations is too incomplete. If you would like to help localize PostgreSQL for your native language, please see our wiki page on NLS. Volunteers are needed!
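As promised under item 10, here's a small sketch of the kind of graph traversal that's already possible with 8.4's recursive CTEs, using a hypothetical edge (src, dst) adjacency table:
-- All nodes reachable from node 1, with the minimum hop count for each;
-- the depth cap keeps cycles from recursing forever.
WITH RECURSIVE reachable(node, depth) AS (
    SELECT e.dst, 1 FROM edge e WHERE e.src = 1
    UNION
    SELECT e.dst, r.depth + 1
    FROM edge e JOIN reachable r ON e.src = r.node
    WHERE r.depth < 10
)
SELECT node, min(depth) AS hops FROM reachable GROUP BY node;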