If you haven't followed progress in this area closely, I highly recommend that you read the Built-in Sharding page which my colleague Bruce Momjian wrote up for the PostgreSQL wiki in December of 2016 as well as the very interesting slides which Etsuro Fujita, Kyotaro Horiguchi, Masahiko Sawada, and Amit Langote presented at PGCONF.ASIA 2016. (Note that the atomic commit feature mentioned in that presentation did not make it into PostgreSQL 11.)
One way to look at sharding is as a form of partitioning where the partitions might happen to be foreign tables rather than local tables. From that point of view, the fact that PostgreSQL 11 made huge improvements in the area of partitioning is very significant. Most of those improvements are just as relevant for sharding as they are for partitioning, and some of them are considerably more beneficial for sharding. In particular, hash partitioning is very useful for dividing data evenly across a set of remote servers, and partition-wise aggregate allows the aggregate pushdown capability added in PostgreSQL 10 to work with partitioned tables.
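To make this concrete, here is a minimal sketch of a hash-partitioned table whose partitions live on remote servers via postgres_fdw. The server names, connection options, and table definitions are hypothetical, but the syntax is what PostgreSQL 11 accepts:

    CREATE EXTENSION postgres_fdw;

    -- One foreign server (and user mapping) per shard; shard2 and shard3
    -- would be created the same way.
    CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'node1.example.com', dbname 'app');
    CREATE USER MAPPING FOR CURRENT_USER SERVER shard1
        OPTIONS (user 'app', password 'secret');

    -- A hash-partitioned parent whose partitions are foreign tables, so
    -- each slice of the data physically lives on a different server.
    CREATE TABLE orders (
        order_id    bigint,
        customer_id bigint,
        region      text,
        total       numeric
    ) PARTITION BY HASH (customer_id);

    CREATE FOREIGN TABLE orders1 PARTITION OF orders
        FOR VALUES WITH (MODULUS 3, REMAINDER 0)
        SERVER shard1 OPTIONS (table_name 'orders1');
    -- ... orders2 and orders3 likewise, with REMAINDER 1 and 2, on
    -- shard2 and shard3; each remote server holds a matching plain table.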
One of the most intimidating remaining problems is that the PostgreSQL executor can't run queries against multiple foreign partitions at the same time. Because of all the work that has been done over the last several years on join pushdown, aggregate pushdown, partition-wise join, and partition-wise aggregate, it's possible in PostgreSQL 11 to get a query plan that looks (approximately) like this:
Append
  ->  Foreign Scan
        Relations: Aggregate on orders1 INNER JOIN order_lines1
  ->  Foreign Scan
        Relations: Aggregate on orders2 INNER JOIN order_lines2
  ->  Foreign Scan
        Relations: Aggregate on orders3 INNER JOIN order_lines3
This is a big improvement over what was possible just a few years ago. The join in the original query has been decomposed into joins between the matching pairs of partitions, and likewise the aggregate has been made to happen on a per-partition basis. Then, the join and aggregate have been pushed to the remote side. That is great. What is not so great is that the foreign scans will be executed consecutively, not concurrently. Several attempts have been made to solve this problem, but those designs have either handled only a subset of the important cases or carried a significant risk of slowing down the executor in general. My colleague Andres Freund is now working on it, and believes that he can solve it in a general way while actually speeding up the executor.
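For reference, here is a query shape that can produce the plan above, using the hypothetical tables sketched earlier (order_lines is assumed to be hash-partitioned on customer_id in the same way as orders, and each matching pair of partitions is assumed to live on the same foreign server). Both planner settings default to off in PostgreSQL 11:

    SET enable_partitionwise_join = on;
    SET enable_partitionwise_aggregate = on;

    -- Join on the partition key and group by it, so that both the join
    -- and the aggregate can be performed partition by partition and then
    -- shipped whole to each remote server by postgres_fdw.
    EXPLAIN (COSTS OFF)
    SELECT o.customer_id, count(*)
    FROM orders o
    JOIN order_lines l USING (customer_id)
    GROUP BY o.customer_id;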
Apart from that problem, I think there are basically three major areas that need work:
1. More pushdown. The example above shows a case where we manage to push nearly all of the work to the remote side, but there are plenty of important cases where that won't happen, such as a join of a sharded table to a small local table. Today, we'll pull the remote data to the local node, but we really ought to consider pushing the local table to the remote node (or using a replicated copy that already exists there). Similarly, partition-wise aggregate won't succeed in pushing the aggregate to the remote side unless the partition key matches the grouping key; we should consider performing a partial aggregate on the remote side (a sketch of one such query follows this list). Work done over the last several years has made quite a few of the simple cases work, but there is a lot more to do. There is also other performance work to be done, such as speeding up bulk loading into postgres_fdw foreign tables, as well as infrastructure improvements to improve plan quality and planning speed in cases involving foreign tables.
2. ACID. Right now, if a server fails while committing a write transaction that involves multiple nodes (via postgres_fdw or some other foreign data wrapper), you might end up with the transaction committed on some nodes but not others. There's a patch for that, but it didn't make it into PostgreSQL 11 (a sketch of the manual two-phase commit that such a feature would automate follows this list). Similarly, we'd really like to have MVCC snapshots that span the entire cluster, so that if you start and commit a transaction T1, a subsequently-started transaction T2 will either see T1 as committed on every node or as not-committed on every node. Several attempts have been made to solve this problem, most recently by Stas Kelvich. I think these are quite difficult problems, and many of the solutions proposed to date have notable disadvantages, but I don't think the problems are intractable. Another, related area that also needs attention is global deadlock detection.
3. Provisioning and Management. It would be nice to be able to set up partitioned tables with foreign partitions, create matching tables on the remote side, configure some sort of global transaction manager to provide MVCC semantics, and run queries that are well-optimized and return results quickly. However, getting it all working today is more manual than many users would prefer, as the setup sketch earlier in this post suggests. Some thought needs to be given to making this easier to set up and configure, and also to issues such as backup and failover in a distributed environment.
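Returning to the first item, here is a sketch of a query that today defeats aggregate pushdown, using the hypothetical orders table from earlier:

    -- The grouping key (region) does not match the partition key
    -- (customer_id), so the same region can appear on every shard and no
    -- per-partition aggregate can produce final groups. Today the rows
    -- are pulled back and aggregated locally; pushing a *partial* sum to
    -- each shard and combining the partials locally would move most of
    -- the work to the remote side.
    SELECT region, sum(total)
    FROM orders
    GROUP BY region;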
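For the second item, the commands that an automatic atomic-commit feature would need to drive already exist: PostgreSQL's two-phase commit. A rough sketch of what a coordinator currently has to do by hand, on a separate connection to each shard (the transaction identifiers here are arbitrary strings, and each shard must have max_prepared_transactions set above zero):

    -- Phase one, on every shard that was written to: persist the
    -- transaction in a prepared state that survives a crash.
    PREPARE TRANSACTION 'txn42_shard1';

    -- Phase two, only after every shard has prepared successfully:
    COMMIT PREPARED 'txn42_shard1';

    -- If any shard failed to prepare, the others are rolled back instead:
    -- ROLLBACK PREPARED 'txn42_shardN';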
I don't think it's possible to identify a strict priority order between these items. A few years ago, I took the position that pushdown capabilities were more important than anything else: a system that lacks good provisioning tools and ACID semantics may still get used if it is fast enough, but a system that is too slow will not be useful regardless of anything else.
In light of the development already completed, though, the situation seems less clear. If you happen to be using queries that are well-handled by the pushdown capabilities in v11, further development in that area may not seem like a high priority, but if not, it may seem like the top priority. Similarly, if you are mostly running queries against static data, or against a data set where there are new bulk loads but no modification of existing data, or if your alternative is a NoSQL solution that doesn't even deliver ACID semantics on a single node, then you might not care very much about ACID semantics across nodes; but if you're running a highly transactional workload that relies on those semantics for correctness, you probably care a great deal. Finally, if you're an expert PostgreSQL DBA, or have some really good tools to help you out, provisioning and management may not seem unreasonably difficult; if you're new to PostgreSQL, or at least to this area, you may have no idea where to start.
I don't believe that anyone has a complete vision of what a built-in sharding feature for PostgreSQL should look like just yet. Certainly, I don't. At some point, perhaps not too far from now, the project may reach a stage where such a vision is absolutely indispensable to further progress. It's not too soon to start thinking about what that vision ought to be. At the same time, even without such a vision, a great deal of progress has been made. I think we are close to achieving critical mass, and I'm looking forward to seeing what the future holds.
Is it possible to take parts of Postgres-XL into core PostgreSQL like the Global Lock Manager? Talking from my ignorance here.
It might be, but I haven't studied that issue, so I'm not sure how much work would be required.
Anything with the name "Global" can potentially destroy scalability. (That's a lesson I learned from the Oracle world.)
Thank you for the nice overview of the plans.
I am also surprised that Postgres-XL (or XC) has not been mentioned (OK, it's mentioned in the linked wiki), as these forks have been around for quite some time and have exposed a number of issues that lead to limited usability with the standard FDW approach: won't the lack of cooperation between the shards likely keep certain queries from scaling (e.g., correlated subqueries, certain star-schema queries)?
Obviously the XL code will need more work, but significant steps have been taken over the last two years to bring it much closer to core Postgres. Also, some XL committers are also PG committers, so there's potential for synergy there that does not seem to be considered in the plans you outlined?
Do you plan to take some parts (or lessons) from XL, or would you rather write everything from scratch? And is all the work really going to be based on independent shards, rather than on grid- or distributed-queue-like cooperation between the shards to work more efficiently?
I think that whether and how much gets taken from XL is really up to what individual patch authors want to do in specific cases. I certainly think we should learn whatever lessons we can from the XC/XL/X2 experience, but whether it's easier to take code or to just take ideas is something that needs to be determined by whoever is doing the work in particular cases. I don't have a position that there is only one right way to do it; I think it's just a question of which approach seems like it will produce the best results.
Thanks Robert, nice to see more features taking us closer and closer to having sharding natively in Postgres. And great to hear it's no longer a topic of debate.
I wish to mention that ideas can also be sought from Greenplum Database, an open-source, massively parallel, sharding-enabled Postgres-based data platform. A global deadlock detector was recently implemented and committed to Greenplum master.