PostgreSQL and Databases in general: Btree Index

Friday 25 May 2018

Parallel Index Scans In PostgreSQL

There is a lot to say about parallelism in PostgreSQL. We have come a long way since I wrote my first post on this topic (Parallel Sequential Scans). Each of the past three releases (including PG-11, which is in its beta) have a parallel query as a major feature which in itself says how useful is this feature and the amount of work being done on this feature. You can read more about parallel query from the PostgreSQL docs or from a blog post on this topic by my colleague Robert Haas. The intent of this blog post is to talk about parallel index scans which were released in PostgreSQL 10. Currently, we have supported parallel scan for btree-indexes.

To demonstrate how the feature works, here is an example of TPC-H Q-6 at scale factor - 20 (which means approximately 20GB database). Q6 is a forecasting revenue change query. This query quantifies the amount of revenue increase that would have resulted from eliminating certain company-wide discounts in a given percentage range in a given year. Asking this type of "what if" query can be used to look for ways to increase revenues.

explain analyze

select sum(l_extendedprice * l_discount) as revenue
from lineitem

where l_shipdate >= date '1994-01-01' and

l_shipdate < date '1994-01-01' + interval '1' year and

l_discount between 0.02 - 0.01 and 0.02 + 0.01 and

l_quantity < 24
LIMIT 1;

Non-parallel version of plan
-------------------------------------
Limit
-> Aggregate
-> Index Scan using idx_lineitem_shipdate on lineitem
Index Cond: ((l_shipdate >= '1994-01-01'::date) AND (l_shipdate < '1995-01-01

00:00:00'::timestamp without time zone) AND (l_discount >= 0.01) AND

(l_discount <= 0.03) AND (l_quantity < '24'::numeric))

Planning Time: 0.406 ms
Execution Time: 35073.886 ms

Parallel version of plan
-------------------------------
Limit
-> Finalize Aggregate
-> Gather

Workers Planned: 2

Workers Launched: 2
-> Partial Aggregate
-> Parallel Index Scan using idx_lineitem_shipdate on lineitem
Index Cond: ((l_shipdate >= '1994-01-01'::date) AND (l_shipdate < '1995-01-01

00:00:00'::timestamp without time zone) AND (l_discount >= 0.01) AND

(l_discount <= 0.03) AND (l_quantity < '24'::numeric))

Planning Time: 0.420 ms
Execution Time: 15545.794 ms

We can see that the execution time is reduced by more than half for a parallel plan with two parallel workers. This query filters many rows and the work (CPU time) to perform that is divided among workers (and leader), leading to reduced time.

To further see the impact with a number of workers, we have used somewhat bigger dataset (scale_factor = 50). The setup has been done using TPC-H like benchmark for PostgreSQL. We have also created few additional indexes on columns (l_shipmode, l_shipdate, o_orderdate, o_comment)

Non-default parameter settings:
random_page_cost = seq_page_cost = 0.1
effective_cache_size = 10GB
shared_buffers = 8GB
work_mem = 1GB

The time is reduced almost linearly till 8 workers and then it reduced slowly. The further increase in workers won’t help unless the data to scan increases.

We have further evaluated the parallel index scan feature for all the queries in TPC-H benchmark and found that it is used in a number of queries and the impact is positive (reduced the execution time significantly). Below are results for TPC-H, scale factor - 20 with a number of parallel workers as 2. X-axis indicates (1: Q-6, 2: Q14, 3: Q18).

Under the Hood

The basic idea is quite similar to parallel heap scans where each worker (including leader whenever possible) will scan a block (all the tuples in a block) and then get the next block that is required to be scan. The parallelism is implemented at the leaf level of a btree. The first worker to start a btree scan will scan till it reaches the leaf and others will wait till the first worker has reached the leaf. Once, the first worker read the leaf block, it sets the next block to be read and wakes one of the workers waiting to scan blocks. Further, it proceeds scanning tuples from the block it has read. Henceforth, each worker after reading a block sets the next block to be read and wakes up the next waiting worker. This continues till no more pages are left to scan at which we end the parallel scan and notify all the workers.

A new guc min_parallel_index_scan_size has been introduced which indicates the minimum amount of index data that must be scanned in order for a parallel scan to be considered. Users can try changing the value of this parameter to see if the parallel index plan is effective for their queries. The number of parallel workers is decided based on the number of index pages to be scanned. The final cost of parallel plan considers the cost (CPU cost) to process the rows will be divided equally among workers.

In the end, I would like to thank the people (Rahila Syed and Robert Haas) who were involved in this work (along with me) and my employer EnterpriseDB who has supported this work. I would also like to thank Rafia Sabih who helped me in doing performance testing for this blog.

Friday 17 March 2017

Hash indexes are faster than Btree indexes?

PostgreSQL have supported Hash Index for a long time, but they are not much used in production mainly because they are not durable. Now, with the next version of PostgreSQL, they will be durable. The immediate question is how do they perform as compared to Btree indexes. There is a lot of work done in the coming version to make them faster. There are multiple ways in which we can compare the performance of Hash and Btree indexes, like the time taken for creation of the index, search or insertion in the index. This blog will mainly focus on the search operation. By definition, hash indexes are O(1) and Btree indexes are O(log n), however with duplicates that is not exactly true.

To start with let us see the impact of work being done to improve the performance of hash indexes. Below is the performance data of the pgbench read-only workload to compare the performance difference of Hash indexes between 9.6 and HEAD on IBM POWER-8 having 24 cores, 192 hardware threads, 492GB RAM.

The workload is such that all the data fits in shared buffers (scale factor is 300 (~4.5GB) and shared_buffers is 8GB). As we can see from the above graph, that the performance has increased at all client counts in the range of 7% to 81% and the impact is more pronounced at higher client counts. The main work which has led to this improvement is 6d46f478 (Improve hash index bucket split behavior.) and 293e24e5 (Cache hash index's metapage in rel->rd_amcache.).

The first commit 6d46f478 has changed the heavyweight locks (locks that are used for logical database objects to ensure the database ACID properties) to lightweight locks (locks to protect shared data structures) for scanning the bucket pages. In general, acquiring the heavyweight lock is costlier as compare to lightweight locks. In addition to reducing the locking cost, this also avoids locking out scans and inserts for the lifetime of the split.

The second commit 293e24e5 avoids a significant amount of contention for accessing metapage. Each search operation needs to access metapage to find the bucket that contains tuple being searched which leads to high contention around metapage. Each access to metapage needs to further access buffer manager. This work avoids that contention by caching the metapage information in backend local cache which helps bypassing all the buffer manager related work and hence the major contention in accessing the metapage.

The next graph shows how the hash index performs as compared to the btree index. In this run we have changed hash to btree index in pgbench read-only tests.

We can see here that the hash index performs better than the btree index and the performance difference is in the range of 10 to 22%. In some other workloads we have seen a better performance like with hash index on varchar columns and even in the community, it has been reported that there is performance improvement in the range of 40-60% when hash indexes are used for unique index columns.

The important thing to note about the above data is that it is only on some of the specific workloads and it mainly covers Selects as that is the main area where performance improvement work has been done for PostgreSQL10. The other interesting parameters to compare are the size of the index and update on the index which needs more study and experiments.

In the end, I would like to thank my colleagues who were directly involved in this work and my employer EnterpriseDB who has supported this work. Firstly I would like to thank, Robert Haas who has envisioned all this work and is the committer of this work, and Mithun C Y who was the author of commit 293e24e5. Also, I would like to extend sincere thanks to all the community members who are involved in this work and especially Jeff Janes and Jesper Pedersen who have reviewed and tested this work.