Showing posts with label OLAP workloads. Show all posts
Showing posts with label OLAP workloads. Show all posts

Wednesday 19 February 2020

Parallelism, what next?

This blog post is about the journey of parallelism in PostgreSQL till now and what is in store for the future.  Since PostgreSQL 9.6 where the first feature of parallel query has arrived, each release improves it.  Below is a brief overview of the parallel query features added in each release.

PG9.6 has added Parallel execution of sequential scans, joins, and aggregates.

PG10 has added (a) Support parallel B-tree index scans, (b) Support parallel bitmap heap scans, (c) Allow merge joins to be performed in parallel, (d) Allow non-correlated subqueries to be run in parallel, (e) Improve ability of parallel workers to return pre-sorted data and (f) Increase parallel query usage in procedural language functions.

PG11 has added (a) Allow parallel building of a btree index, (b) Allow hash joins to be performed in parallel using a shared hash table, (c) Allow parallelization of commands CREATE TABLE ... AS, SELECT INTO, and CREATE MATERIALIZED VIEW, (d) Allow UNION to run each SELECT in parallel if the individual SELECTs cannot be parallelized, (e) Allow partition scans to more efficiently use parallel workers, (f) Allow LIMIT to be passed to parallel workers, this allows workers to reduce returned results and use targeted index scans, (g) Allow single-evaluation queries, e.g. WHERE clause aggregate queries, and functions in the target list to be parallelized.

PG12 has added Allow parallelized queries when in SERIALIZABLE isolation mode.

The progress for PG13 with respect to parallelism.  Some of the important advancements are:
(a) Parallel vacuum - This feature allows the vacuum to leverage multiple CPUs in order to process indexes.  This enables us to perform index vacuuming and index cleanup with background workers.  This adds a PARALLEL option to VACUUM command where the user can specify the number of workers that can be used to perform the command which is limited by the number of indexes on a table.  Specifying zero as a number of workers will disable parallelism.  For more information, see commit.

(b) Improve EXPLAIN's handling of per-worker details.  This allows displaying the worker information in a much better way.  The few visible side-effects as mentioned in the commit

* In text format, instead of something like

  Sort Method: external merge  Disk: 4920kB
  Worker 0:  Sort Method: external merge  Disk: 5880kB
  Worker 1:  Sort Method: external merge  Disk: 5920kB
  Buffers: shared hit=682 read=10188, temp read=1415 written=2101
  Worker 0:  actual time=130.058..130.324 rows=1324 loops=1
    Buffers: shared hit=337 read=3489, temp read=505 written=739
  Worker 1:  actual time=130.273..130.512 rows=1297 loops=1
    Buffers: shared hit=345 read=3507, temp read=505 written=744

you get

  Sort Method: external merge  Disk: 4920kB
  Buffers: shared hit=682 read=10188, temp read=1415 written=2101
  Worker 0:  actual time=130.058..130.324 rows=1324 loops=1
    Sort Method: external merge  Disk: 5880kB
    Buffers: shared hit=337 read=3489, temp read=505 written=739
  Worker 1:  actual time=130.273..130.512 rows=1297 loops=1
    Sort Method: external merge  Disk: 5920kB
    Buffers: shared hit=345 read=3507, temp read=505 written=744

(c) Avoid unnecessary shm writes in Parallel Hash Join.  This improves the performance of Parallel Hash Join by a significant amount on large systems running many-core joins.  Though this work has been back-patched to v11 where Parallel Hash Join was introduced, I mentioned it here as it is done during PG13 development.  For more information, see commit.

What is being discussed for the future:
(a) Parallel grouping sets - PostgreSQL already supports parallel aggregation by aggregating in two stages. First, each process participating in the parallel portion of the query performs an aggregation step, producing a partial result for each group of which that process is aware. Second, the partial results are transferred to the leader via the Gather node. Finally, the leader re-aggregates the results across all workers in order to produce the final result.

Next, there has been a discussion in the community to parallelize queries containing grouping sets in much the same way as we do parallel aggregation.
Basically, the aim is to parallelize queries like SELECT brand, size, sum(sales) FROM items_sold GROUP BY GROUPING SETS ((brand), (size), ());
This feature has been proposed for PG13, but yet not committed.

(b) Parallel copy - We also want to parallelize the Copy command, in particular "Copy <table_name> from .. ;" command.  This will help improve the bulk load operation in PostgreSQL.  Currently, we do a lot of work during the Copy command.  We read the file in 64KB chunks, then find the line endings and process that data line by line, where each line corresponds to one tuple.  We first form the tuple (in form of value/null array) from that line, check if it qualifies the where condition and if it qualifies, then perform constraint check and few other checks and then finally store it in local tuple array.  Once we reach 1000 tuples or consumed 64KB (whichever occurred first), we insert them together and then for each tuple insert into the index(es) and execute after row triggers.  The aim of this work is to parallelize as much as possible the work done during the copy.  There is an ongoing discussion in the community on this topic.

There is a small caveat here that to achieve parallel copy, we need to work on relation extension lock where parallel workers block each other while extending the relation which is not the case currently.  There is already a discussion on this topic in the community.

(c) Parallel file_fdw - The proposed work in this area allows file_fdw to divide its scan up for parallel workers, much like a parallel seq scan.

There are more areas where parallelism can be used like parallel DMLs (inserts, updates, deletes).  During a discussion with Thomas Munro, it came up that it would be beneficial if we can parallelize index creation and index scans for indexes other than btree especially gin and gist.  Note that we already support parallel index scans and parallel index creation for btree. I would not like to go in detail of these operations as till now we haven't seen any proposal for those.  Similarly, we can improve few things in our current parallel infrastructure (a) As of now, for each query the parallel workers are created and destroyed, instead we can have some pool of parallel query workers which can avoid the overhead of starting them for each query, (b) As of now, each worker can use up to work_mem of memory which might increase the overall memory usage of query.  We might want to improve this, but currently, there is no proposal for this.

Friday 25 May 2018

Parallel Index Scans In PostgreSQL


There is a lot to say about parallelism in PostgreSQL. We have come a long way since I wrote my first post on this topic (Parallel Sequential Scans). Each of the past three releases (including PG-11, which is in its beta) have a parallel query as a major feature which in itself says how useful is this feature and the amount of work being done on this feature. You can read more about parallel query from the PostgreSQL docs or from a blog post on this topic by my colleague Robert Haas. The intent of this blog post is to talk about parallel index scans which were released in PostgreSQL 10. Currently, we have supported parallel scan for btree-indexes.

To demonstrate how the feature works, here is an example of TPC-H Q-6 at scale factor - 20 (which means approximately 20GB database). Q6 is a forecasting revenue change query. This query quantifies the amount of revenue increase that would have resulted from eliminating certain company-wide discounts in a given percentage range in a given year. Asking this type of "what if" query can be used to look for ways to increase revenues.

explain analyze
select sum(l_extendedprice * l_discount) as revenue
          from lineitem
          where l_shipdate >= date '1994-01-01' and
          l_shipdate < date '1994-01-01' + interval '1' year and
          l_discount between 0.02 - 0.01 and 0.02 + 0.01 and
          l_quantity < 24
          LIMIT 1;

Non-parallel version of plan
-------------------------------------
Limit
-> Aggregate
    -> Index Scan using idx_lineitem_shipdate on lineitem
         Index Cond: ((l_shipdate >= '1994-01-01'::date) AND (l_shipdate < '1995-01-01
         00:00:00'::timestamp without time zone) AND (l_discount >= 0.01) AND
         (l_discount <= 0.03)  AND  (l_quantity < '24'::numeric))
Planning Time: 0.406 ms
Execution Time: 35073.886 ms

Parallel version of plan
-------------------------------
Limit
-> Finalize Aggregate
    -> Gather
         Workers Planned: 2
         Workers Launched: 2
          -> Partial Aggregate
               -> Parallel Index Scan using idx_lineitem_shipdate on lineitem
                    Index Cond: ((l_shipdate >= '1994-01-01'::date) AND (l_shipdate < '1995-01-01 
                    00:00:00'::timestamp without time zone) AND (l_discount >= 0.01) AND
                    (l_discount <= 0.03) AND (l_quantity < '24'::numeric))
Planning Time: 0.420 ms
Execution Time: 15545.794 ms

We can see that the execution time is reduced by more than half for a parallel plan with two parallel workers. This query filters many rows and the work (CPU time) to perform that is divided among workers (and leader), leading to reduced time.

To further see the impact with a number of workers, we have used somewhat bigger dataset (scale_factor = 50). The setup has been done using TPC-H like benchmark for PostgreSQL. We have also created few additional indexes on columns (l_shipmode, l_shipdate, o_orderdate, o_comment)

Non-default parameter settings:
random_page_cost = seq_page_cost = 0.1
effective_cache_size = 10GB
shared_buffers = 8GB
work_mem = 1GB




The time is reduced almost linearly till 8 workers and then it reduced slowly. The further increase in workers won’t help unless the data to scan increases.

We have further evaluated the parallel index scan feature for all the queries in TPC-H benchmark and found that it is used in a number of queries and the impact is positive (reduced the execution time significantly). Below are results for TPC-H, scale factor - 20 with a number of parallel workers as 2. X-axis indicates (1: Q-6, 2: Q14, 3: Q18).


Under the Hood
The basic idea is quite similar to parallel heap scans where each worker (including leader whenever possible) will scan a block (all the tuples in a block) and then get the next block that is required to be scan. The parallelism is implemented at the leaf level of a btree. The first worker to start a btree scan will scan till it reaches the leaf and others will wait till the first worker has reached the leaf. Once, the first worker read the leaf block, it sets the next block to be read and wakes one of the workers waiting to scan blocks. Further, it proceeds scanning tuples from the block it has read. Henceforth, each worker after reading a block sets the next block to be read and wakes up the next waiting worker. This continues till no more pages are left to scan at which we end the parallel scan and notify all the workers.

A new guc min_parallel_index_scan_size has been introduced which indicates the minimum amount of index data that must be scanned in order for a parallel scan to be considered. Users can try changing the value of this parameter to see if the parallel index plan is effective for their queries. The number of parallel workers is decided based on the number of index pages to be scanned. The final cost of parallel plan considers the cost (CPU cost) to process the rows will be divided equally among workers.

In the end, I would like to thank the people (Rahila Syed and Robert Haas) who were involved in this work (along with me) and my employer EnterpriseDB who has supported this work. I would also like to thank Rafia Sabih who helped me in doing performance testing for this blog.