Parallelism is now a reality in PostgreSQL. With 9.6, I hope we will see many
different forms of queries that can use parallelism during execution. For now, I will
limit this discussion to what we can already do, which is Parallel Sequential
Scans.
Parallel Sequential Scans scan a relation in parallel with the help of
background workers, which in turn improves the performance of such scans. I
will discuss the scenarios where users can expect a performance boost
from this feature later in this blog, but first let us understand the basic feature
and how it works. Three new GUC parameters have been added to tune the
usage of this feature.
max_parallel_degree - This is used to set the maximum number of workers that
can be used for an individual parallel operation. It is quite possible that the
requested number of workers is not available at execution time. Parallel workers
are taken from the pool of processes established by max_worker_processes, which
means that the value of max_parallel_degree should be less than max_worker_processes.
It is generally not useful to set the value of this parameter higher than the number
of CPUs on your system.
parallel_tuple_cost - This is used by the planner to estimate the cost of transferring a
tuple from a parallel worker process to the master backend. The default is 0.1. The more
tuples that need to be passed from the worker backend processes to the
master backend process, the higher this cost becomes, and the higher the overall cost of
the parallel sequential scan plan.
parallel_setup_cost - This is used by the planner to estimate the cost of launching parallel
worker processes and setting up the dynamic shared memory they use to communicate.
The default is 1000.
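As a minimal sketch of how these parameters can be inspected and tuned for a session (the values below are purely illustrative, not recommendations):
show max_worker_processes;       -- the pool from which parallel workers are taken
set max_parallel_degree = 4;     -- cap on workers per parallel operation
set parallel_tuple_cost = 0.1;   -- planner cost per tuple sent to the master backend
set parallel_setup_cost = 1000;  -- planner cost of launching workers and setting up shared memory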
Now let us see a simple example to demonstrate how parallel sequential scans work:
create table tbl_parallel_test(c1 int, c2 char(1000));
insert into tbl_parallel_test values(generate_series(1,1000000),'aaaaa');
Analyze tbl_parallel_test;
Explain analyze select * from tbl_parallel_test where c1 < 10000 and
c2 like '%bb%';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Seq Scan on tbl_parallel_test
(cost=0.00..157858.09 rows=1 width=1008)
(actual time=378.414..378.414 rows=0 loops=1)
Filter: ((c1 < 10000) AND (c2 ~~ '%bb%'::text))
Rows Removed by Filter: 1000000
Planning time: 0.075 ms
Execution time: 378.431 ms
(5 rows)
Set max_parallel_degree to enable the use of parallelism in queries.
set max_parallel_degree = 6;
Explain analyze select * from tbl_parallel_test where c1 < 10000
and c2 like '%bb%';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..29701.57 rows=1 width=1008)
(actual time=182.708..182.708 rows=0 loops=1)
Number of Workers: 5
-> Parallel Seq Scan on tbl_parallel_test
(cost=0.00..28701.47 rows=1 width=1008)
(actual time=179.496..1081.120 rows=0 loops=1)
Filter: ((c1 < 10000) AND (c2 ~~ '%bb%'::text))
Rows Removed by Filter: 1000000
Planning time: 0.078 ms
Execution time: 200.610 ms
(7 rows)
Here, we can see how changing max_parallel_degree allows the use of parallel workers
to perform parallel sequential scans. We can also notice in the above example that even though
we have set max_parallel_degree to 6, only 5 workers are used; the reason is that,
currently, the number of parallel workers is chosen based on the size of the relation.
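As an illustration (this is not the planner's exact formula, just a way to see the input that drives the choice), the relation size can be checked with:
select pg_size_pretty(pg_relation_size('tbl_parallel_test'));
-- larger relations get more workers, up to the max_parallel_degree limit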
Next, let us discuss the usage of functions in parallel queries. A new clause, PARALLEL,
has been added to the CREATE FUNCTION statement. Users can specify one of three values
with this clause.
1. PARALLEL Unsafe - This indicates that the function can't be executed in parallel mode
and the presence of such a function in a SQL statement forces a serial execution plan.
2. PARALLEL Restricted - This indicates that the function can be executed in parallel mode,
but its execution is restricted to the parallel group leader. As of now, if the qualification for any
particular relation has anything that is parallel restricted, that relation won't be chosen for
parallelism.
3. PARALLEL Safe - This indicates that the function is safe to run in parallel mode without
restriction.
The default value for functions is PARALLEL Unsafe.
Now let us see the impact of using Parallel Safe and Unsafe functions in queries. I will
continue using the query from the previous example to explain the concept.
Create a Parallel Safe function:
-- the factorial of fact_val is computed purely to consume CPU; its result is
-- discarded and the first argument is returned unchanged
create or replace function calc_factorial(a integer, fact_val integer)
returns integer
as $$
begin
    perform (fact_val)!;
    return a;
end;
$$ language plpgsql PARALLEL Safe;
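A quick sanity check of the function, run serially: the factorial of the second argument is computed and thrown away, and the first argument comes back unchanged.
select calc_factorial(5, 10);
-- returns 5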
Use it in a query:
Explain analyze select * from tbl_parallel_test where
c1 < calc_factorial(10000, 10)
and c2 like '%bb%';
QUERY PLAN
--------------------------------------------------------------------------------
Gather (cost=1000.00..75154.99 rows=1 width=1008)
(actual time=120566.456..120566.456 rows=0 loops=1)
Number of Workers: 5
-> Parallel Seq Scan on tbl_parallel_test
(cost=0.00..74154.89 rows=1 width=1008)
(actual time=119635.421..359721.498 rows=0 loops=1)
Filter: ((c2 ~~ '%bb%'::text) AND (c1 < calc_factorial(10000, 10)))
Rows Removed by Filter: 1000000
Planning time: 54.904 ms
Execution time: 120622.631 ms
(7 rows)
Here we can see that a parallel plan is chosen and the parallel safe function
is pushed to the workers for evaluation of the quals.
Now let's mark the function as PARALLEL Unsafe and see how the above
query behaves.
Alter Function calc_factorial(integer, integer) PARALLEL Unsafe;
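The change can be confirmed from the catalog, since the declared parallel safety of a function is stored in the proparallel column of pg_proc ('s' = safe, 'r' = restricted, 'u' = unsafe):
select proname, proparallel from pg_proc where proname = 'calc_factorial';
-- proparallel should now show 'u'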
Explain analyze select * from tbl_parallel_test where
c1 < calc_factorial(10000, 10)
and c2 like '%bb%';
QUERY PLAN
--------------------------------------------------------------------------------
Seq Scan on tbl_parallel_test
(cost=0.00..407851.91 rows=1 width=1008)
(actual time=33166.138..33166.138 rows=0 loops=1)
Filter: ((c2 ~~ '%bb%'::text) AND (c1 < calc_factorial(10000, 10)))
Rows Removed by Filter: 1000000
Planning time: 0.162 ms
Execution time: 33166.208 ms
(5 rows)
So using a parallel unsafe function in a query leads to a serial plan.
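Note that to get parallel plans again for the performance tests below, the function presumably needs to be marked safe once more, for example:
Alter Function calc_factorial(integer, integer) PARALLEL Safe;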
Next, let us look at the performance characteristics of parallelism.
Non-default settings used to collect performance data:
shared_buffers = 32GB; min_wal_size = 5GB; max_wal_size = 10GB;
checkpoint_timeout = 30min; max_connections = 300;
max_worker_processes = 100;
Test setup
create table tbl_perf(c1 int, c2 char(1000));
insert into tbl_perf values(generate_series(1,30000000),'aaaaa');
Explain analyze select c1 from tbl_perf where
c1 > calc_factorial($1,10) and
c2 like '%aa%';
The function calc_factorial is the same as in the previous example, and the values passed
to it are chosen so that the desired percentage of rows is selected. For example:
-- to select 1% of rows, the below query can be used
Explain analyze select c1 from tbl_perf where
c1 > calc_factorial(29700000,10) and
c2 like '%aa%';
-- to select 10% of rows, the below query can be used
Explain analyze select c1 from tbl_perf where
c1 > calc_factorial(27000000,10) and
c2 like '%aa%';
-- to select 25% of rows, the below query can be used
Explain analyze select c1 from tbl_perf where
c1 > calc_factorial(22500000,10) and
c2 like '%aa%';
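As a quick check of the arithmetic behind these thresholds (c1 runs from 1 to 30,000,000, so c1 > 29,700,000 passes 300,000 rows, i.e. 1% of the table), the selectivity can be verified directly:
select count(*) * 100.0 / 30000000 as pct_selected
from tbl_perf where c1 > 29700000;
-- expected to return approximately 1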
Performance Data:
1. With an increase in the degree of parallelism (more parallel workers), the time to complete
the execution reduces (a sketch of such a sweep follows this list).
2. Along with the workers, the master backend also participates in the execution, due to which
you can see more time reduction in some cases.
3. After a certain point, increasing the max parallel degree won't help.
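The kind of sweep behind these observations can be reproduced with something like the following (a sketch; the exact timings depend entirely on the hardware):
set max_parallel_degree = 0;   -- serial baseline
Explain analyze select c1 from tbl_perf where
c1 > calc_factorial(29700000,10) and c2 like '%aa%';
set max_parallel_degree = 8;   -- repeat with increasing worker counts
Explain analyze select c1 from tbl_perf where
c1 > calc_factorial(29700000,10) and c2 like '%aa%';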
The cases we have seen in this blog are mostly cases where parallel query helps by
using the workers; however, there are some cases, such as when the qualification is very cheap,
where it hurts or won't help even by employing more workers. More investigation is
needed to make sure that the planner won't choose such plans for parallelism.
Comments:
Comment: Just curious, what is the relationship to CPU cores here? Presumably your system under test has 64 cores, commensurate with the DOP being tested?
Reply: The point about CPU cores is that if we employ more workers than the CPU count in the system for a parallel query, it might not help, because the workers will not be able to run in parallel, considering most of the work is CPU-intensive.
As far as the system under test is concerned, it has more than a 64 CPU count; the reason for showing the test with max_parallel_degree as 64 is not related to the CPU count, rather it was just to show the behaviour of parallel scans at different values of parallel degree.
Comment: The non-parallel, "PARALLEL UNSAFE" function query is almost 4x faster than the parallel/safe version... What's going on there? plpgsql issue? Thanks for your work on this! Curious to see what else can be parallel in 9.6 (sorts, joins, ...). :-)
Reply: I had done that test on my local Windows machine without appropriate settings in postgresql.conf. The purpose of that test was just to show the usage of Parallel Safe functions with parallel query. Anyway, today I have done the tests on a Power-8 machine with appropriate settings (shared_buffers=32GB), and the results show that the Parallel Safe version took less than half the time compared to the Parallel Unsafe version. The results of the tests are below:
postgres=# Explain analyze select * from tbl_parallel_test where c1 < calc_factorial(10000, 10) and c2 like '%bb%';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..75155.28 rows=1 width=1008) (actual time=1532.915..1532.915 rows=0 loops=1)
Number of Workers: 5
-> Parallel Seq Scan on tbl_parallel_test (cost=0.00..74155.18 rows=1 width=1008) (actual time=1528.242..9175.564 rows=0 loops=1)
Filter: ((c2 ~~ '%bb%'::text) AND (c1 < calc_factorial(10000, 10)))
Rows Removed by Filter: 1000000
Planning time: 0.078 ms
Execution time: 1562.325 ms
(7 rows)
Alter Function calc_factorial(integer, integer) PARALLEL Unsafe;
postgres=# Explain analyze select * from tbl_parallel_test where c1 < calc_factorial(10000, 10) and c2 like '%bb%';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Seq Scan on tbl_parallel_test (cost=0.00..407853.49 rows=1 width=1008) (actual time=4901.296..4901.296 rows=0 loops=1)
Filter: ((c2 ~~ '%bb%'::text) AND (c1 < calc_factorial(10000, 10)))
Rows Removed by Filter: 1000000
Planning time: 0.096 ms
Execution time: 4901.383 ms
(5 rows)