I have run some benchmark tests to see the write performance/scalability in
PostgreSQL 9.5 and thought it would be good to share the results with others,
so I am writing this blog post.
I ran pgbench tests (TPC-B (sort of) load) to compare the performance
difference between different modes and scale factors on HEAD (e5f455f5), on an
IBM POWER-8 machine with 24 cores, 192 hardware threads, and 492GB of RAM.
Here are the performance results.
Some of the non-default settings used in all the tests are:
min_wal_size = 15GB
max_wal_size = 20GB
checkpoint_timeout = 35min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
autovacuum = off
I have kept autovacuum off to reduce the fluctuation it causes, and I drop and
re-create the database after each run. I have kept high values of min_wal_size
and max_wal_size to reduce the effect of checkpoints; somewhat lower values
could probably have served the purpose for this workload, but I haven't tried
them.
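Roughly, each run is driven along the following lines (the database name and
the exact pgbench options shown here are only illustrative, not the precise
ones behind the numbers above):

dropdb pgbench_test                                      # placeholder database name
createdb pgbench_test
pgbench -i -s 300 pgbench_test                           # initialize with scale factor 300 (or 3000)
pgbench -M prepared -c 64 -j 64 -T 1800 pgbench_test     # illustrative options; run at the desired client count

The drop/re-create before initialization keeps each run starting from the
same state.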
The data is mainly taken for two modes (synchronous_commit = on | off) and at
two different scale factors, to cover the case when all the data fits in shared
buffers (scale_factor = 300) and the case when all the data can't fit in shared
buffers but can fit in RAM (scale_factor = 3000).
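Between the two sets of runs, the commit mode can simply be flipped, for
example like this (one way of doing it, shown only as an illustration):

psql -c "ALTER SYSTEM SET synchronous_commit = off;"     # or on
psql -c "SELECT pg_reload_conf();"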
First, let's talk about the synchronous_commit = off case. Here, when all the
data fits in shared_buffers (scale_factor = 300), we can see scalability up to
a client count of 64, with TPS at 64 clients being approximately 75 percent
higher than at 8 clients, which doesn't look bad. When all the data doesn't fit
in shared buffers but fits in RAM (scale_factor = 3000), we can see scalability
up to a client count of 32, with TPS being 64 percent higher than at 8 clients,
and it falls off from there on.
One major difference for writes when the data doesn't fit in shared_buffers is
that backends performing transactions need to write dirty buffers themselves
when they are unable to find a clean buffer into which to read a page, and this
can hamper the TPS.
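One way to get a rough idea of how much of this backend writing is going on
(not something I have quantified for this post, just a way to check) is
pg_stat_bgwriter, where buffers_backend counts buffers written by the backends
themselves:

psql -c "SELECT buffers_checkpoint, buffers_clean, buffers_backend FROM pg_stat_bgwriter;"

Comparing buffers_backend against buffers_clean and buffers_checkpoint for the
two scale factors would show how much of the write work has shifted onto the
backends.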
Now let's talk about the synchronous_commit = on case. Here, when all the data
fits in shared_buffers (scale_factor = 300), we can see scalability up to a
client count of 64, with TPS at 64 clients being approximately 189 percent
higher than at 8 clients, which sounds good. When all the data doesn't fit in
shared buffers but fits in RAM (scale_factor = 3000), we see a pretty flat
graph, with some scalability up to a client count of 16 (TPS approximately 22
percent higher than at 8 clients), after which it stays about the same.
One point to note here is that when the data fits in shared_buffers
(scale_factor = 300), the TPS at the higher client count (64) in
synchronous_commit = on mode becomes equivalent to the TPS in
synchronous_commit = off mode, which suggests that there is no major contention
due to WAL writing in such loads.
In the synchronous_commit = on case, when the data doesn't fit in
shared_buffers (scale_factor = 3000), the TPS is quite low. One reason is that
backends might be performing writes themselves, but I am not sure the
performance is this low just for that reason, as I have tried different values
of the bgwriter-related parameters (bgwriter_delay, bgwriter_lru_maxpages,
bgwriter_lru_multiplier) and there is not much difference.
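For reference, making the bgwriter more aggressive amounts to changing settings
along these lines (the particular values below are only illustrative, not the
exact ones I tried):

psql -c "ALTER SYSTEM SET bgwriter_delay = '10ms';"            # illustrative value
psql -c "ALTER SYSTEM SET bgwriter_lru_maxpages = 1000;"       # illustrative value
psql -c "ALTER SYSTEM SET bgwriter_lru_multiplier = 10.0;"     # illustrative value
psql -c "SELECT pg_reload_conf();"

The idea is to have the bgwriter clean buffers ahead of the backends so that
they find clean buffers more often, but as noted above it didn't change the
picture much here.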
As per my knowledge, the locks that can lead to contention for this workload
are:
a. ProcArrayLock (used for taking snapshot and at transaction commit)
b. WALWriteLock (used for performing WAL writes)
c. CLOGControlLock (used to read and write transaction status)
d. WALInsertLocks (used for writing data to WAL buffer)
I think among these, ProcArrayLock and WALWriteLock are the most likely
candidates to be the source of contention, but I haven't done any deep analysis
to confirm that.
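When I do get to that analysis, the obvious starting point would be profiling
the server while the benchmark is running, for example with perf (a sketch of
the approach only; any LWLock contention would show up as time spent in the
LWLock acquire/wait paths):

perf record -a -g -- sleep 30     # sample the whole machine for 30 seconds during the run
perf report

A PostgreSQL build with -DLWLOCK_STATS is another option, as it reports
per-LWLock acquisition and block counts when backends exit.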
Now, the bottleneck could be due to multiple locks, as was the case for read
operations (which I explained in my previous blog post on read scalability), or
it could be due to one of these locks alone. I think all of this needs further
analysis and work. That's all I want to say for now.