Robert Haas: Absurd Shared Memory Limits

Thursday, June 28, 2012

Today, I fixed a problem. Or at least, I think I fixed it. Time will tell. But Thom Brown seems pretty happy, and so does Dan Farina. So let me tell you about it. Here's the executive summary: assuming the patch I committed today holds up, PostgreSQL 9.3 will largely eliminate the need to fiddle with operating system shared memory limits.

A PostgreSQL database cluster involves multiple processes - one per session plus a few extras - that need to be able to communicate via a shared memory segment. For historical reasons, most UNIX-like operating systems provide at least three different methods of creating a shared memory segment. In the beginning, there was System V shared memory. Then, a bunch of people got together and decided that they didn't like the System V shared memory interface very much, so they created a new interface called POSIX shared memory which did mostly the same thing - but not quite. Both System V shared memory and POSIX shared memory involve creating named shared memory segments (though they use different naming conventions); to attach to one of these segments, you identify it by name. At some point along the way, it became possible to create shared memory segments in yet another way using a system call named mmap() and passing it the options MAP_SHARED and MAP_ANONYMOUS. Shared memory segments created this way don't have a name, so they can only be shared between a parent process and its descendants.
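To make the third method concrete, here is a minimal sketch of an anonymous shared mapping that a child process inherits across fork(). It is an illustration only, not PostgreSQL code: the helper name share_int_with_child is made up for this example, and it assumes a Linux- or BSD-like system where MAP_ANONYMOUS is available.

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS on glibc in strict -std modes */
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, not PostgreSQL code: the child stores `value` in an
 * anonymous shared mapping and the parent reads it back into *out.
 * Returns 0 on success, -1 on failure.
 */
int share_int_with_child(int value, int *out)
{
    /* No name and no backing file: only this process and its descendants can reach it. */
    int *shared = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED)       /* mmap() reports failure as MAP_FAILED, not NULL */
        return -1;
    *shared = 0;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(shared, sizeof(int));
        return -1;
    }
    if (pid == 0) {
        *shared = value;            /* the child writes through the inherited mapping */
        _exit(0);
    }

    int status;
    int rc = -1;
    if (waitpid(pid, &status, 0) == pid &&
        WIFEXITED(status) && WEXITSTATUS(status) == 0) {
        *out = *shared;             /* the parent sees the child's write */
        rc = 0;
    }
    munmap(shared, sizeof(int));
    return rc;
}
```

Because the mapping has no name, an unrelated process has no way to attach to it later, which is exactly the limitation described above.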

PostgreSQL uses System V shared memory, because it provides a feature that is available via neither of the other two systems: the ability to atomically determine the number of processes attached to the shared memory segment. When the first PostgreSQL process attaches to the shared memory segment, it checks how many processes are attached. If the result is anything other than "one", it knows that there's another copy of PostgreSQL running which is pointed at the same data directory, and it bails out. This is really good, because having two PostgreSQL processes pointed at the same data directory at the same time is a sure way to corrupt your database.
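The attach-count check can be sketched with the standard shmctl() call, whose IPC_STAT command fills in a struct shmid_ds that includes the shm_nattch field. This is a simplified illustration and not the code PostgreSQL uses: the helper names are invented, and the demo uses IPC_PRIVATE, whereas a real interlock would need a key that a second server instance could find.

```c
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/*
 * Hypothetical helper: number of processes currently attached to the
 * System V segment `shmid`, or -1 if the segment cannot be inspected.
 */
long attached_process_count(int shmid)
{
    struct shmid_ds info;

    if (shmctl(shmid, IPC_STAT, &info) == -1)
        return -1;
    return (long) info.shm_nattch;
}

/*
 * Demo (assumption: IPC_PRIVATE stands in for a real, discoverable key).
 * Creates a segment, attaches once, and returns the attach count the
 * kernel reports, or -1 on failure.
 */
long demo_attach_count(void)
{
    int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (shmid == -1)
        return -1;

    void *addr = shmat(shmid, NULL, 0);
    if (addr == (void *) -1) {
        shmctl(shmid, IPC_RMID, NULL);
        return -1;
    }

    long count = attached_process_count(shmid);

    /* Detach and mark the segment for removal so nothing leaks. */
    shmdt(addr);
    shmctl(shmid, IPC_RMID, NULL);
    return count;
}
```

A server following the logic above would treat any count other than one, taken right after its own attach, as a sign that another instance is already using the segment.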

Despite the fact that System V shared memory is the only available shared memory implementation that provides this feature, most operating system vendors frown on it, and encourage users to use the newer POSIX shared memory facilities instead. They do this by limiting the amount of System V shared memory that can be allocated by default to absurdly small values, often 32MB. On some older systems, you actually had to recompile the kernel to raise the limit; thankfully, on modern systems, it's usually as simple as editing /etc/sysctl.conf and running sysctl -f /etc/sysctl.conf to read the updated settings. Still, it poses a needless obstacle for new PostgreSQL users, who now have a choice between (1) terrible database performance and (2) fiddling with kernel settings that they don't understand.

In my opinion, the decision to tightly limit the amount of System V shared memory that can be allocated is a poor one on the part of OS vendors. POSIX shared memory and mmap's anonymous shared memory have much higher limits, or none at all; and as far as I can see, limiting System V shared memory makes things inconvenient for users of programs like PostgreSQL without any compensating advantage.

The good news is that the above-mentioned commit contains a workaround. We allocate a very small System V shared memory segment (48 bytes, on the systems I tested; it could vary slightly by platform) which provides the interlock to prevent multiple instances of PostgreSQL from attaching to the same data directory at the same time, and allocate a large anonymous shared memory block for everything else. Assuming the patch doesn't get reverted for one reason or another, this means that in PostgreSQL 9.3 it will be possible to start PostgreSQL on all platforms I'm familiar with - using an arbitrarily high shared_buffers setting - without any adjustment of default operating system limits. That should hopefully make things easier for first-time users.
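The shape of the workaround can be sketched as follows. This is a rough illustration under stated assumptions, not the committed patch: the HybridShmem struct and both function names are hypothetical, IPC_PRIVATE stands in for whatever key the server really uses, and the 48-byte size is simply the figure quoted above.

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS on glibc in strict -std modes */
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/shm.h>

/* Hypothetical bookkeeping struct; the names are illustrative only. */
typedef struct {
    int    shmid;       /* tiny System V segment, used only as the interlock */
    void  *interlock;   /* address where that segment is attached */
    void  *main_area;   /* large anonymous mapping for everything else */
    size_t main_size;   /* size of main_area in bytes */
} HybridShmem;

/* Returns 0 on success, -1 on failure (nothing is left allocated on failure). */
int hybrid_shmem_create(HybridShmem *h, size_t main_size)
{
    /* A few dozen bytes fit under even the smallest default SysV limits. */
    h->shmid = shmget(IPC_PRIVATE, 48, IPC_CREAT | 0600);
    if (h->shmid == -1)
        return -1;

    h->interlock = shmat(h->shmid, NULL, 0);
    if (h->interlock == (void *) -1) {
        shmctl(h->shmid, IPC_RMID, NULL);
        return -1;
    }

    /* The bulk of the memory is not subject to the SysV limits. */
    h->main_area = mmap(NULL, main_size, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (h->main_area == MAP_FAILED) {   /* compare against MAP_FAILED, not NULL */
        shmdt(h->interlock);
        shmctl(h->shmid, IPC_RMID, NULL);
        return -1;
    }
    h->main_size = main_size;
    return 0;
}

/* Releases both regions created by hybrid_shmem_create(). */
void hybrid_shmem_destroy(HybridShmem *h)
{
    munmap(h->main_area, h->main_size);
    shmdt(h->interlock);
    shmctl(h->shmid, IPC_RMID, NULL);
}
```

Requesting, say, 64MB for main_size in this sketch should succeed on a system whose System V limit is 32MB, because only the 48-byte segment counts against that limit.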

Here's a link to the thread on pgsql-hackers, for those wanting to read more.

39 comments:

Jon Jensen June 28, 2012 9:24 PM
That's excellent, Robert. If this works out it will save people a lot of grief in coming years!

Unknown June 28, 2012 9:36 PM
It sounds very good. I hate when small projects depend on custom kernel settings. With large projects you usually have to do some kernel tuning, but if your project is on a single node machine and not rocket science, it must just work out of the box.

vdp June 29, 2012 6:12 AM
Nice one :)

Is the performance profile the same for all types of shared memory ?

Anonymous June 29, 2012 10:16 AM
It seems the small limit was put in place to encourage developers to stop using it for some reason. It worked in this case.

Peter van Hardenberg June 29, 2012 12:34 PM
<3

Rob June 29, 2012 4:43 PM
LIKE

This really is one of the more annoying issues to explain about PostgreSQL.

Pedro Larroy June 29, 2012 5:33 PM
can't a semaphore or an exclusive lock be used for this very same purpose?

Anonymous June 29, 2012 5:45 PM
Hmm. Am I missing something?

Why not just use the single memory mapped segment using mmap, and use the first long word as an atomic counter? Your operations on that counter must, however, be atomic.

Unless I've got something wrong here - you won't need to have System V shared memory at all anymore.

Is it a portability issue?

Anonymous June 29, 2012 11:12 PM
Keep up the awesome work Robert!

Josh Kupershmidt June 29, 2012 11:31 PM
I'm quite happy about this change, thanks Robert for seeing it through! I've often made the mistake of "fixing" a PG server which wouldn't start due to SHMMAX/SHMALL by using only sysctl -w ..., only to have the same problem a few months later after the machine reboots itself, because I neglected to update sysctl.conf as well, or made a typo in sysctl.conf.

And your post is actually the first time I've seen the tip of using sysctl -f /etc/sysctl.conf, instead of the steps our docs recommend for Linux. Hrmph, the -f flag isn't even in my man page for sysctl.

Alexander Pyhalov June 30, 2012 3:50 AM
Thank you, it's a really annoying task to set these limits correctly in jailed or OpenVZ environments...

Nicolas Grilly June 30, 2012 11:02 AM
Thank you Robert. That's a very nice improvement, especially for new users.

Davi Arnaut July 02, 2012 2:09 AM
On error, mmap() returns MAP_FAILED. Also, mmap() can potentially return NULL, so AnonymousShmem should be initialized with and checked against MAP_FAILED.

glyph July 06, 2012 8:12 PM
I have seen these complaints before (about locking the data directory) but I have never really understood them.

Why bother with a tiny SysV shared memory chunk at all? Why not just use a filesystem lock with either flock() or symlink() like... pretty much every other daemon process in the universe?

I'm not saying I'm sure there's no good reason, I've just never seen one mentioned in the previous mailing list threads.

Anonymous July 06, 2012 11:29 PM
Given that SysV and POSIX shared memory chunks have named identifiers, I'm not sure why you'd compare them to mmap() with MAP_ANONYMOUS. The logical equivalent is an mmap()'d file on the actual file system, not an anonymous chunk of mmap()'d memory.

Unknown September 01, 2012 10:54 PM
Hmm. Any chance of a fast 9.3 release? =)

sjg September 10, 2012 9:08 PM
SysV shared memory has a fixed overhead, as do many other things in the kernel. The higher you raise those limits, the higher your fixed overhead becomes. The reason we as OS vendors do not ship with the ability to use many gigabyte SysV shared memory segments by default (historically) is that few people use it and we do not want to put the burden of that fixed overhead on everyone who does not need it.

This patch will reduce performance outright on BSD kernels for users who previously leveraged the shm_use_phys optimization (pretty much everyone who runs a serious database) because the kernel will have to manage pv entries for all of those mmap'd pages. It will also create additional memory pressure on those systems because more pv entries will need to be allocated.

feld September 10, 2012 9:11 PM
On FreeBSD you have to enable shared memory for jails if you want to jail your postgres process, and you can't have more than one postgres instance in a jail. It's always been considered insecure because the system-v shared memory data is readable by all jails. Is this going to solve the FreeBSD jails problem?

http://lists.freebsd.org/pipermail/freebsd-jail/2008-January/000149.html

sjg September 10, 2012 9:16 PM
There is a very good reason we OS vendors do not ship with SysV default limits high enough to run a serious PostgreSQL database. There is very little software that uses SysV in any serious way other than PostgreSQL and there is a fixed overhead to increasing those limits. You end up wasting RAM for all the users who do not need the limits to be that high. That said, you are late to the party here, vendors have finally decided that the fixed overheads are low enough relative to modern RAM sizes that the defaults can be raised quite high, DragonFly BSD has shipped with greatly increased limits for a year or so and I believe FreeBSD also.

There is a serious problem with this patch on BSD kernels. All of the BSD sysv implementations have a shm_use_phys optimization which forces the kernel to wire up memory pages used to back SysV segments. This increases performance by not requiring the allocation of pv entries for these pages and also reduces memory pressure. Most serious users of PostgreSQL on BSD platforms use this well-documented optimization. After switching to 9.3, large and well optimized Pg installations that previously ran well in memory will be forced into swap because of the pv entry overhead.

Anonymous September 11, 2012 12:28 PM
finally, well done Robert.

Anonymous September 14, 2012 10:53 AM
I am hoping this will help alleviate the other long standing problem: OOM Killer's badness() choosing Postgres unwisely [1]. Or am I hoping for too much?

[1] http://thoughts.j-davis.com/2009/11/29/linux-oom-killer/

Anonymous November 04, 2012 3:47 AM
Given the description this is essentially a "political" patch, and we all know that politics are finicky and have little to do with technical merits. On the technical side this means pgsql now depends on two out of three available ways to do the same thing, which actually enlarges the political vulnerability surface --maybe some wit will insist on disabling mmap for yet another political stance-- so it'd be nice to be able to do just that, should the resident master tuner want to.

Given that the goal is to not require arcane tuning knowledge for non-tuners, there shouldn't be anything against the ability to jump back to pre-mmap using some tunable, and the dubious joy of tuning sysv shmem parameters.

Of course there's technical elegance in the One True Solution, but that's not a valid argument for patches that are essentially political in nature. Politics are messy, so don't go trying to impose neatness where flexibility is worth that much more, should you suddenly and unexpectedly need it.

Jacob November 05, 2012 9:06 AM
Robert

What's your thought on the Postgres 9.3-dev performance on DragonFlyBSD?

Also, any chance we could get Postgres 9.3-dev tested with the lseek fix to show as a comparison to Scientific Linux 6.2?

Jacob November 05, 2012 9:17 AM
I forgot to link to the DragonflyBSD / Postgres benchmark.

http://lists.dragonflybsd.org/pipermail/users/attachments/20121010/7996ff88/attachment-0002.pdf