-ck hacking: linux

Monday, 12 December 2016

linux-4.9-ck1, MuQSS version 0.150

Announcing a new -ck release, 4.9-ck1, with a new version of the Multiple Queue Skiplist Scheduler, version 0.150. These are patches designed to improve system responsiveness and interactivity with specific emphasis on the desktop, but configurable for any workload.

linux-4.9-ck1

-ck1 patches:
http://ck.kolivas.org/patches/4.0/4.9/4.9-ck1/

Git tree:
https://github.com/ckolivas/linux/tree/4.9-ck

Ubuntu 16.04 LTS packages:
http://ck.kolivas.org/patches/4.0/4.9/4.9-ck1/Ubuntu16.04/

MuQSS

Download:
4.9-sched-MuQSS_150.patch

Git tree:
4.9-muqss

MuQSS 0.150 updates

Regarding MuQSS, apart from a resync to linux-4.9, which has numerous hotplug and cpufreq changes (again!), I've cleaned up the patch to not include any Hz changes of its own, leaving Hz changes up to users to choose, unless they use the -ck patchset.
Additionally, I've modified sched_yield yet again. Since expected behaviour is different for different (inappropriate) users out there of sched_yield, I've made it tunable in /proc/sys/kernel/yield_type and changed the default to what I believe should happen. From the documentation I added in Documentation/sysctl/kernel.txt:

yield_type: (MuQSS CPU scheduler only)

This determines what type of yield calls to sched_yield will perform.

0: No yield.
1: Yield only to better priority/deadline tasks. (default)
2: Expire timeslice and recalculate deadline.

Previous versions of MuQSS defaulted to type 2 above. If you find behavioural regressions with any of your workloads try switching it back to 2.
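
A minimal sketch of checking and changing this tunable from C, assuming a kernel that actually exposes /proc/sys/kernel/yield_type (MuQSS 0.150 or later). The helper names read_tunable() and write_tunable() are made up for illustration and are not part of any patch; they take the path as a parameter so they can be tried against an ordinary file first.

```c
#include <stdio.h>

/* Hypothetical helper: read one integer from a sysctl-style file such as
 * /proc/sys/kernel/yield_type. Returns 0 on success, -1 on failure. */
int read_tunable(const char *path, long *value)
{
    FILE *fp = fopen(path, "r");
    int ok;

    if (!fp)
        return -1;
    ok = (fscanf(fp, "%ld", value) == 1);
    fclose(fp);
    return ok ? 0 : -1;
}

/* Hypothetical helper: write one integer to the same kind of file.
 * Writing under /proc/sys normally needs root. */
int write_tunable(const char *path, long value)
{
    FILE *fp = fopen(path, "w");
    int ok;

    if (!fp)
        return -1;
    ok = (fprintf(fp, "%ld\n", value) > 0);
    /* fclose() flushes the buffer, so a rejected write may only show up here. */
    if (fclose(fp) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

Calling write_tunable("/proc/sys/kernel/yield_type", 2) as root would switch back to the previous default; writing the value with a shell redirect into the same file does the same job.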

4.9-ck1 updates

Apart from resyncing with the latest trees from linux-bfq and wb-buf-throttling:
- Added a new kernel configuration option to enable threaded IRQs and set it by default
- Changed Hz to default to the safe 100 value, removing 128 which caused spurious issues and had no real world advantage.
- Fixed a build for muqss disabled (why would you use -ck and do that I don't know)
- Made hrtimers not be used if we know we're in suspend, which may have caused suspend failures for drivers that did not use correct freezable vs normal timeouts
- Enabled bfq and set it to default
- Enabled writeback throttling by default

Enjoy!
お楽しみ下さい
-ck

Tuesday, 22 November 2016

linux-4.8-ck8, MuQSS version 0.144

Here's a new release to go along with and commemorate the 4.8.10 stable release (they're releasing stable releases faster than my development code now.)

linux-4.8-ck8 patch:
patch-4.8-ck8.lrz

MuQSS by itself:
4.8-sched-MuQSS_144.patch

There are a small number of updates to MuQSS itself.
Notably there's an improvement in interactive mode when SMT nice is enabled and/or realtime tasks are running, or there are users of CPU affinity. Tasks previously would not schedule on CPUs when they were stuck behind those as the highest priority task and it would refuse to schedule them transiently.
The old hacks for CPU frequency changes from BFS have been removed, leaving the tunables to default as per mainline.
The default of 100Hz has been removed, but in its place a new and recommended 128Hz has been implemented - this is just a silly microoptimisation to take advantage of the fast shifts that /128 has on CPUs compared to /100, and is close enough to 100Hz to behave otherwise the same.
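
To make the /128 versus /100 point concrete, here is a small sketch. The function names and the ticks-to-seconds conversion are chosen purely for illustration; which kernel paths actually divide by Hz is not something this example tries to show.

```c
#include <stdint.h>

#define HZ_128 128u
#define HZ_100 100u

/* 128 is 2^7, so dividing an unsigned tick count by it is a 7-bit right
 * shift. This returns exactly the same value as ticks / HZ_128. */
static inline uint64_t ticks_to_secs_128(uint64_t ticks)
{
    return ticks >> 7;
}

/* 100 is not a power of two, so the compiler has to emit a real division,
 * or a multiply-and-shift sequence standing in for one. */
static inline uint64_t ticks_to_secs_100(uint64_t ticks)
{
    return ticks / HZ_100;
}
```

The saving per operation is tiny, which is why it only counts as a microoptimisation.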

For the -ck patch only, I've reinstated updated and improved versions of the high resolution timeouts. These improve the behaviour of userspace that is inappropriately Hz dependent, allowing low Hz choices to not affect latency.
Additionally by request I've added a couple of tunables to adjust the behaviour of the high res timers and timeouts.
/proc/sys/kernel/hrtimer_granularity_us
and
/proc/sys/kernel/hrtimeout_min_us

Both of these are in microseconds and can be set from 1-10,000. The first is how accurate high res timers will be in the kernel and is set to 100us by default (on mainline it is Hz accuracy).
The second is how small to make a request for a "minimum timeout" generically in all kernel code. The default is 1000us (on mainline it is one tick).

I doubt you'll find anything useful by tuning these but feel free to go nuts. Decreasing the second tunable much further risks breaking some driver behaviour.

Enjoy!
お楽しみ下さい
-ck

Saturday, 29 October 2016

linux-4.8-ck5, MuQSS version 0.120

Announcing a new version of MuQSS and a -ck release to go with it in concert with mainline releasing 4.8.5

4.8-ck5 patchset:
http://ck.kolivas.org/patches/4.0/4.8/4.8-ck5/

MuQSS by itself for 4.8:
4.8-sched-MuQSS_120.patch

MuQSS by itself for 4.7:
4.7-sched-MuQSS_120.patch

Git tree:
https://github.com/ckolivas/linux

This is a fairly substantial update to MuQSS which includes bugfixes for the previous version, performance enhancements, new features, and completed documentation. This will likely be the first publicly announced version on LKML.

EDIT: Announce here: LKML

New features:
- MuQSS is now a tickless scheduler. That means it can maintain its guaranteed low latency even in a build configured with a low Hz tick rate. To that end, it is now defaulting to 100Hz, and it is recommended to use this as the default choice for it leads to more throughput and power savings as well.
- Improved performance for single threaded workloads with CPU frequency scaling.
- Full NoHZ now supported. This disables ticks on busy CPUs instead of just idle ones. Unlike mainline, MuQSS can do this virtually all the time, regardless of how many tasks are currently running. However this option is for very specific use cases (compute servers running specific workloads) and not for regular desktops or servers.
- Numerous other configuration options that were previously disabled from mainline are now allowed again (though not recommended for regular users.)
- Completed documentation can now be found in Documentation/scheduler/sched-MuQSS.txt
Bugfixes:
- Fix for the various stalls some people were still experiencing, along with the softirq pending warnings.
- Fix for some loss of CPU for heavily sched_yielding tasks.
- Fix for the BFQ warning (-ck only)

Enjoy!
お楽しみ下さい
-ck

Monday, 24 October 2016

linux-4.8-ck4, MuQSS CPU scheduler v0.116

Yet another bugfix release for MuQSS and the -ck patchset with one of the most substantial latency fixes yet. Everyone should upgrade if they're on a previous 4.8 patchset of mine. Sorry about the frequency of these releases but I just can't allow a known buggy release to be the latest version.

4.8-ck4 patchset:
http://ck.kolivas.org/patches/4.0/4.8/4.8-ck4/

MuQSS by itself for 4.8:
4.8-sched-MuQSS_116.patch

MuQSS by itself for 4.7:
4.7-sched-MuQSS_116.patch

I'm hoping this is the release that allows me to not push any more -ck versions out till 4.9 is released since it addresses all remaining issues that I know about.

A lingering bug that has been troubling me for some time was leading to occasional massive latencies and thanks to some detective work by Serge Belyshev I was able to narrow it down to a single line fix which dramatically improves worst case latency when measured. Throughput is virtually unchanged. The flow-on effect to other areas was also apparent with sometimes unused CPU cycles and weird stalls on some workloads.

Sched_yield was reverted to the old BFS mechanism again which GPU drivers prefer but it wasn't working previously on MuQSS because of the first bug. The difference is substantial now and drivers (such as nvidia proprietary) and apps that use it a lot (such as the folding @ home client) behave much better now.

The late introduced bugs that got into ck3/muqss115 were reverted.

The results come up quite well now with interbench (my latency under load benchmark) which I have recently updated and should now give sensible values:

https://github.com/ckolivas/interbench

If you're baffled by interbench results, the most important number is %deadlines met which should be as close to 100% as possible followed by max latency which should be as low as possible for each section. In the near future I'll announce an official new release version.

Pedro in the comments section previously was using runqlat from bcc tools to test latencies as well, but after some investigation it became clear to me that the tool was buggy and did not work properly with bfs/muqss either so I've provided a slightly updated version here which should work properly:

runqlat.py

Enjoy!
お楽しみ下さい
-ck

Friday, 21 October 2016

linux-4.8-ck2, MuQSS version 0.114

Announcing an updated version, and the first -ck release with MuQSS as the scheduler, officially retiring BFS from further development, in line with the diminished rate of bug reports with MuQSS. It is clear that the little attention BFS had received over the years, apart from rushed synchronisation with mainline, had caused a number of bugs to creep in, and MuQSS is basically a rewritten evolution of the same code, so it makes no sense to maintain both.

http://ck.kolivas.org/patches/4.0/4.8/4.8-ck2/

MuQSS version 0.114 by itself:

4.8-sched-MuQSS_114.patch

Git tree includes branches for MuQSS and -ck:

https://github.com/ckolivas/linux

In addition to the most up to date version of MuQSS replacing BFS, this is the first release with BFQ included. It is configurable and is set by default in -ck though it is entirely optional.

The MuQSS changes since 112 are as follows:
- Added cacheline alignment to atomic variables courtesy of Holger Hoffstätte
- Fixed PPC build courtesy of Serge Belyshev.
- Implemented wake lists for separate CPU packages.
- Send hotplug threads to CPUs even if they're not alive yet since they'll be enabling them.
- Build fixes for uniprocessor.
- A substantial revamp of the sub-tick process accounting, decreasing the number of variables used, simplifying the code, and increasing the resolution to nanosecond accounting. Now even tasks that run for less than 100us will not escape visible accounting.

This release should bring slightly better performance, more so on multi-cpu machines, and fairer accounting/latency.

Enjoy!
お楽しみ下さい
-ck

Tuesday, 18 October 2016

First MuQSS Throughput Benchmarks

The short version graphical summary:

Red = MuQSS 112 interactive off
Purple = MuQSS 112 interactive on
Blue = CFS

The detail:
http://ck.kolivas.org/patches/muqss/Benchmarks/20161018/

I went on a journey looking for meaningful benchmarks to conduct to assess the scalability aspect as far as I could on my own 12x machine and was really quite depressed to see what the benchmark situation on linux is like. Only the old and completely invalid benchmarks seem to still be hanging around in public sites and promoted, like Reaim, aim7, dbench, volanomark, etc. and none of those are useful scalability benchmarks. Even more depressing was that the only ones with any reputation are actually commercial benchmarks costing hundreds of dollars.

This made me wonder out loud just how the heck mainline is even doing scalability improvements if there are precious few valid benchmarks for linux and no one's using them. The most promising ones, like mosbench, need multiple machines and quite a bit of set up to get them going.

I spent a day wading through the phoronix test suite - a site and its suite not normally known for meaningful high performance computing discussion and benchmarks - looking for benchmarks that could be used for meaningful results for multicore scalability assessment and were not too difficult to deploy and came up with the following collection:

John The Ripper - a CPU bound application that is threaded to the number of CPUs and intermittently drops to one thread making for slightly more interesting behaviour than just a fully CPU bound workload.

7-Zip Compression - a valid real world CPU bound application that is threaded but rarely able to spread out to all CPUs making it an interesting light load benchmark.

ebizzy - This emulates a heavy content delivery server load which scales beyond the number of CPUs and emulates what goes on between a http server and database.

Timed Linux Kernel Compilation - A perennial favourite because it is a real world case and very easy to reproduce. Despite numerous complaints about its validity as a benchmark, it is surprisingly consistent in its results and tests many facets of scalability, though does not scale to use all CPUs at all time either.

C-Ray - A ray tracing benchmark that uses massive threading per CPU and is completely CPU bound but overloads all CPUs.

Primesieve - A prime number generator that is threaded to the number of CPUs exactly, is fully CPU bound and is cache intensive.

PostgreSQL pgbench - A meaningful database benchmark that is done at 3 different levels - single threaded, normal loaded and heavily contended, each testing different aspects of scalability.

And here is a set of results comparing 4.8.2 mainline (labelled CFS), MuQSS 112 in interactive mode (MuQSS-int1) and MuQSS 112 in non-interactive mode (MuQSS-int0):

http://ck.kolivas.org/patches/muqss/Benchmarks/20161018/

It's worth noting that there is quite a bit of variance in these benchmarks and some are bordering on the difference being just noise. However there is a clear pattern here - when the load is light, in terms of throughput, CFS outperforms MuQSS. When load is heavy, the heavier it gets, MuQSS outperforms CFS, especially in non-interactive mode. As a friend noted, for the workloads where you wouldn't be running MuQSS in interactive mode, such as a web server, database etc, non-interactive mode is of clear performance benefit. So at least on the hardware I had available to me, on a 12x machine, MuQSS is scaling better than mainline on these workloads as load increases.

The obvious question people will ask is why MuQSS doesn't perform better at light loads, and in fact I have an explanation. The reason is that mainline tends to cling to processes much more so that if it is hovering at low numbers of active processes, they'll all cluster on one CPU or fewer CPUs than being spread out everywhere. This means the CPU benefits more from the turbo modes virtually all newer CPUs have, but it comes at a cost. The latency to tasks is greater because they're competing for CPU time on fewer busy CPUs rather than spreading out to idle cores or threads. It is a design decision in MuQSS, as taken from BFS, to always spread out to any idle CPUs if they're available, to minimise latency, and that's one of the reasons for the interactivity and responsiveness of MuQSS. Of course I am still investigating ways of closing that gap further.

Hopefully I can get some more benchmarks from someone with even bigger hardware, and preferably with more than one physical package since that's when things really start getting interesting. All in all I'm very pleased with the performance of MuQSS in terms of scalability on these results, especially assuming I'm able to maintain the interactivity of BFS which were my dual goals.

There is MUCH more to benchmarking than pure throughput of CPU - which is almost the only thing these benchmarks are checking - but that's what I'm interested in here. I hope that providing my list of easy to use benchmarks and the reasoning behind them can generate interest in some kind of meaningful standard set of benchmarks. I did start out in kernel development originally after writing and being a benchmarker :P

To aid that, I'll give simple instructions here for how to ~imitate the benchmarks and get results like I've produced above.

Download the phoronix test suite from here:
http://www.phoronix-test-suite.com/

The generic tar.gz is perfectly fine. Then extract it and install the relevant benchmarks like so:


tar xf phoronix-test-suite-6.6.1.tar.gz

cd phoronix-test-suite

./phoronix-test-suite install build-linux-kernel c-ray compress-7zip ebizzy john-the-ripper pgbench primesieve

./phoronix-test-suite default-run build-linux-kernel c-ray compress-7zip ebizzy john-the-ripper pgbench primesieve

Now obviously this is not ideal since you shouldn't run benchmarks on a multiuser login with Xorg and all sorts of other crap running so I actually always run benchmarks at init level 1.

Enjoy!
お楽しみ下さい
-ck

Tuesday, 11 October 2016

MuQSS - The Multiple Queue Skiplist Scheduler v0.111

Lots of bugfixes, lots of improvements, build fixes, you name it.

For 4.8:
4.8-sched-MuQSS_111.patch

For 4.7:
4.7-sched-MuQSS_111.patch

And in a complete departure from BFS, a git tree (which suits constant development like this, unlike BFS's stable release massive ports):

https://github.com/ckolivas/linux

Look in the pending/ directory to see all the patches that went into this or read the git changelog. In particular numerous warnings were fixed, throughput improved compared to 108, SCHED_ISO was rewritten for multiple queues, potential races/crashes were addressed, and build fixes for different configurations were committed.

I haven't been able to track the bizarre latency issues reported by runqlat and when I try to reproduce it myself I get nonsense values of latency greater than the history of the earth so I suspect an interface bug with BPF reporting values. It doesn't seem to affect actual latency in any way.

EDIT: Updated to version 0.111 which has a fix for suspend/resume.

Enjoy!
お楽しみ下さい
-ck

Friday, 7 October 2016

MuQSS - The Multiple Queue Skiplist Scheduler v0.108

A new version of the MuQSS CPU scheduler

Incrementals and full patches available for 4.8 and 4.7 respectively here:
http://ck.kolivas.org/patches/muqss/4.0/4.8/

http://ck.kolivas.org/patches/muqss/4.0/4.7/

Yet more minor bugfixes and some important performance enhancements.

This version brings to the table the same locking scheme for trying to wake tasks up as mainline which is advantageous on process busy workloads and many CPUs. This is important because the main reason for moving to multiple runqueues was to minimise lock contention for the global runqueue lock that is in BFS (as mentioned here numerous times before) and this wake up scheme helps make the most of the multiple discrete runqueue locks.

Note this change is much more significant than the last releases so new instability is a possibility. Please report any problems or stacktraces!

There was a workload when I started out that I used lockstat to debug to get an idea of how much lock contention was going on and how long it lasted. Originally with the first incarnations of MuQSS on a 14 second benchmark with thousands of tasks on a 12x CPU it obtained 3 million locks and had almost 300k contentions with the longest contention lasting 80us. Now the same workload grabs the lock just 5k times with only 18 contentions in total and the longest lasted 1us.

This clearly demonstrates that the target endpoint for avoiding lock contention has been achieved. It does not translate into performance improvements on ordinary hardware today because you need ridiculous workloads on many CPUs to even begin deriving advantage from it. However as even our phones now have reached 8 logical CPUs, it will only be a matter of time before 16 threads appear on commodity hardware - a complaint that was directed at BFS when it came out 7 years ago but they still haven't appeared just yet. BFS was shown to be scalable for all workloads up to 16 CPUs, and beyond for certain workloads, but suffered dramatically for others. MuQSS now makes it possible for what was BFS to be useful much further into the future.

Again - MuQSS is aimed primarily at desktop/laptop/mobile device users for the best possible interactivity and responsiveness, and is still very simple in its approach to balancing workloads to CPUs so there are likely to be throughput workloads on mainline that outperform it, though there are almost certainly workloads where the opposite is true.

I've now addressed all planned changes to MuQSS and plan to hopefully only look at bug reports instead of further development from here on for a little while. In my eyes it is now stable enough to replace BFS in the next -ck release barring some unexpected showstopper bug appearing.

EDIT: If you blinked you missed the 107 announcement which was shortly superseded by 108.

EDIT2: Always watch the pending directory for updated pending patches to add.
http://ck.kolivas.org/patches/muqss/4.0/4.8/Pending/

Enjoy!
お楽しみ下さい
-ck

Thursday, 5 September 2013

Microsleeps and operating systems

As an anaesthetist, I spend a lot of time and effort dealing with and understanding sleep. So it's mildly amusing that I spend a lot of time dealing with sleeps in various forms in code. Previously it was the effect sleep has on scheduling while developing and working on the linux kernel scheduler, and now it's dealing with writing drivers for various hardware for cgminer.

What I'm predominantly interested in is dealing with microsleeps - which ironically is also the name the airline safety industry calls it when an airline pilot nods off temporarily when they're meant to be awake. I'm sure you've all had that experience at some stage, hopefully not while driving.

Anyway the mining hardware scene for bitcoin has moved to the first generation of ASIC devices, and in that scene, the faster you can market and sell your product, the greater the potential profit for the manufacturers. Thus, not a lot of care often goes into the interface between the ASIC chips and the operating system, leading to really poorly designed MCUs and FPGA firmware. The task is simple enough - send the device some work, get one or more responses back, load more work, rinse and repeat. Some devices don't have buffers for queueing work, and some don't have buffers for responses, leading to scenarios where the time to getting the response and loading more work becomes more and more critical. Some are so badly designed that they have response codes 64 bytes long and send them out on a device with a buffer that only fits 62 bytes. Basically most of them expect you to repeatedly poll the device for results and retrieve them, followed by sending them more work.

Now controlled small amounts of polling have their place in certain circumstances, and busy waiting on a condition is potentially going to be faster than waiting in sleep due to scheduling wake up delays and so on. However this is only for microsecond timeframes and provided you don't need to be doing it continuously, and the small amount of extra power usage over that period is not significant, and you don't get yourself into a priority inversion somehow.

None of the hardware I am dealing with really works in that kind of timeframe, and repeatedly polling would be incredibly wasteful of CPU, and I have a horrible aversion to wasted CPU cycles just asking a device if it's ready or not in an infinite loop. However, because of the lousy designs of some of this hardware, we are dealing with sleeps in the order of 1-40ms. So it's just outside the microsecond resolution time frames, but only just in the worst case scenario. Those of you who've coded sleeps of this size would know that the jitter in the kernel timers alone is often in the order of 2ms, and scheduling delays can be much larger under load conditions. Some hardware is much worse, and some operating systems (eg windows) by default have only 15ms granularity unless you force it to operate at higher resolution.

Lots of hardware has asynchronous mechanisms so it can get far more complicated, but we can demonstrate the issues even with the simplest of designs, so let's assume a simple 1 work item, 1 result synchronous design with say a minimum of 10ms between work and results (contrived example for demonstration purposes) on a uniprocessor machine.

So here lies the problem:

1. Send work.
2. Sleep 10ms.
3. Check for results.

Looks simple enough. Even assuming 2ms jitter, the worst thing that can happen is we wait an extra 2ms which is not a profound delay. Let's ignore the jitter for this discussion.

However the following could happen on the uniprocessor machine:

1. Send work.
1a. Application gets descheduled for another process to take its place by the operating system kernel for 10ms.
2. Sleep 10ms.
2a. Application gets descheduled again for another 10ms.
3. Check for results.

Admittedly this is the worst case scenario, but our 10ms wait becomes 30ms, so this is no longer insignificant. Assuming the common scenario is we only sleep 10ms and occasionally 1a (more likely) OR 2a happens, with the worst case scenario almost never happening, we can mitigate the disaster by making the intermediate sleep smaller, to say half of what it was. Now we have something that sleeps somewhere between 5 and 15ms. Not too bad, except that for the common case we are polling twice as often now, and worst case scenario is still a lot longer than we'd like.

Even if we accept this, we encounter a new problem with sleep, assuming we use a readily available accurate timer such as nanosleep(). nanosleep() does not guarantee we will sleep for the amount we asked for: it happily gets interrupted by signals and returns early, reporting how much of the requested time was still left rather than sleeping it out. Therefore we have to handle nanosleep() returning having slept less than asked for, work out how much more we need to sleep, and then run nanosleep() again.

2a. Ask for sleep 5ms
2a1. Signal interrupts sleep
2a2. Returns after 2ms
2b. Calculate we need to sleep another 3ms
2c. Ask to sleep 3ms
etc.
.
.
3. Check for results

Now you'd be real unlucky for this to happen multiple times over, but even this happening once we now have quite a few more potential places where the application can get descheduled, thus making it possible that we make it to step 3 much later than intended.
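
The retry steps above can be sketched as follows. This is a minimal illustration rather than the cgminer code; the function name sleep_ms_retry() is made up, and it assumes a non-negative millisecond count. POSIX nanosleep() fills in its second argument with the unslept remainder when it fails with EINTR, so the loop simply asks again for whatever is left.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

/* Hypothetical helper: sleep for at least ms milliseconds (ms >= 0),
 * restarting after signal interruptions.
 * Returns 0 on success, -1 on any error other than EINTR. */
int sleep_ms_retry(long ms)
{
    struct timespec req, rem;

    req.tv_sec = ms / 1000;
    req.tv_nsec = (ms % 1000) * 1000000L;

    while (nanosleep(&req, &rem) == -1) {
        if (errno != EINTR)
            return -1;
        /* Interrupted by a signal: rem holds the time we did not sleep,
         * so request only that much on the next pass. */
        req = rem;
    }
    return 0;
}
```

Every trip around the loop is another point where the process can be descheduled, which is exactly the weakness described above.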

So what do we do? Do we halve the sleep even further to 2ms? That makes it far less likely we'll get a signal but we're now down to close to the resolution of the kernel sleep timers themselves and we run into a new problem should we run on multiprocessor systems (which for all intents and purposes, pretty much everything is these days). The clock on each CPU is transiently slightly different and the kernel cheats by picking the smallest difference that can guarantee it's not going to sleep too long. In my instrumenting of this, I found that most calls to nanosleep only slept 2/3 of the time I asked for on a quad core CPU.

At 1-2ms sleep times, we are now getting dangerously close to just busy waiting anyway and substantially increase the risk of our application getting descheduled because it is now CPU bound.

So after much angst it was clear that the only way to minimise this problem was to move to absolute timers so that we left it up to the operating system to figure out how much it should (or should not!) sleep for. To guarantee we never slept too much, and allowed ourselves some freedom to poll after the sleep period I initially chose the following:

1. Get time
2. Send work.
3. clock_nanosleep() to an absolute time 10ms after that retrieved in 1.
4. Check for results.

We still have the potential for descheduling and extra delays to occur after 3 and before 4, but most operating system kernels will give precedence to a process that has slept and is waking up so this is actually likely to be relatively rare. It also means should we get descheduled somewhere between 1 and 3, the operating system actually won't put our process to sleep at all.
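
A minimal sketch of step 3, assuming Linux and CLOCK_MONOTONIC (the clock is a choice made for this example and is not stated above). The function names are hypothetical. Note that clock_nanosleep() returns the error number directly instead of setting errno, and that with TIMER_ABSTIME a restart after a signal just reuses the same deadline, so there is no remainder to calculate.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

/* Add ms milliseconds (ms >= 0) to *ts, keeping tv_nsec below one second. */
static void timespec_add_ms(struct timespec *ts, long ms)
{
    ts->tv_sec += ms / 1000;
    ts->tv_nsec += (ms % 1000) * 1000000L;
    if (ts->tv_nsec >= 1000000000L) {
        ts->tv_sec++;
        ts->tv_nsec -= 1000000000L;
    }
}

/* Hypothetical helper: sleep until ms milliseconds after *start, where
 * *start was read from CLOCK_MONOTONIC. Returns 0 or an error number.
 * If the deadline has already passed, the kernel returns straight away. */
int sleep_until_ms_after(const struct timespec *start, long ms)
{
    struct timespec deadline = *start;
    int ret;

    timespec_add_ms(&deadline, ms);
    do {
        ret = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &deadline, NULL);
    } while (ret == EINTR);
    return ret;
}
```

A caller would read the start time with clock_gettime(CLOCK_MONOTONIC, &start) before sending work, then call sleep_until_ms_after(&start, 10) and check for results.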

Only...

clock_nanosleep() being POSIX.1-2001 doesn't guarantee it will work everywhere of course. And indeed this time only linux supported it. Unlike the function calls for anonymous semaphores which I mentioned in my previous post, which were present as blank functions that returned ENOSYS, these functions did not exist at all on OSX. Nor were they present on mingw32 on windows. (This time I will not pass judgement on this...)

Now I am aware there are other timers on the other operating systems, but I needed a workaround that would perform no worse than just calling nanosleep() till I tackled them one operating system at a time, since I don't really know these operating systems intimately. So for the time being what I do on windows and OSX is:

1. Get time.
2. Send work
3. Calculate how long to sleep 10ms relative to time retrieved in 1.
4. Nanosleep that duration.
5. Check for results.

So far this is performing better than using ordinary nanosleep was. Why? I mean it just looks like it is a more complicated way of doing a relative nanosleep and should perform worse. It turns out that step 2 - send work - takes a variable amount of time to perform itself, and we should start timing from when we first started the function call to send work.
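
A sketch of that fallback, using only calls that do not depend on clock_nanosleep(). The time source is an assumption: gettimeofday() is used here because it is widely available, even though it is a wall clock and can jump. The function name sleep_ms_since() is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/time.h>
#include <time.h>

/* Hypothetical helper: sleep until ms milliseconds after *start using only
 * a relative nanosleep(). Returns 0 on success, -1 on error. */
int sleep_ms_since(const struct timeval *start, long ms)
{
    struct timeval now;
    struct timespec req, rem;
    long long elapsed_us, left_us;

    if (gettimeofday(&now, NULL) != 0)
        return -1;
    elapsed_us = (now.tv_sec - start->tv_sec) * 1000000LL +
                 (now.tv_usec - start->tv_usec);
    left_us = ms * 1000LL - elapsed_us;
    if (left_us <= 0)
        return 0; /* sending work already used up the whole interval */

    req.tv_sec = left_us / 1000000;
    req.tv_nsec = (left_us % 1000000) * 1000;
    while (nanosleep(&req, &rem) == -1) {
        if (errno != EINTR)
            return -1;
        req = rem; /* interrupted: sleep only the remainder */
    }
    return 0;
}
```

The time spent sending work is subtracted from the sleep, which is where the improvement over a plain fixed-length nanosleep() comes from.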

Well, I was going to talk about clocks as well, but this post ended up being much longer than I anticipated, so I'll save that for next time (no pun intended).

Monday, 2 September 2013

Unnamed semaphores and POSOSX

During the development of my bitcoin mining software, cgminer, I've used just about every synchronisation primitive due to it being heavily multithreaded. A few months back I used some semaphores and the first thing I reached for was the much more useful unnamed semaphores commonly in use today. Classic SYSV IPC semaphores are limited in number, require allocating of shared memory, stay in use till destroyed or the system rebooted etc. etc., which makes them real awkward to use and far less flexible, so I never even considered using them. For some reason, though, I had a vague memory of trying to use unnamed semaphores on lrzip years ago and deciding not to. Well that memory came back to bite me.

Cgminer is cross platform, working reasonably well on various architectures with Linux, windows (via mingw32) and OSX mainly, though other brave souls have used it on all sorts of things. I've often heard OSX described as the "Fisher-Price" unix, AKA "My first unix", because of the restricted subset of unix capabilities that it has, although I'm led to believe it claimed to have POSIX compliance at some stage - though I never really investigated it nor does it really matter since Linux is only POSIXy at best.

So the interesting thing was that I had written some code for cgminer which used unnamed semaphores and it compiled fine across the 3 main platforms, but it failed miserably when it came to working on OSX. Of note, the unnamed semaphore functions conform to POSIX.1-2001. All of the functions compiled perfectly fine, but the application refused to run properly, and finally when I got some of the OSX users to investigate further, every single unnamed semaphore function, such as sem_init, sem_post, sem_wait etc, would return a unique OSX error which, when deciphered, was actually "Unimplemented feature". Quite amusing that to get POSIX compliance it only had to implement the functions, but not the actual features of those functions... You may go wild with speculation as to why this may be. This is why I coined the term POSOSX.

After toying with the idea of using SYSV semaphores and being disgusted at the thought, I finally decided that I should just implement really basic fake unnamed semaphores using pipes on OSX to imitate their behaviour.

Simplified code from cgminer for OSX follows (real code checks return values etc.):

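#include <fcntl.h>  /* fcntl(), F_GETFD, F_SETFD, FD_CLOEXEC */
#include <unistd.h> /* pipe(), read(), write(), close() */

/* A fake unnamed semaphore: the number of unread bytes in the pipe is the
 * semaphore count. */
struct cgsem {
    int pipefd[2];
};

typedef struct cgsem cgsem_t;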

void cgsem_init(cgsem_t *cgsem)
{
    int flags, fd, i;

    pipe(cgsem->pipefd);

    /* Make the pipes FD_CLOEXEC to allow them to close should we call
    * execv on restart. */
    for (i = 0; i < 2; i++) {
        fd = cgsem->pipefd[i];
        flags = fcntl(fd, F_GETFD, 0);
        flags |= FD_CLOEXEC;
        fcntl(fd, F_SETFD, flags);
    }
}

void cgsem_post(cgsem_t *cgsem)
{
    const char buf = 1;

    write(cgsem->pipefd[1], &buf, 1);
}

void cgsem_wait(cgsem_t *cgsem)
{
    char buf;

    read(cgsem->pipefd[0], &buf, 1);
}

void cgsem_destroy(cgsem_t *cgsem)
{
    close(cgsem->pipefd[1]);
    close(cgsem->pipefd[0]);
}
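
For comparison, here is a rough sketch of what a variant with the return values checked might look like. This is not the actual cgminer code; the psem names are invented for this example, and retrying on EINTR is an assumption about what a caller would want.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical pipe-backed semaphore with error reporting. */
struct psem {
    int pipefd[2];
};

/* Create the pipe and mark both ends close-on-exec. Returns 0 or -1. */
int psem_init(struct psem *s)
{
    int i, flags;

    if (pipe(s->pipefd) == -1)
        return -1;
    for (i = 0; i < 2; i++) {
        flags = fcntl(s->pipefd[i], F_GETFD, 0);
        if (flags == -1 ||
            fcntl(s->pipefd[i], F_SETFD, flags | FD_CLOEXEC) == -1) {
            close(s->pipefd[0]);
            close(s->pipefd[1]);
            return -1;
        }
    }
    return 0;
}

/* Post: write one byte, retrying if a signal interrupts the write. */
int psem_post(struct psem *s)
{
    const char buf = 1;
    ssize_t ret;

    do {
        ret = write(s->pipefd[1], &buf, 1);
    } while (ret == -1 && errno == EINTR);
    return ret == 1 ? 0 : -1;
}

/* Wait: block until one byte can be read, retrying on EINTR. */
int psem_wait(struct psem *s)
{
    char buf;
    ssize_t ret;

    do {
        ret = read(s->pipefd[0], &buf, 1);
    } while (ret == -1 && errno == EINTR);
    return ret == 1 ? 0 : -1;
}

/* Close both ends. Returns 0 only if both close() calls succeeded. */
int psem_destroy(struct psem *s)
{
    int r1 = close(s->pipefd[1]);
    int r0 = close(s->pipefd[0]);

    return (r0 == 0 && r1 == 0) ? 0 : -1;
}
```

Each post adds a byte and each wait consumes one, so posts made before anyone is waiting are not lost. The pipe buffer is finite though, so a post can block once it fills, which a real semaphore would not do.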