-ck hacking: scalability

Showing posts with label scalability. Show all posts

Tuesday, 18 October 2016

First MuQSS Throughput Benchmarks

The short version graphical summary:

Red = MuQSS 112 interactive off
Purple = MuQSS 112 interactive on
Blue = CFS

The detail:
http://ck.kolivas.org/patches/muqss/Benchmarks/20161018/

I went on a journey looking for meaningful benchmarks to conduct to assess the scalability aspect as far as I could on my own 12x machine and was really quite depressed to see what the benchmark situation on linux is like. Only the old and completely invalid benchmarks seem to still be hanging around in public sites and promoted, like Reaim, aim7, dbench, volanomark, etc. and none of those are useful scalability benchmarks. Even more depressing was the only ones with any reputation are actually commercial benchmarks costing hundreds of dollars.

This made me wonder out loud just how the heck mainline is even doing scalability improvements if there are precious few valid benchmarks for linux and no one's using them. The most promising ones, like mosbench, need multiple machines and quite a bit of set up to get them going.

I spent a day wading through the phoronix test suite - a site and its suite not normally known for meaningful high performance computing discussion and benchmarks - looking for benchmarks that could be used for meaningful results for multicore scalability assessment and were not too difficult to deploy and came up with the following collection:

John The Ripper - a CPU bound application that is threaded to the number of CPUs and intermittently drops to one thread making for slightly more interesting behaviour than just a fully CPU bound workload.

7-Zip Compression - a valid real world CPU bound application that is threaded but rarely able to spread out to all CPUs making it an interesting light load benchmark.

ebizzy - This emulates a heavy content delivery server load which scales beyond the number of CPUs and emulates what goes on between a http server and database.

Timed Linux Kernel Compilation - A perennial favourite because it is a real world case and very easy to reproduce. Despite numerous complaints about its validity as a benchmark, it is surprisingly consistent in its results and tests many facets of scalability, though does not scale to use all CPUs at all time either.

C-Ray - A ray tracing benchmark that uses massive threading per CPU and is completely CPU bound but overloads all CPUs.

Primesieve - A prime number generator that is threaded to the number of CPUs exactly, is fully CPU bound and is cache intensive.

PostgreSQL pgbench - A meaningful database benchmark that is done at 3 different levels - single threaded, normal loaded and heavily contended, each testing different aspects of scalability.

And here is a set of results comparing 4.8.2 mainline (labelled CFS), MuQSS 112 in interactive mode (MuQSS-int1) and MuQSS 112 in non-interactive mode (MuQSS-int0):

http://ck.kolivas.org/patches/muqss/Benchmarks/20161018/

It's worth noting that there is quite a bit of variance in these benchmarks and some are bordering on the difference being just noise. However there is a clear pattern here - when the load is light, in terms of throughput, CFS outperforms MuQSS. When load is heavy, the heavier it gets, MuQSS outperforms CFS, especially in non-interactive mode. As a friend noted, for the workloads where you wouldn't be running MuQSS in interactive mode, such as a web server, database etc, non-interactive mode is of clear performance benefit. So at least on the hardware I had available to me, on a 12x machine, MuQSS is scaling better than mainline on these workloads as load increases.

The obvious question people will ask is why MuQSS doesn't perform better at light loads, and in fact I have an explanation. The reason is that mainline tends to cling to processes much more so that if it is hovering at low numbers of active processes, they'll all cluster on one CPU or fewer CPUs than being spread out everywhere. This means the CPU benefits more from the turbo modes virtually all newer CPUs have, but it comes at a cost. The latency to tasks is greater because they're competing for CPU time on fewer busy CPUs rather than spreading out to idle cores or threads. It is a design decision in MuQSS, as taken from BFS, to always spread out to any idle CPUs if they're available, to minimise latency, and that's one of the reasons for the interactivity and responsiveness of MuQSS. Of course I am still investigating ways of closing that gap further.

Hopefully I can get some more benchmarks from someone with even bigger hardware, and preferably with more than one physical package since that's when things really start getting interesting. All in all I'm very pleased with the performance of MuQSS in terms of scalability on these results, especially assuming I'm able to maintain the interactivity of BFS which were my dual goals.

There is MUCH more to benchmarking than pure throughput of CPU - which is almost the only thing these benchmarks is checking - but that's what I'm interested in here. I hope that providing my list of easy to use benchmarks and the reasoning behind them can generate interest in some kind of meaningful standard set of benchmarks. I did start out in kernel development originally after writing and being a benchmarker :P

To aid that, I'll give simple instructions here for how to ~imitate the benchmarks and get results like I've produced above.

Download the phoronix test suite from here:
http://www.phoronix-test-suite.com/

The generic tar.gz is perfectly fine. Then extract it and install the relevant benchmarks like so:


tar xf phoronix-test-suite-6.6.1.tar.gz

cd phoronix-test-suite

./phoronix-test-suite install build-linux-kernel c-ray compress-7zip ebizzy john-the-ripper pgbench primesieve

./phoronix-test-suite default-run build-linux-kernel c-ray compress-7zip ebizzy john-the-ripper pgbench primesieve

Now obviously this is not ideal since you shouldn't run benchmarks on a multiuser login with Xorg and all sorts of other crap running so I actually always run benchmarks at init level 1.

Enjoy!
お楽しみ下さい
-ck

Tuesday, 11 October 2016

MuQSS - The Multiple Queue Skiplist Scheduler v0.111

Lots of bugfixes, lots of improvements, build fixes, you name it.

For 4.8:
4.8-sched-MuQSS_111.patch

For 4.7:
4.7-sched-MuQSS_111.patch

And in a complete departure from BFS, a git tree (which suits constant development like this, unlike BFS's stable release massive ports):

https://github.com/ckolivas/linux

Look in the pending/ directory to see all the patches that went into this or read the git changelog. In particular numerous warnings were fixed, throughput improved compared to 108, SCHED_ISO was rewritten for multiple queues, potential races/crashes were addressed, and build fixes for different configurations were committed.

I haven't been able to track the bizarre latency issues reported by runqlat and when I try to reproduce it myself I get nonsense values of latency greater than the history of the earth so I suspect an interface bug with BPF reporting values. It doesn't seem to affect actual latency in any way.

EDIT: Updated to version 0.111 which has a fix for suspend/resume.

Enjoy!
お楽しみ下さい
-ck

Friday, 7 October 2016

MuQSS - The Multiple Queue Skiplist Scheduler v0.108

A new version of the MuQSS CPU scheduler

Incrementals and full patches available for 4.8 and 4.7 respectively here:
http://ck.kolivas.org/patches/muqss/4.0/4.8/

http://ck.kolivas.org/patches/muqss/4.0/4.7/

Yet more minor bugfixes and some important performance enhancements.

This version brings to the table the same locking scheme for trying to wake tasks up as mainline which is advantageous on process busy workloads and many CPUs. This is important because the main reason for moving to multiple runqueues was to minimise lock contention for the global runqueue lock that is in BFS (as mentioned here numerous times before) and this wake up scheme helps make the most of the multiple discrete runqueue locks.

Note this change is much more significant than the last releases so new instability is a possibility. Please report any problems or stacktraces!

There was a workload when I started out that I used lockstat to debug to get an idea of how much lock contention was going on and how long it lasted. Originally with the first incarnations of MuQSS on a 14 second benchmark with thousands of tasks on a 12x CPU it obtained 3 million locks and had almost 300k contentions with the longest contention lasting 80us. Now the same workload grabs the lock just 5k times with only 18 contentions in total and the longest lasted 1us.

This clearly demonstrates that the target endpoint for avoiding lock contention has been achieved. It does not translate into performance improvements on ordinary hardware today because you need ridiculous workloads on many CPUs to even begin deriving advantage from it. However as even our phones now have reached 8 logical CPUs, it will only be a matter of time before 16 threads appears on commodity hardware - a complaint that was directed at BFS when it came out 7 years ago but they still haven't appeared just yet. BFS was shown to be scalable for all workloads up to 16 CPUs, and beyond for certain workloads, but suffered dramatically for others. MuQSS now makes it possible for what was BFS to be useful much further into the future.

Again - MuQSS is aimed primarily at desktop/laptop/mobile device users for the best possible interactivity and responsiveness, and is still very simple in its approach to balancing workloads to CPUs so there are likely to be throughput workloads on mainline that outperform it, though there are almost certainly workloads where the opposite is true.

I've now addressed all planned changes to MuQSS and plan to hopefully only look at bug reports instead of further development from here on for a little while. In my eyes it is now stable enough to replace BFS in the next -ck release barring some unexpected showstopper bug appearing.

EDIT: If you blinked you missed the 107 announcement which was shortly superseded by 108.

EDIT2: Always watch the pending directory for updated pending patches to add.
http://ck.kolivas.org/patches/muqss/4.0/4.8/Pending/

Enjoy!
お楽しみ下さい
-ck

Saturday, 1 October 2016

MuQSS - The Multiple Queue Skiplist Scheduler v0.105

Announcing a multiple runqueue variant of BFS, with the more mundane name of MuQSS (pronounced mux) for linux 4.7:

Full patch for linux-4.7
4.7-sched-MuQSS_105.patch

Keep watching this blog for newer versions!

Incremental to patch bfs502 to MuQSS 0.1:
bfs502-MuQSS_103.patch

It was inevitable that one day I would find myself tackling the 2 major scalability limitations in BFS and this is the result of it. These two issues were

The single runqueue which means all CPUs would fight for lock contention over the one runqueue, and
The O(n) look up which means linear increase in overhead for task lookups as number of processes increases.

As you're all aware by now, skiplists were recently introduced into BFS to tackle number 2 with a modest improvement in throughput at high loads.

Till now I did not have the energy nor time to try and find a solution for number 1. that maintained BFS' scheduling decision algorithm as the single runqueue was actually the reason latency remains bound and deterministic on BFS, capitalising with more CPUs instead of fighting against them for scalability.

This scheduler variant is an evolution of BFS, which hopefully will be mature enough to replace BFS one day when stability is assured. It is able to still use the same scheduling algorithm as BFS meaning latency and responsiveness remains as good as always, but with the per-CPU runqueue and discrete locking, it also means it will scale to any number of CPUs, as the mainline scheduler does.

It does NOT guarantee the best possible throughput as there still is virtually no complex balancing mechanism whatsoever, selecting tasks according to deadline primarily with only CPU cache distances being used to determine which idle CPU to go to, or in non-interactive mode, which overloaded CPU to pull from to fill an idle CPU.

It would be possible, with a lot of effort, to wedge the entire balancing algorithm for scalability from mainline into this, though it will probably offset the deterministic latency that makes it special.

This is a massive rewrite and consequently there are bound to still be race conditions and hidden bugs though I have been running it for a while now with reasonable stability. I'm putting this out there for the braver people to test. There's a lot more to document about it but for now let's just say, give it a try.

Please don't use any lock debugging as it will light up every possible complaint for the time being!

Regarding 4.8, for the time being I will still be releasing BFS for it and incorporate it into -ck

EDIT: Updated to version 0.105 with significant bugfixes.

Enjoy!
お楽しみ下さい
-ck

Tuesday, 13 September 2016

BFS 497, linux-4.7-ck4

For the first time in a very long time, I'm announcing yet another -ck release up to ck4 along with yet more substantial updates for BFS for linux-4.7 based kernels.

BFS by itself:
4.7-sched-bfs-497.patch

-ck branded linux-4.7-ck4 patches:
linux-4.7-ck4

Thanks(?) to the massive changes to the mainline kernel I'd been forced to rewrite significant components of BFS to work properly with them, specifically the cpu frequency governors. At the same time I've had quite a bit of energy and enthusiasm for working on BFS in a way I haven't had in a long time. As a result, this updated version not only addresses the remaining cgroup stub patch bug (mentioned on the previous announcement) but implements further improvements and clean ups to go with those improvements.

Alas I still have no explanation for the random lockups some people are seeing, but I have seen reports of it happening on mainline kernels as well now, so while I'm always suspicious of my own code, there is also the chance that BFS exacerbates an issue in mainline. Something that appears common is onboard Intel graphics with the Haswell chipset.

Additionally I had reports of people being unable to suspend with BFS from 4.7 but I haven't heard back from them on later versions.

The short summary of improvements in this version are less overhead, higher throughput and less latencies.

I've rewritten the skiplist implementation to not require a malloc/free on insertion/removal of a new node which seemed to noticeably improve throughput at high loads.
Now that CPU frequency governors know what the scheduler is doing, the approach of BFS of old of knowing what the governor was doing and working around it is no longer helpful and I've removed the whole sticky task and offset for throttled CPUs and throughput has actually improved instead.
I've also added some micro-optimisations and cleanups.
I've added a minor change for offlining CPUs to prevent tasks trying to schedule to them.

The set of patches in ck4 is the largest in the ck patchset since the early 2.6 patchset days. I've also included the patch from Alfred (thanks!) to fix the warning that happens with suspend which is mostly harmless.

Each patch included has a mini changelog at the top.

I'm also keen to get feedback from people on if they see any noticeable interactive/responsiveness regressions by disabling the interactive flag as follows:

echo 0 > /proc/sys/kernel/interactive

Enjoy!
お楽しみ下さい
-ck

Tuesday, 29 July 2014

Revisiting URW locks for BFS

A couple of years ago I experimented with upgradeable read-write locks to replace the current spinlock that protects all the data in the global runqueue in BFS. Here is the original description: upgradeable-rwlocks-and-bfs

One of the main concerns when using a single runqueue which BFS does, as opposed to multiple runqueues, as the mainline linux scheduler does, is lock contention, where multiple CPUs can be waiting on getting access to read and/or modify data protected by the lock protecting all the data on the single runqueue. This data is currently protected by a spinlock, but there is clear demarcation in areas of the scheduler where we're only interested in reading data and others where we have to write data, which is why I developed the URW locks in the first place.

A brief reminder of why I have developed upgradeable read write locks instead of using regular read-write locks is that if there are multiple readers holding an RW lock, write access is starved until they all release the lock. This situation is unacceptable for a scheduler where it is mandatory that writes take more precedence than reads and the upgradeable version has write precedence as a feature as well as being upgradeable, thus allowing us to hold exclusive write access only for as long as we need it. Their main disadvantage is they are much heavier in overhead since they're effectively using double locks hence they would need to be used only where there is a clear work case for them.

I've updated the urwlocks patch and posted it here:
urw-locks.patch
Note this patch by itself does nothing unless other code uses the locks.

In my original experiments I had done the conversion and performed various benchmarks on a quad core hyperthreaded CPU and had seen neither benefit nor detriment. The locks were originally developed with a view to expanding on the BFS design for scalability improvements primarily, and time and lack of success in deriving benefit from that development led to me shelving the project.

I now have access to a hex core hyperthreaded CPU which acts like 12 logical CPUs and felt it was time to revisit the idea. This time I rehashed the use of URW locks for the global runqueue and was even more aggressive in trying to spend as little time with the write lock held as possible. After extensive benchmarking, though, my conclusions were even worse than last time: Not only did it not improve performance, it statistically significantly ever so slightly worsened throughput. This suggests that whatever scalability improvements are there to be gained by decreasing lock contention are offset by the increased overhead of URW locks versus spinlocks.

The updated version of this patch that depends on the urw locks patch above (to be applied on top of BFS 449 and all pending patches) is here:
bfs449-grq_urwlocks.patch

It was interesting to note that for whatever reason the context switch rate actually decreased under load compared to regular BFS suggesting the discrete read paths helped contribute to less requirement for rescheduling suggesting it could lead to benefit if not for the increased locking overhead.

In some ways I'm not surprised as complexity and added infrastructure for the sake of apparent benefit generically does not guarantee improvement and simpler code, if not ideal in design, can work better in real world workloads. There are three discrete examples of this now in my experiments with BFS:

1. Single vs multiple runqueues.

The identical design in basic algorithm for BFS when applied to multiple runqueues versus the single runqueue shows no measurable increase in lock contention or even cache trashing on any regularly available hardware (confirmed on up to dual 12 threaded CPU machines).

2. Spinlocks vs rwlocks.

As per this post.

3. O(n) vs O(ln(n)) lookup.

Any improvement in the lookup time of many tasks is offset by the added complexity of insertion and that lookup and what the real world sizes of n will be. The limiting behaviour of the function describes how it changes with n only, it does not tell you what the absolute overhead equates to in real world workloads.

-ck

Monday, 11 June 2012

Upgradeable rwlocks and BFS

I've been experimenting with improving the locking scheme in BFS and had a few ideas. I'm particularly attached to the global runqueue that makes BFS what it is and obviously having one queue that has one lock protecting all its data structures accessed by multiple CPUs will end up having quite significant scalability limits with many CPUs. Much like a lot of the scalability work, it tends to have the opposite effect with smaller hardware (i.e. the more scalable you make something, the more it will tend to harm the smaller hardware). Fortunately most of the scalability issues with BFS are pretty much irrelevant until you have more than 16 logical CPUs, but we've now reached the point where 8 logical CPUs is not unusual for a standard PC. Whether or not we actually need so many logical CPUs or not and can use them appropriately is an argument for a different time and place, but they're here. So I've always had at the back of my mind how I might go about making BFS more scalable in the long term and locking is one obvious area.

Unfortunately time and realistic limitations means I'm unlikely to ever do anything on a grand scale or be able to support it. However I had ideas for how to change the locking long-term but it would require lots of incremental steps. After all that rambling, this post is about the first step to changing it. Like some of the experimental steps of the past (such as skip lists), there is absolutely no guarantee that these are worth pursuing.

I've implemented a variant of the read/write locks as used in the kernel to better suit being used as a replacement for the spinlock that protects all the global runqueue data. The problem with read/write locks is they favour readers over writers, and you cannot upgrade or downgrade locks. Once you have grabbed them, if you drop the lock and then try to grab the other variant (eg. going from read to write), the data is no longer guaranteed to be the same. What I've put together (and I'm not remotely suggesting this is my original idea, I'm sure it's been done elsewhere) are upgradeable read/write locks where you can take either the read lock, write lock, or an upgradeable variant. Upgradeable locks can be up or downgraded to write or read locks, and write locks can be downgraded to read locks, but read locks can not be changed. These URW locks are unfortunately more overhead than either spinlocks or normal rwlocks since they're actually just taking combinations of spinlocks and rwlocks. However the overhead of the locks themself may be worthwhile if it allows you to convert otherwise locked data into sections of parallel read code.

So here is a patch that implements them and can be put into pretty much any recent linux kernel version:
urw-locks.patch

Note that patch does nothing by itself, but here is a patch to add on top of that one for BFS423 that modifies the global runqueue to use the urw locks and implements some degree of read/write separation that did not exist with the regular spinlock:
bfs423-grq_urwlocks.patch

It's rather early code but I've given it a fairly thorough testing at home and it at least seems to be working as desired. On the simplest of preliminary testing I'm unable to demonstrate any throughput advantage on my quad core hardware, but the reassuring thing is I also find no disadvantage. Whether this translates to some advantage on 8x or 16x is something I don't have hardware to test for myself (hint to readers).

Note that even if this does not prove to be any significant throughput gain, then provided it is not harmful, I hope to eventually use it as a stepping stone to a grander plan I have for the locking and scalability. I don't like vapourware and since I haven't finalised the details myself exactly how I would implement them, there's nothing more to say for now. Then there's the time issue... there never seems to be enough, but I only ever hack for fun so it's no problem really.

P.S. Don't let this long post make you not notice that BFS 423 and 3.4-ck2 are also out in the previous post.

Wednesday, 13 April 2011

Scalability of BFS?

So it occurred to me that for some time I've been saying that BFS may scale well only up to about 16 CPUs. That was a fairly generic guess based on the design of BFS, but it appears that these more-thread machines and multi-core machines seem to quite like BFS on the real-world benchmarks I'm getting back from various people. With the latest changes to BFS, which bumped the version up to 0.400, it should have improved further. I've tried googling for links to do with BFS and scalability and the biggest machine I've been able to find that benefits from it is a 24 core machine running F@H (folding at home). Given that this was with an older version of BFS, and that there were actually advantages even at 24 cores, I wonder what the point is where it doesn't scale? Obviously scalability is more than just "running F@H" and will depend entirely on architecture and workload and definition of scalability, and so on, but... I wanted to ask the community what's the biggest machine anyone has tried BFS on, and how well did it perform? If someone had access to 16+ cores to try it out I'd be mighty grateful for your results.

Friday, 25 February 2011

lrzip-0.570 for the uncorrupt.

When I last blogged about lrzip, I mentioned the corruption on decompression issue a user was seeing in the field. This bug, not surprisingly, worried me greatly so I set out on a major hunt to eliminate it, and make lrzip more reliable on decompression. After extensive investigation, and testing on the part of the user, to cut a long story short, the corruption was NEVER THERE.

The problem he was encountering was on decompressing a 20GB logfile, he would compare it to the original file with the 'cmp' command. On decompressing the file and comparing it, there would be differences in the file at random places. This made me think there was a memory corruption somewhere in lrzip. However he also noted that the problem went away on his desktop machine when he upgraded from Debian Lenny to Squeeze. So we knew something fishy was going on. Finally it occurred to me to suggest he try simply copying the 20GB logfile and then running 'cmp' on it. Lo and behold just copying a file of that size would randomly produce a file that had differences in it. This is a disturbing bug, and had it been confined to one machine, would have pointed the finger at the hardware. However he had reproduced it on the desktop PC as well, and the problem went away after upgrading his distribution. This pointed to a corruption problem somewhere in the layers between write() and what ends up on the disk. Anyway this particular problem is now something that needs to be tackled elsewhere (i.e. Debian).

Nonetheless, the corruption issue got me thinking about how I could make lrzip more reliable on decompression when it is mandatory that what is on disk is the same as what was originally compressed. Till now, lrzip has silently internally used crc32 to check the integrity of each decompressed block before writing it to disk. crc32 still has its place and is very simple, but it has quite a few collisions once you have files in the gigabyte size (collisions being files with the same CRC value despite being different). Fortunately, even with a hash check as simple as CRC, if only one byte changes in a file, the value will never be the same. However the crc was only being done on each decompressed chunk and not the whole file. So I set out to change over to MD5. After importing the MD5 code from coreutils and modifying it to suit lrzip, I added an md5 check during the compression phase, and put the MD5 value in the archive itself. For compatibility, the CRC check is still done and stored, so that the file format is still compatible with all previous 0.5 versions of lrzip. I hate breaking compatibility when it's not needed. On decompression, lrzip will now detect what is the most powerful hash function in the archive and use that to check the integrity of the data. One major advantage of md5 is that you can also use md5sum which is standard on all modern linux installations to compare the value to that stored in the archive on either compression or decompression. I took this idea one step further, and added an option to lrzip (-c) to actually do an md5 on the file that has been written to disk on decompression. This is to ensure that what is written on disk is what was actually extracted! The Debian lenny bug was what made me think this would be a useful feature. I've also added the ability to display the md5 hash value with a new -H option, even if the archive was not originally stored with an md5 value.

One broken "feature" for a while now has been multi-threading on OS-X. I have blogged previously about how OS-X will happily compile software that uses unnamed semaphores, yet when you try to run the program, it will say "feature unimplemented". After looking for some time at named semaphores, which are clunky in the extreme by comparison, it dawned on me I didn't need semaphores at all and could do with pthread_mutexes which are supported pretty much everywhere. So I converted the locking primitive to use mutexes instead, and now multi-threading on OS-X works nicely. I've had one user report it scales very well on his 8-way machine.

Over the last few revisions of lrzip, apart from the multi-threaded changes which have sped it up, numerous changes to improve the reliability of compression/decompression (to prevent it from running out of memory or corrupting data) unfortunately also have slowed it down somewhat. Being a CPU scheduler nut myself, I wasn't satisfied with this situation so I set out to speed it up. A few new changes have made their way into version 0.570 which do precisely that. The new hash check of both md5 and crc, which would have slowed it down now with an extra check, are done now only on already buffered parts of the main file. On a file that's larger than your available ram, this gives a major speed up. Multi-threading now spawns one extra thread as well, to take into account that the initial start up of threads is partially serialised, which means we need more threads available than CPUs. One long term battle with lrzip, which is never resolved, is how much ram to make available for each stage of the rzip pre-processing and then each thread for compression. After looking into the internals of the memory hungry lzma and zpaq, I was able to more accurately account for how much ram each thread would use, and push the amount of ram available per compression thread. The larger the blocks sent to the compression back end, the smaller the resulting file, and the greater the multi-threading speed up, provided there's enough data to keep all threads busy. Anyway the final upshot is that although more threads are in use now (which would decrease compression), compression has been kept approximately the same, but is actually faster.

Here's the latest results from my standard 10GB virtual image compression test:

Compression  Size         Percentage  Compress Time  Decompress Time
None         10737418240  100.0
gzip         2772899756    25.8          5m47s        2m46s
bzip2        2704781700    25.2         16m15s        6m19s
xz           2272322208    21.2         50m58s        3m52s
7z           2242897134    20.9         26m36s        5m41s
lrzip        1372218189    12.8         10m23s        2m53s
lrzip -U     1095735108    10.2          8m44s        2m45s
lrzip -l     1831894161    17.1          4m53s        2m37s
lrzip -lU    1414959433    13.2          4m48s        2m38s
lrzip -zU    1067075961     9.9         69m36s       69m35s

Lots of other internal changes have gone into it that are too numerous to go into depth here (see the Changelog for the short summary), but some user visible changes have been incorporated. Gone is the annoying bug where it would sit there waiting for stdin input if it was called without any arguments. The help information and manual page have been dramatically cleaned up. The -M option has been abolished in favour of just the -U option being used. The -T option no longer takes an argument and is just on/off. A -k option has been added to "keep corrupt/broken files" while corrupt/broken files generated on compression/decompression are automatically deleted by default. The -i information option now gives more information, and has verbose(+) mode to give a breakdown of the lrzip archive, like the following -vvi example:

Detected lrzip version 0.5 file.
../temp/enwik8.lrz:
lrzip version: 0.5 file
Compression: rzip + lzma
Decompressed file size: 100000000
Compressed file size: 26642293
Compression ratio: 3.753
MD5 used for integrity testing
MD5: a1fa5ffddb56f4953e226637dabbb36a
Rzip chunk 1:
Stream: 0
Offset: 25
Block   Comp    Percent Size
1       lzma    58.1%   867413 / 1493985        Offset: 22687516        Head: 0
Stream: 1
Offset: 25
Block   Comp    Percent Size
1       lzma    28.8%   5756191 / 20000000      Offset: 75      Head: 5756266
2       lzma    28.4%   5681891 / 20000000      Offset: 5756291 Head: 11438182
3       lzma    28.2%   5630256 / 20000000      Offset: 11438207        Head: 17068463
4       lzma    28.1%   5619003 / 20000000      Offset: 17068488        Head: 23554929
5       lzma    28.5%   3087298 / 10841364      Offset: 23554954        Head: 0
Rzip compression: 92.3% 92335349 / 100000000
Back end compression: 28.9% 26642052 / 92335349
Overall compression: 26.6% 26642052 / 100000000

I didn't bother blogging about version 0.560 because all the while 0.570 was under heavy development as well and I figured I'd wrap it all up as a nice big update instead. I'm also very pleased that Peter Hyman, who helped code for lrzip some time ago, has once again started contributing code.

That's probably enough babbling. You can get it here once freshmeat updates its links:
lrzip

Sunday, 12 December 2010

lrzip-0.551, overlapping streams for faster compression

A few bug reports have been trickling in since I last released a version of lrzip, which as I said last time is good and bad. It's helped direct my workflow on this, even though I keep trying to let it settle down. I'd love to leave the code since it mostly works quite well, and there is a common mantra that "release early, release often" is considered good, but in practice that doesn't seem to work. If you release stuff while it's still broken, it will get a bad reputation and people will dismiss it that time around and often not give it a chance again. So if there are show stopper bugs, they need attending to ASAP.

So anyway, the fact is that the fun part is implementing stuff rather than just fixing bugs, so for the next release I worked on doing both since I had to fix the bugs I knew about. I've mentioned before that lrzip breaks data into rzip streams and compresses each stream a block at a time. I had implemented multithreading by splitting that block up according to number of CPUs, but at the end of the stream, it waited to get all the blocks back before moving onto the next stream. For this release I moved the threading up earlier in the overall process, thus allowing the rzip stage to move onto the next stream while the back end threads are still busy compressing data from the previous stream. This means that if a file is large enough (greater than approximately 2/3 ram), once all CPUs are in use with back end threads, they should remain in use until the end of the file is reached. This speeds up compression on large files on SMP machines.

While looking for bugs to do with lzma compression failing due to running out of ram, I discovered why changing from level 7 compression to 9 compression did nothing... It's been a long time since the lzma library was imported into lrzip, and I had completely forgotten that it only goes up to level 7! The default lzma compression level is meant to be 5, so I've scaled it down now, meaning lrzip will now use the best all round compromise level for lzma. This also was affecting the window sizes internally used by the lzma library. At level 5, the windows are 16MB in size, and go up to 64MB for level 7. As one kind person noted in the comments on this blog, lzma uses up to 12x the memory of the compression window, so each thread would have been needing up to 800MB each. This explained why people were getting lzma failing on window sizes and complaining despite apparently enough ram. While the new level does lose slightly on compression, it does also improve compression speed significantly, and should pretty much eliminate lzma compression failing due to "too big window sizes".

With the two speed improvements, here are the sort of speed ups one can see with version 0.550 of lrzip:
10GB virtual image on quad core with 8GB ram:
lrzip 15m45s -> 11m03s
lrzip -M 12m03s -> 9m30s

And just as a reminder, xz completed this in 50m58s and the file size was almost halved by lrzip compared to xz.

Benchmarks can be found here:
http://ck.kolivas.org/apps/lrzip/README.benchmarks

On version 0.544 of lrzip, I had changed multithreading to use two separate groups of threads for stream 0 and stream 1 to make decompression more robust for out-of-order storage in the archive. Well unfortunately this doubled the possible number of threads, and this brought a now all-too-familiar problem: running out of ram on big files. In .550 I changed it to only use one thread for stream 0 rather than multiple ones since it's usually only one block of stream 0 before multiple blocks of stream 1. This can slow down decompression a little more, but reliability is more important.

I put some effort into tackling the dreaded running out of ram problem with multiple threads all competing for ram. What lrzip does now is detect when the compression back end has failed (which is usually for memory reasons) and instead of failing completely, it will then wait for the previous thread to complete its compression before trying again. This serialises work that would have otherwise been in parallel. Thus less ram is required since only one thread is likely to be using ram. I did this on both compression and decompression. Also, I made the lzma compression back end first try a lower compression level (since this uses a smaller window) before failing, and then finally resorting to bzip if it never succeeds. This should make running out of memory a non-fatal problem since it will just try again using less and being slower.

On the subject of constraining memory usage, I finally bit the bullet and dropped the maximum mmap down to only 1/3 of ram since I can use my sliding mmap to do the remaining chunk size. It is a tiny bit slower, but far more reliable, and leaves a lot more ram for the compression back ends. This makes my sliding mmap implementation, which was just an experiment originally, used routinely for any file larger than 1/3 of ram size.

My "spreading out of thread spawning evenly" idea in the last release was a failure. I reverted it since it provided no speed ups and occasionally slowed things down.

I fixed a few other minor errors to do with not callocing enough ram in some structs which looked really scary but I have no idea if they were causing problems. Hopefully they fixed some random problem somewhere :P

So I was very proud of this release and made it go gold as version 0.550...

About 5 minutes after releasing it, I found a couple of new bugs that were pretty nasty. When lzma worked on an incompressible block, it now failed instead of just leaving that block uncompressed. That was an easy fix. I also found that I had broken compression from STDIN a while back and hadn't noticed, and the compression summary at the end was reading 0. So I quickly corrected those and bumped the version up to 0.551.

So get it here:
http://freshmeat.net/projects/lrzip

No doubt more bugs will show up and I'll have to tend to those as well, but this should be a more robust version since a lot of failure cases in 0.544 have been addressed.

Other ideas I'm considering for lrzip at this stage based on requests include a good error testing and recovery mechanism with either reed-solomon or low density parity checking, and encrypted password protection. I have a few libraries I've looked at for both of these but am open to suggestions if people have recommendations.

EDIT: People didn't understand how the chunks and streams work, so I've copied this comment I made below into the blogpost.

---

Lrzip splits files up multiple times. Imagine a file is twice the size of ram. It will first be split into chunks:
A B C
where each chunk is 1/3 the size of the file.

Then the rzip stage is performed on chunk A, splitting it into streams:
A0 A1
where A0 is the "dictionary" and A2 the "matches"

Then each stream is split up into further chunks for the compression backend, chunks
A0A A0B
and then
A1A A1B A1C A1D A1E
and so on, and each chunk is passed onto the compressor (such as lzma) and the results written to disk.

All this used to be done in series.

Version 0.530 made the last chopping up part be done in parallel since the slow compression back end (lzma or zpaq) was the time consuming part of compression. However when once streams A0 and A1 would have finished their rzip stage, lrzip would wait for A0A A0B A1A A1B etc. to finish compressing completely before lrzip would move onto chunk B.

Version 0.551 now allows lrzip to move onto chunk B while A0A A0B A1A A1B etc are all still compressing.

---

This means that on faster compression (lzo, gzip and bzip2), the rate limiting step is the stream production of A0 and A1. The pipe dream is to parallelise those, but they rely on using as much ram as possible to make the rzip stage effective so it would compromise compression greatly (and be messy as hell to implement).

Thursday, 11 November 2010

lrzip scalability improvements

I've been looking at the benchmarks I've accumulated for lrzip's performance and have been mostly pleased. The speed/size performance or pure size performance is very good compared to other compression algorithms, but not much has been done to date to make it scale better on SMP. It scales slightly during lzma compression, but even there it's a clunky threading library attached to it that increases the CPU usage up to a bit over 2 CPUs for about 50% better performance.

So I've set out to make significant improvements to the scalability of lrzip on SMP, knowing it will slightly adversely affect compression. There are two stages to compression on lrzip, the preprocessing rzip stage and the back end compression 2nd stage. Improving the scalability of the first stage looks highly improbable (I'll never admit to anything being impossible). However, if I split up the chunks handed to the back end compressor and spread it over each thread, it should scale better. So the first step I did here is to do precisely that: Grab all the data from rzip and then split it up to a fixed number of block sizes and use every CPU available, provided each block is at least a minimum size to preserve compression efficiency. I've done precisely that and committed it to the git tree:

lrzip repo

Now it would be nice if this made the backend compression 4 x faster on quad core, but there are a number of reasons this doesn't happen. Mainly it's because they're all competing for cache and memory bandwidth, and (less significantly) the blocks end up compressing different degrees so not all threads finish at the same time. Still, the classic 23 minute 10GB benchmark now dropped to 18 minutes which is a fairly significant improvement given that previous lzma compression was already lightly threaded. The file was only marginally bigger so it's definitely a win.

Next up is to try and break up the rzip stream and hand it to the compression back end before the stream has even finished, to parallelise the rzip phase with the compression phase. This will not likely get to use *every* CPU at the same time (except in the case of zpaq), but will actually be faster than the previous approach. This work is much more invasive than the previous one and is currently underway.

After that I'm going to look at decompression and try and parallelise the back end decompression phase. While decompression is already quite fast, who can ever complain about more speed?

All in all this has been loads of fun to hack on. I'd never thought I'd say this, but maybe it would be nice to have a sea change and get paid to hack part-time and do medicine only part-time.