-ck hacking: 4.7


Saturday 12 November 2016

linux-4.8-ck7, MuQSS version 0.140

Another week has passed, another stable linux release, and to follow, another -ck and MuQSS release.

linux-4.8-ck7 patch:
patch-4.8-ck7.lrz

Split out patches:
http://ck.kolivas.org/patches/4.0/4.8/4.8-ck7/patches/

MuQSS by itself for 4.8:
4.8-sched-MuQSS_140.patch

MuQSS by itself for 4.7:
4.7-sched-MuQSS_140.patch

This release marks a change towards conservative changes only.

I've rolled back the extensive timer changes outside the main scheduler code. There are too many assumptions made about timeouts in the kernel code that are potentially problematic in the real world, and there is code that is poorly prepared for freezer usage (suspend to ram) that breaks. Additionally, not a single user reported a workload that they noticed benefited from the lower latency accurate timeouts. Finally, the added overhead is demonstrable in throughput benchmarks, and when doing comparisons with mainline it is doing MuQSS a disservice to mix in other code that it's not actually responsible for.

There are also a small number of bugfixes for warnings/crashes in the updated MuQSS that showed up after the last release as people are using it on more and varied hardware in the wild now. These may have positive effects on other less defined issues in the wild too.

The -ck release also includes an updated version of BFQ. Along with this updated version, I would like to issue a warning regarding BFQ. I have heard rumours that a number of users have reported filesystem corruption with the combination of BTRFS and BFQ. If you are using this filesystem, I urge you to not compile in BFQ at all, or at the very least not make it the default, using it selectively on devices where you are running a different filesystem (I still recommend people use ext4.) I would like to encourage users who have run into this problem to report it to the BFQ maintainer.

I've cleaned up the patches in the -ck tarball once again to include only the changes in combined related patches. This will ease the burden of porting to the next major linux kernel release and allow users to easily select which patches they wish to use themselves.

As always, make sure to give me your feedback, bug reports, warnings, and bitcoin.

Enjoy!
お楽しみ下さい
-ck

Saturday 5 November 2016

linux-4.8-ck6, MuQSS version 0.135

Announcing a new version of MuQSS and a -ck release

4.8-ck6 patchset:
http://ck.kolivas.org/patches/4.0/4.8/4.8-ck6/

MuQSS by itself for 4.8:
4.8-sched-MuQSS_135.patch

MuQSS by itself for 4.7:
4.7-sched-MuQSS_135.patch

Git tree:
https://github.com/ckolivas/linux

A week has passed since the last major update to BFS and -ck was posted, allowing me to concentrate on receiving and responding to any bug reports. As it turns out, there were very few apart from the recurring local_softirq_pending warning/stalls. This is nice because it means MuQSS is mostly ~stable now. Mainline has even had more "stable" releases in the same time as MuQSS for 4.8, moving to 4.8.6 in the interim.

In this version I've added aggressive handling of pending softirqs in the hope the warnings and stalls all go away. The true reason softirqs are being dropped still escapes me, but it is likely related to the fact that MuQSS does a lot of lockless rescheduling across CPUs to decrease overhead, and this does not give the guarantees that locking would.

Additionally, I've added a number of APIs to the kernel to do specified millisecond schedule timeouts which use the highres timers which are mandatory now for MuQSS. The reason for doing this is there are many timeouts in the kernel that specify values below 10ms and the timer resolution at 100Hz only guarantees timeouts under 20ms.

I've also done a code sweep across the entire kernel looking for timeout calls under 50ms and replaced them with the new interface. Additionally there are numerous places in the kernel where schedule_timeout(1) is used and a "minimum timeout" is expected, yet this is entirely Hz dependent, again being up to 20ms in duration. I've replaced all these with a 1ms timeout, emulating what would happen on a 1000Hz kernel, but without the overhead of running the higher Hz kernel. I'm not entirely sure this will equate to any real world improvements but the fact it's used in things like audio drivers worries me that it might.

Finally I've replaced the standard msleep call from userspace to use highres timers, in case there are userspace applications that expect msleep to actually give a sleep that resembles what's asked of it instead of something Hz limited, and in case this is leading to slowdowns in userspace due to assumptions on the userspace coders' part. Calls to msleep() from userspace now give 100us accuracy at 100Hz instead of 20ms.
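To make the 20ms figure concrete, here is a rough Python model of tick-based timeout arithmetic. It is only a sketch: it assumes the traditional behaviour of rounding a millisecond request up to whole jiffies and then adding one more jiffy so the sleep can never be short, and the function names are illustrative rather than kernel code.

```python
def msecs_to_jiffies(ms: int, hz: int) -> int:
    """Round a millisecond timeout up to a whole number of ticks (jiffies)."""
    # Integer ceiling division avoids floating point rounding surprises.
    return (ms * hz + 999) // 1000


def worst_case_sleep_ms(ms: int, hz: int) -> float:
    """Upper bound on a tick-based sleep of `ms` milliseconds.

    Assumption: the request is rounded up to jiffies and one extra jiffy
    is added, because the first tick may arrive almost immediately.
    """
    jiffies = msecs_to_jiffies(ms, hz) + 1
    return jiffies * 1000 / hz


if __name__ == "__main__":
    for hz in (100, 250, 1000):
        print(f"HZ={hz}: a 1ms request may sleep up to "
              f"{worst_case_sleep_ms(1, hz)}ms")
```

Under these assumptions a 1ms request at 100Hz has a 20ms worst case, while the same request at 1000Hz is bounded by 2ms, which is the gap the highres timer based timeouts are meant to close without raising Hz.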

All these timing changes add overhead since they're trying to emulate the timing accuracy of running at 1000Hz but in a latency-focused scheduler I believe they're appropriate, and they do not incur the overhead that actually changing Hz would incur. Additionally they add accuracy to timers and timeouts that 1000Hz does not afford.

In the -ck tarball of broken-out patches, I've kept these timer changes separate to allow the muqss scheduler to be applied by itself should they prove problematic, and they will make merging with future kernels easier.

Enjoy!
お楽しみください
-ck

Saturday 29 October 2016

linux-4.8-ck5, MuQSS version 0.120

Announcing a new version of MuQSS and a -ck release to go with it in concert with mainline releasing 4.8.5

4.8-ck5 patchset:
http://ck.kolivas.org/patches/4.0/4.8/4.8-ck5/

MuQSS by itself for 4.8:
4.8-sched-MuQSS_120.patch

MuQSS by itself for 4.7:
4.7-sched-MuQSS_120.patch

Git tree:
https://github.com/ckolivas/linux

This is a fairly substantial update to MuQSS which includes bugfixes for the previous version, performance enhancements, new features, and completed documentation. This will likely be the first publicly announced version on LKML.

EDIT: Announce here: LKML

New features:
- MuQSS is now a tickless scheduler. That means it can maintain its guaranteed low latency even in a build configured with a low Hz tick rate. To that end, it is now defaulting to 100Hz, and it is recommended to use this as the default choice for it leads to more throughput and power savings as well.
- Improved performance for single threaded workloads with CPU frequency scaling.
- Full NoHZ now supported. This disables ticks on busy CPUs instead of just idle ones. Unlike mainline, MuQSS can do this virtually all the time, regardless of how many tasks are currently running. However this option is for very specific use cases (compute servers running specific workloads) and not for regular desktops or servers.
- Numerous other configuration options that were previously disabled from mainline are now allowed again (though not recommended for regular users.)
- Completed documentation can now be found in Documentation/scheduler/sched-MuQSS.txt
Bugfixes:
- Fix for the various stalls some people were still experiencing, along with the softirq pending warnings.
- Fix for some loss of CPU for heavily sched_yielding tasks.
- Fix for the BFQ warning (-ck only)

Enjoy!
お楽しみ下さい
-ck

Monday 24 October 2016

linux-4.8-ck4, MuQSS CPU scheduler v0.116

Yet another bugfix release for MuQSS and the -ck patchset with one of the most substantial latency fixes yet. Everyone should upgrade if they're on a previous 4.8 patchset of mine. Sorry about the frequency of these releases but I just can't allow a known buggy release to be the latest version.

4.8-ck4 patchset:
http://ck.kolivas.org/patches/4.0/4.8/4.8-ck4/

MuQSS by itself for 4.8:
4.8-sched-MuQSS_116.patch

MuQSS by itself for 4.7:
4.7-sched-MuQSS_116.patch

I'm hoping this is the release that allows me to not push any more -ck versions out till 4.9 is released since it addresses all remaining issues that I know about.

A lingering bug that has been troubling me for some time was leading to occasional massive latencies and thanks to some detective work by Serge Belyshev I was able to narrow it down to a single line fix which dramatically improves worst case latency when measured. Throughput is virtually unchanged. The flow-on effect to other areas was also apparent with sometimes unused CPU cycles and weird stalls on some workloads.

Sched_yield was reverted to the old BFS mechanism again, which GPU drivers prefer, but it wasn't working previously on MuQSS because of the first bug. The difference is substantial now and drivers (such as nvidia proprietary) and apps that use it a lot (such as the Folding@home client) behave much better.

The late introduced bugs that got into ck3/muqss115 were reverted.

The results come up quite well now with interbench (my latency under load benchmark) which I have recently updated and should now give sensible values:

https://github.com/ckolivas/interbench

If you're baffled by interbench results, the most important number is %deadlines met which should be as close to 100% as possible followed by max latency which should be as low as possible for each section. In the near future I'll announce an official new release version.

Pedro in the comments section previously was using runqlat from bcc tools to test latencies as well, but after some investigation it became clear to me that the tool was buggy and did not work properly with bfs/muqss either so I've provided a slightly updated version here which should work properly:

runqlat.py

Enjoy!
お楽しみ下さい
-ck

Monday 17 October 2016

MuQSS - The Multiple Queue Skiplist Scheduler v0.112

Here's an updated version of MuQSS.

For 4.8.*:
4.8-sched-MuQSS_112.patch

For 4.7.*:
4.7-sched-MuQSS_112.patch

Git tree here as 4.7-muqss or 4.8-muqss branches:
https://github.com/ckolivas/linux

It's getting close now to the point where it can replace BFS in -ck releases. Thanks to the many people testing and reporting back, some other misbehaviours were discovered and their associated fixes have been committed.

In particular,
- Balancing across CPUs was not looking at higher and lower scheduling policies correctly (SCHED_ISO, SCHED_IDLEPRIO and realtime policies)
- A serious stall/hang could happen with tasks using sched_yield (such as f@h client and numerous GPU drivers)
- Some minor accounting issues on new tasks with affinity set were fixed
- Overhead was further decreased on task selection
- Spurious preemption on CPUs where the preempted task had already gone is now avoided
- Spurious wakeups on CPUs that were assumed to be idle but no longer are, are now avoided
- A potential race in suspending to ram was fixed
- Old unused code from BFS was removed, along with unnecessary intermediate variables.
- Clean ups
- Some work towards actually documenting MuQSS in Documentation/scheduler/sched-MuQSS.txt was done, though incomplete.

Enjoy!
お楽しみ下さい
-ck

Tuesday 11 October 2016

MuQSS - The Multiple Queue Skiplist Scheduler v0.111

Lots of bugfixes, lots of improvements, build fixes, you name it.

For 4.8:
4.8-sched-MuQSS_111.patch

For 4.7:
4.7-sched-MuQSS_111.patch

And in a complete departure from BFS, a git tree (which suits constant development like this, unlike BFS's stable release massive ports):

https://github.com/ckolivas/linux

Look in the pending/ directory to see all the patches that went into this or read the git changelog. In particular numerous warnings were fixed, throughput improved compared to 108, SCHED_ISO was rewritten for multiple queues, potential races/crashes were addressed, and build fixes for different configurations were committed.

I haven't been able to track the bizarre latency issues reported by runqlat and when I try to reproduce it myself I get nonsense values of latency greater than the history of the earth so I suspect an interface bug with BPF reporting values. It doesn't seem to affect actual latency in any way.

EDIT: Updated to version 0.111 which has a fix for suspend/resume.

Enjoy!
お楽しみ下さい
-ck

Wednesday 5 October 2016

MuQSS - The Multiple Queue Skiplist Scheduler v0.106

Another day and time for yet another release.

There are 0.106 versions and incrementals available for linux-4.7:
http://ck.kolivas.org/patches/muqss/4.0/4.7/
and linux-4.8:
http://ck.kolivas.org/patches/muqss/4.0/4.8

Two large remaining races that could lead to warnings, stalls, or in the worst case, crashes, have been fixed in this version.

Additionally the multiple-runqueue locking has been significantly optimised to take only the runqueues needed for as long as they're needed only and dropped as soon as possible which should bring the lock contention levels down even further. This is a performance enhancement, more so in non-interactive mode, though it will only start being demonstrable if you're lucky enough to have many CPUs.

This version addresses all the known bugs and warnings I've received to date so hopefully I can have a little rest and let people out there actually give it a go. What will you expect if you use this instead of BFS? If I've done this correctly, you will notice absolutely no difference since the idea was to preserve the interactivity and responsiveness of BFS and make it scalable to more CPUs than most people can afford.

Keep the feedback coming, thanks.

Enjoy!
お楽しみ下さい
-ck

Tuesday 4 October 2016

MuQSS - The Multiple Queue Skiplist Scheduler v0.105

I spent the last few days fighting with various lock debugging techniques and the numerous bug reports and am pleased to announce a new version of MuQSS, version 0.105.

There are versions and incrementals available for linux-4.7:
http://ck.kolivas.org/patches/muqss/4.0/4.7/
and linux-4.8:
http://ck.kolivas.org/patches/muqss/4.0/4.8

If you've been waiting for me to say it's stable enough to try, then now's your chance for I've addressed all known bugs at this time and it's working well for me.

Most of the issues were to do with races and unstable handling of cross-cpu task movement. No effort went into improving performance from 104 though this version should address many of the crashes and hangs that have been reported with earlier versions.

Additionally there is a pending patch being uploaded for BFS512 which, as per usual, had some last minute issues that only just showed up. If enough users complain loudly enough or more issues show up I might just release another bfs and -ck since it should be stable, especially being one of the last BFS releases.

http://ck.kolivas.org/patches/bfs/4.0/4.8/Pending/

Keep the feedback and bug reports coming. Next I need to put more care into the non-interactive mode of muqss for your enjoyment.

Enjoy!
お楽しみ下さい
-ck

Saturday 1 October 2016

MuQSS - The Multiple Queue Skiplist Scheduler v0.105

Announcing a multiple runqueue variant of BFS, with the more mundane name of MuQSS (pronounced mux) for linux 4.7:

Full patch for linux-4.7
4.7-sched-MuQSS_105.patch

Keep watching this blog for newer versions!

Incremental to patch bfs502 to MuQSS 0.1:
bfs502-MuQSS_103.patch

It was inevitable that one day I would find myself tackling the 2 major scalability limitations in BFS and this is the result of it. These two issues were:

1. The single runqueue, which means all CPUs would contend for the lock on the one runqueue, and
2. The O(n) lookup, which means a linear increase in overhead for task lookups as the number of processes increases.

As you're all aware by now, skiplists were recently introduced into BFS to tackle number 2 with a modest improvement in throughput at high loads.

Till now I had neither the energy nor the time to try and find a solution for number 1 that maintained BFS' scheduling decision algorithm, as the single runqueue was actually the reason latency remains bound and deterministic on BFS, capitalising on more CPUs instead of fighting against them for scalability.

This scheduler variant is an evolution of BFS, which hopefully will be mature enough to replace BFS one day when stability is assured. It is able to still use the same scheduling algorithm as BFS meaning latency and responsiveness remains as good as always, but with the per-CPU runqueue and discrete locking, it also means it will scale to any number of CPUs, as the mainline scheduler does.

It does NOT guarantee the best possible throughput as there still is virtually no complex balancing mechanism whatsoever, selecting tasks according to deadline primarily with only CPU cache distances being used to determine which idle CPU to go to, or in non-interactive mode, which overloaded CPU to pull from to fill an idle CPU.
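As a loose illustration of using only cache distance to choose an idle CPU, the Python sketch below picks the idle CPU closest to the one a task last ran on. The distance table, its values and the function name are all hypothetical, made up for the example; this is not how the MuQSS source is structured.

```python
from typing import List, Optional, Set


def closest_idle_cpu(src: int, idle: Set[int],
                     distance: List[List[int]]) -> Optional[int]:
    """Return the idle CPU with the smallest cache distance from `src`.

    Ties are broken by the lower CPU number; None means nothing is idle.
    """
    if not idle:
        return None
    return min(idle, key=lambda cpu: (distance[src][cpu], cpu))


# Hypothetical 4-CPU machine: CPUs 0/1 and 2/3 are SMT siblings.
# 0 = same CPU, 1 = SMT sibling (shared cache), 2 = different core.
DISTANCE = [
    [0, 1, 2, 2],
    [1, 0, 2, 2],
    [2, 2, 0, 1],
    [2, 2, 1, 0],
]

if __name__ == "__main__":
    print(closest_idle_cpu(0, {1, 3}, DISTANCE))  # sibling preferred
    print(closest_idle_cpu(0, {2, 3}, DISTANCE))  # tie, lower CPU wins
```

There is no load averaging or balancing heuristic here, which is the point of the paragraph above: the choice is cheap and predictable, at the cost of not always giving the best throughput.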

It would be possible, with a lot of effort, to wedge the entire balancing algorithm for scalability from mainline into this, though it will probably offset the deterministic latency that makes it special.

This is a massive rewrite and consequently there are bound to still be race conditions and hidden bugs though I have been running it for a while now with reasonable stability. I'm putting this out there for the braver people to test. There's a lot more to document about it but for now let's just say, give it a try.

Please don't use any lock debugging as it will light up every possible complaint for the time being!

Regarding 4.8, for the time being I will still be releasing BFS for it and incorporate it into -ck

EDIT: Updated to version 0.105 with significant bugfixes.

Enjoy!
お楽しみ下さい
-ck

Friday 23 September 2016

BFS 502, linux-4.7-ck5

With the fix for the last of the freezes with BFS497 becoming clearer and a number of other minor issues being attended to, such as build failures and minor improvements accumulating, I'm releasing a new BFS that combines all into yet another release, which should be the last of the releases for the 4.7 kernel.

BFS by itself:
4.7-sched-bfs-502.patch

-ck patches with BFS:
4.7-ck5

In addition to the update to BFS, this -ck release is the first in a very long time to include a patch from another developer - the Throttled background buffered writeback v7 patch by Jens Axboe. This makes a massive difference to a system's ability to read files, open new applications etc. under heavy write loads in my testing and is a change which I believe is essential and will eventually make its way into the mainline kernel.

The changes to BFS 502 are as follows:

bfs497-build_other_arches.patch
bfs497-no_smtload_avg.patch
bfs497-recognise_nodes2.patch
bfs497-revert-othercpufreq.patch
bfs497-fix_smt_nonice.patch

A build fix for building on other architectures (notably ARM).
Simplifying the load measurement on SMT machines reported to cpufreq - trying to account for load on the SMT sibling is unnecessary as each core will run at the speed of the most loaded sibling anyway on any existing hardware.
A fix for detecting CPUs on other NUMA nodes and setting their locality correctly.
Not trying to signal CPU load to cpufreq on other CPUs when tasks migrate - this was leading to the hangs and there is enough rescheduling for cpufreq to get the load later on.
A build fix for when SMT_NICE is not configured.

Enjoy!
お楽しみ下さい
-ck

Tuesday 13 September 2016

BFS 497, linux-4.7-ck4

For the first time in a very long time, I'm announcing yet another -ck release up to ck4 along with yet more substantial updates for BFS for linux-4.7 based kernels.

BFS by itself:
4.7-sched-bfs-497.patch

-ck branded linux-4.7-ck4 patches:
linux-4.7-ck4

Thanks(?) to the massive changes to the mainline kernel I'd been forced to rewrite significant components of BFS to work properly with them, specifically the cpu frequency governors. At the same time I've had quite a bit of energy and enthusiasm for working on BFS in a way I haven't had in a long time. As a result, this updated version not only addresses the remaining cgroup stub patch bug (mentioned on the previous announcement) but implements further improvements and clean ups to go with those improvements.

Alas I still have no explanation for the random lockups some people are seeing, but I have seen reports of it happening on mainline kernels as well now, so while I'm always suspicious of my own code, there is also the chance that BFS exacerbates an issue in mainline. Something that appears common is onboard Intel graphics with the Haswell chipset.

Additionally I had reports of people being unable to suspend with BFS from 4.7 but I haven't heard back from them on later versions.

The short summary of improvements in this version is lower overhead, higher throughput and lower latencies.

I've rewritten the skiplist implementation to not require a malloc/free on insertion/removal of a new node which seemed to noticeably improve throughput at high loads.
Now that CPU frequency governors know what the scheduler is doing, the approach of BFS of old of knowing what the governor was doing and working around it is no longer helpful and I've removed the whole sticky task and offset for throttled CPUs and throughput has actually improved instead.
I've also added some micro-optimisations and cleanups.
I've added a minor change for offlining CPUs to prevent tasks trying to schedule to them.

The set of patches in ck4 is the largest in the ck patchset since the early 2.6 patchset days. I've also included the patch from Alfred (thanks!) to fix the warning that happens with suspend which is mostly harmless.

Each patch included has a mini changelog at the top.

I'm also keen to get feedback from people on if they see any noticeable interactive/responsiveness regressions by disabling the interactive flag as follows:

echo 0 > /proc/sys/kernel/interactive
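For anyone who wants to script A/B comparisons, the same toggle can be driven from Python. This is a minimal sketch that assumes a BFS patched kernel exposing the file above and a user with permission to write to it (normally root); the helper names are made up for the example.

```python
from pathlib import Path

# Tunable exposed by BFS patched kernels; absent on mainline.
INTERACTIVE = Path("/proc/sys/kernel/interactive")


def read_flag(path: Path = INTERACTIVE) -> int:
    """Return the current value of the flag (0 or 1)."""
    return int(path.read_text().strip())


def write_flag(value: int, path: Path = INTERACTIVE) -> None:
    """Set the flag; only 0 and 1 are accepted."""
    if value not in (0, 1):
        raise ValueError("interactive flag must be 0 or 1")
    path.write_text(f"{value}\n")


if __name__ == "__main__":
    if INTERACTIVE.exists():
        print("interactive =", read_flag())
    else:
        print("no interactive tunable found; not a BFS kernel?")
```

Calling write_flag(0) before a test run and write_flag(1) afterwards is equivalent to the echo command above and its inverse.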

Enjoy!
お楽しみ下さい
-ck

Wednesday 7 September 2016

BFS 490, linux-4.7-ck3

Announcing yet another substantial update for BFS for linux-4.7 based kernels.

BFS by itself:
4.7-sched-bfs-490.patch

-ck branded linux-4.7-ck3 patches:
linux-4.7-ck3

Following on from the large update to BFS in 480 to skip lists, numerous regressions became apparent, the bulk of which were related to doing a poor job of signalling cpu load to the various cpu frequency governors. Some were affected badly, others not so, but there were plenty of helpful people giving feedback about those regressions which encouraged me to slowly but surely chip away at the problems. Additionally, there were some minor behavioural regressions which were oversights during the updates to BFS 480. Finally the rudimentary cgroup stub patch would crash the system.

As the number of patches required to address these issues got larger and larger, it became hard for people on this blog to keep up with the changes so I've released 490 which hopefully should address the bulk of these issues - there are patches in there that haven't been posted on this blog, but I've included all of them with a brief description in the incremental/ directory for your perusal.

Anyway it is much easier for people to grab the latest version which includes all of those changes, including the updated cgroups stub patch.

EDIT: Here's a patch to make cgroup stubs safer cgroup-stubs-safe2.patch

Enjoy!
お楽しみ下さい
-ck

Friday 2 September 2016

BFS 480 with skip lists, linux-4.7-ck2

Announcing a major update for BFS for linux-4.7 based kernels.

BFS by itself:
4.7-sched-bfs-480.patch

-ck branded linux-4.7-ck2 patches:
linux-4.7-ck2

This is the largest BFS update in a long time. The various problems that had been accumulating forced me to spend a more extended period fixing BFS to work with the latest mainline changes and encouraged me to overhaul some areas that had long been needing it.

The changes are:

Fixed the crash when SMT NICE is configured in on a CPU without SMT.
Added my skiplist implementation.
Converted BFS from its long-standing O(n) lookup to use skiplists.
Fix crash when SMT NICE is enabled on some hardware
Fix try_preempt missing the locality diff effect in non-interactive mode
Ignore busy threads/caches when still on the same core
Reworked the testing of idle threads and cores for less overhead and to correctly identify idle siblings
Fix the CPU load that's passed to the cpu frequency governor, fixing a crash and non-working schedutil governor.

The short summary is I've fixed a number of showstopper bugs in the last version, and improved throughput.

Actually incorporating the skiplists that I had experimented with a long time ago was decided on by the fact that I was able to trim the skiplist overhead further and maintain identical semantics for process selection (maintaining interactivity) whereas on the previous experiment I had never completed the work. Throughput testing shows virtually identical performance on normal workloads and theoretically would be helpful in extreme overload cases.

The original post regarding skip lists was here:
bfs-and-skip-lists.html

This now means that BFS is no longer O(n) lookup after O(1) insertion. It is now O(log(n)) insertion, O(1) lookup and O(k) removal where k <= 16, thereby tackling a long-standing criticism of the overall design.
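To show where those three bounds come from, here is a small Python sketch of a skip list with a 16 level cap and per-level back pointers. It is an illustration of the general technique under my own simplifying assumptions (random node heights, a circular sentinel head), not a port of the kernel C code.

```python
import random

MAX_LEVEL = 16  # cap on node height, the k <= 16 bound quoted above


class Node:
    def __init__(self, key, value, level):
        self.key = key
        self.value = value
        self.level = level
        self.next = [None] * level  # forward pointer per level
        self.prev = [None] * level  # back pointer per level


class SkipList:
    def __init__(self, seed=None):
        # Sentinel head, linked to itself at every level (empty list).
        self.head = Node(None, None, MAX_LEVEL)
        for i in range(MAX_LEVEL):
            self.head.next[i] = self.head
            self.head.prev[i] = self.head
        self.rng = random.Random(seed)

    def _random_level(self):
        level = 1
        while level < MAX_LEVEL and self.rng.random() < 0.5:
            level += 1
        return level

    def insert(self, key, value):
        """Expected O(log n): descend from the top level, then splice in."""
        update = [None] * MAX_LEVEL
        cur = self.head
        for i in reversed(range(MAX_LEVEL)):
            # "<=" places equal keys after existing ones (FIFO order).
            while cur.next[i] is not self.head and cur.next[i].key <= key:
                cur = cur.next[i]
            update[i] = cur
        node = Node(key, value, self._random_level())
        for i in range(node.level):
            nxt = update[i].next[i]
            node.prev[i], node.next[i] = update[i], nxt
            update[i].next[i] = node
            nxt.prev[i] = node
        return node

    def first(self):
        """O(1): the smallest key is always the head's level 0 successor."""
        node = self.head.next[0]
        return None if node is self.head else node

    def remove(self, node):
        """O(k), k = node.level <= MAX_LEVEL: unlink per level, no search."""
        for i in range(node.level):
            node.prev[i].next[i] = node.next[i]
            node.next[i].prev[i] = node.prev[i]


if __name__ == "__main__":
    sl = SkipList(seed=1)
    # Keys stand in for deadlines; the earliest deadline is picked first.
    nodes = {name: sl.insert(deadline, name)
             for name, deadline in [("a", 30), ("b", 10), ("c", 20)]}
    print(sl.first().value)
    sl.remove(nodes["b"])
    print(sl.first().value)
```

Because each task holds a reference to its own node, removal never has to search: it only unlinks the node at each of its levels, which is where the O(k) bound with k capped at 16 comes from.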

I did not find a specific cause for people's inability to suspend to ram so I doubt this has been fixed despite the large code update.

The list of patches making up bfs480 is as follows:

bfs472-fix_set_task_cpu.patch
skiplists.patch
bfs472-skiplist.patch
bfs-delay-smt-siblings.patch
bfs-fix-noninteractive-try-preempt.patch
bfs-ignore_local_busy.patch
bfs-rework-idles.patch
bfs-fix-schedutil.patch
bfs-v480.patch

As always I'm giving this to you not long after I've finished coding it so all the usual warnings apply, especially with an update of this size.

EDIT: Uniprocessor build fix: bfs480-fix-upbuild.patch
EDIT2: Here is a test patch to try and improve cpufreq behaviour: bfs480-rework_cpufreq.patch

Enjoy!
お楽しみ下さい
-ck

Friday 29 July 2016

BFS 472, linux-4.7-ck1

Announcing an updated BFS for linux-4.7 based kernels.

BFS by itself:
4.7-sched-bfs-472.patch

-ck branded linux-4.7-ck1 patches:
linux-4.7-ck1

This was quite a substantial merge effort this time around with a fair amount of changes in mainline kernel that affected the patch. Nonetheless everything appears to be working as planned in my limited testing. I'm unsure if the changes will fix the problems people had with suspend during the 4.6-bfs patches but the new code does touch that area. I was never affected on any of my machines so was unable to reproduce the problem in the first place.

In addition to the resync, a few minor changes have made their way into this release with respect to the way tasks preempt other tasks. See bfs470-updates.patch for details.

One other fairly significant change was properly hooking into the new schedutil parameters that drive cpufreq scaling governors. What I committed into bfs470 would not have been working properly in choosing the correct CPU frequency to run at and may have led to slowdowns and/or more power usage. This should be fixed in 472.

I should also mention that if, like me, you use the evil proprietary nvidia driver, the latest will not build with the current kernel and you'll need a couple of patches to get it working.

Enjoy!
お楽しみ下さい
-ck

EDIT: This patch will fix crashes when configured without SMT_NICE enabled:
bfs472-fix_set_task_cpu.patch
And will be applied to the next BFS release.