
The MuQSS CPU scheduler

April 20, 2017

This article was contributed by Nur Hussein

The scheduler is a topic of keen interest for the desktop user; the scheduling algorithm partially determines the responsiveness of the Linux desktop as a whole. Con Kolivas maintains a series of scheduler patch sets that he has tuned considerably over the years for his own use, focusing primarily on latency reduction for a better desktop experience. In early October 2016, Kolivas updated the design of his popular desktop scheduler patch set and renamed it MuQSS (Multiple Queue Skiplist Scheduler). It is an update (and a name change) of his previous scheduler, BFS, and it is designed to address the scalability problems that BFS ran into as the number of CPUs increased.

BFS

To understand the updates in MuQSS, we first take a look at the design of BFS, which is the basis for the new patch set. First released in 2009, BFS is a simplified scheduler that Kolivas wrote in response to his frustrations with the mainline scheduler for desktop use. The mainline scheduler has to perform well for CPU-bound tasks while also keeping the desktop smooth and snappy, which is a difficult balancing act to achieve. Kolivas does not believe in a one-size-fits-all scheduler, so he crafted his own specifically for desktop kernels.

BFS implements a single, global queue of tasks. There is no priority modification based on sleep behavior, and all priorities are set according to the relevant process's nice value. Kolivas argues that any kind of interactivity-estimation algorithm simply doesn't work well enough and produces too many false positives when trying to identify interactive tasks. The single-run-queue design was chosen because Kolivas wanted a scheduler that makes global decisions independently of which CPU a process originated on, and that avoids cycling through a number of different queues to find the next task to run.

BFS uses an earliest eligible virtual deadline first (EEVDF) algorithm to decide which tasks get to run when. Every task entering the run queue gets a time slice and a deadline. The time slice (rr_interval) determines the amount of CPU time every task receives. The deadline is computed as the current time plus the product of rr_interval and the priority level, which is the nice level expressed as a ratio. Processes with a lower nice value (and hence a higher priority) get earlier deadlines than lower-priority ones, so a high-priority task may get to run many times before a lower-priority task's deadline is reached. When a task's time slice expires, it is requeued with a new deadline. However, if the task sleeps or is preempted by a higher-priority task, its deadline is not recalculated; instead, the task is simply put back on the queue, where it will be considered again in the next round of scheduling.

The rr_interval tunable is set to 6ms by default; the reason is that the threshold of human perception of jitter is just under 7ms. Kolivas expects the common case to be between zero and two running tasks per CPU and, in that scenario, a task never has to wait longer than 6ms to be serviced.

BFS does away with explicit load-balancing algorithms between CPUs; instead, it selects a CPU for each process as it wakes up. The global run queue hands the next ready task to an available idle CPU. When selecting a CPU, the scheduler first tries the last CPU the task was running on. Failing that, the CPU selection moves "outward", next trying any hyperthreaded or core siblings that share cache. CPUs in different sockets or on different NUMA nodes are treated as "cache distant" and are the least preferred when selecting a CPU for a task to run on.
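
As a rough sketch of the idea (this is not the actual kernel code; the locality ranks and the helper function are purely illustrative), the "outward" search can be thought of as picking the idle CPU with the best cache locality relative to the waking task:

    #include <limits.h>
    #include <stdbool.h>

    /* Illustrative locality ranks, best (lowest) first. */
    enum cache_locality {
        LOCALITY_LAST_CPU,       /* the CPU the task last ran on */
        LOCALITY_SMT_SIBLING,    /* hyperthreaded sibling of that CPU */
        LOCALITY_CACHE_SIBLING,  /* another core sharing cache */
        LOCALITY_CACHE_DISTANT,  /* other socket or other NUMA node */
    };

    /*
     * Pick the idle CPU "closest" to where the waking task last ran;
     * locality[] is assumed to hold each CPU's rank relative to that task.
     */
    static int select_idle_cpu(const int *locality, const bool *idle, int nr_cpus)
    {
        int best = -1, best_rank = INT_MAX;

        for (int cpu = 0; cpu < nr_cpus; cpu++) {
            if (!idle[cpu])
                continue;
            if (locality[cpu] < best_rank) {
                best_rank = locality[cpu];
                best = cpu;
            }
        }
        return best;    /* -1: no idle CPU, the task waits on the queue */
    }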

Other simplifications in the design of BFS include the elimination of most of the tunable knobs and the lack of support for control groups. Kolivas argues that these are features desktop users have no interest in, and an excess of tuning knobs just creates confusion.

MuQSS

Because BFS uses a single run queue of tasks across all CPUs, protected by a global lock, Kolivas reports that it suffers from lock contention when the number of CPUs grows beyond 16. Every time the scheduler wants to pick the next task to run, it must scan the entire run queue for the one with the earliest deadline; iterating over a linked list in this way also leads to cache-thrashing behavior.

MuQSS is BFS with multiple run queues, one per CPU. Instead of linked lists, the queues have been implemented as skip lists. First described by William Pugh in 1990, skip lists are probabilistic data structures that give some of the performance advantages of a balanced binary tree (such as the red-black tree used in CFS), but with an implementation that is computationally less expensive — if less deterministically efficient. Kolivas's implementation is a custom skip list for his scheduler where the "levels" are static eight-entry arrays. The size of the array was chosen to be exactly one cache line wide.
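
As a rough illustration (this is not Kolivas's actual skiplist.c, which also maintains backward pointers and differs in detail), a deadline-ordered skip list with static eight-entry level arrays might look something like this:

    #include <stdint.h>
    #include <stdlib.h>

    #define SKIPLIST_LEVELS 8   /* static per-node arrays: one cache line of pointers */

    /* Field and function names are illustrative, not those of MuQSS. */
    struct skiplist_node {
        uint64_t key;                                 /* virtual deadline (niffies) */
        void *task;                                   /* the task this entry represents */
        int level;                                    /* levels this node participates in */
        struct skiplist_node *next[SKIPLIST_LEVELS];  /* forward pointers, one per level */
    };

    /* Flip coins to decide how many levels the new node occupies. */
    static int random_level(void)
    {
        int level = 1;

        while (level < SKIPLIST_LEVELS && (rand() & 1))
            level++;
        return level;
    }

    /*
     * Insert in deadline order: expected O(log n).  The head is a sentinel
     * whose next[] pointers start out NULL; node->key must already hold the
     * task's virtual deadline.
     */
    static void skiplist_insert(struct skiplist_node *head,
                                struct skiplist_node *node)
    {
        struct skiplist_node *update[SKIPLIST_LEVELS];
        struct skiplist_node *cur = head;

        /* Walk down from the top level, remembering the predecessors. */
        for (int lvl = SKIPLIST_LEVELS - 1; lvl >= 0; lvl--) {
            while (cur->next[lvl] && cur->next[lvl]->key < node->key)
                cur = cur->next[lvl];
            update[lvl] = cur;
        }

        node->level = random_level();
        for (int lvl = 0; lvl < node->level; lvl++) {
            node->next[lvl] = update[lvl]->next[lvl];
            update[lvl]->next[lvl] = node;
        }
    }

    /* The earliest deadline is always first on the bottom level: O(1). */
    static struct skiplist_node *skiplist_peek(struct skiplist_node *head)
    {
        return head->next[0];   /* NULL when the queue is empty */
    }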

The deadlines for MuQSS now use "niffies" to keep track of time. Niffies are a nanosecond-resolution monotonic counter whose value is obtained from the high-resolution TSC timers. Niffies were introduced for BFS sometime in 2010; BFS initially used jiffies to calculate the deadline. The virtual deadline algorithm is essentially the same as in BFS:

    virtual_deadline = niffies + (prio_ratio * rr_interval)

Again, prio_ratio is the nice level of a task normalized as a ratio and rr_interval is the time slice that the task gets at its nice level. When a task is inserted into the skip list queue, it is ordered by the value of its virtual deadline. As a result, the scheduler can find the next eligible task to run in O(1) time, as that task will be at the head of the queue. Insertion into the skip list is done in O(log n) time.
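
In code form, the calculation amounts to something like the following sketch (the type and parameter names are illustrative, not taken from the patch):

    #include <stdint.h>

    typedef uint64_t niffy_t;    /* nanosecond-resolution monotonic time */

    /*
     * prio_ratio grows with the nice level, so lower-nice (higher-priority)
     * tasks receive earlier virtual deadlines for the same rr_interval.
     */
    static niffy_t virtual_deadline(niffy_t now, unsigned int prio_ratio,
                                    niffy_t rr_interval)
    {
        return now + (niffy_t)prio_ratio * rr_interval;
    }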

When selecting a task to run, the scheduler does a lockless examination of every CPU's run queue to find an eligible task, picking the one with the nearest deadline. The scheduler uses a non-blocking "trylock" attempt when popping the chosen task from the relevant run queue, but simply moves on to the next-nearest deadline on another queue if it fails to acquire the queue lock. Although this means the scheduler does not always get to run the task with the absolute nearest deadline, it does alleviate the problem of lock contention between the per-CPU queues.
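
A simplified user-space sketch of that selection loop might look like the following; the data structures and the use of pthread spinlocks are illustrative stand-ins for the patch's per-CPU run queues and raw spinlocks:

    #include <pthread.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define MAX_CPUS 64          /* sketch limit; assume nr_cpus <= MAX_CPUS */

    struct task {
        uint64_t deadline;       /* virtual deadline in niffies */
        struct task *next;
    };

    struct runqueue {
        pthread_spinlock_t lock; /* the real patch uses kernel spinlocks */
        struct task *head;       /* earliest-deadline task, or NULL if empty */
    };

    static struct task *pick_next_task(struct runqueue *rqs, int nr_cpus)
    {
        bool skipped[MAX_CPUS] = { false };

        for (;;) {
            int best = -1;
            uint64_t best_deadline = UINT64_MAX;

            /* Lockless scan: peek at each queue's earliest deadline.
             * (The real code uses appropriate barriers for this race.) */
            for (int cpu = 0; cpu < nr_cpus; cpu++) {
                struct task *t = rqs[cpu].head;

                if (skipped[cpu] || !t)
                    continue;
                if (t->deadline < best_deadline) {
                    best_deadline = t->deadline;
                    best = cpu;
                }
            }
            if (best < 0)
                return NULL;                  /* nothing runnable anywhere */

            /* Non-blocking attempt on the winning queue only. */
            if (pthread_spin_trylock(&rqs[best].lock) == 0) {
                struct task *t = rqs[best].head;

                if (t)
                    rqs[best].head = t->next; /* pop the chosen task */
                pthread_spin_unlock(&rqs[best].lock);
                if (t)
                    return t;
                /* The queue emptied under us; rescan from scratch. */
            } else {
                skipped[best] = true;         /* contended: try next-nearest */
            }
        }
    }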

MuQSS, like BFS, is aware of the topology of the logical CPUs, whether they are hyperthreaded siblings, cores sharing a package, or on different NUMA nodes. Thanks to this hyperthreading awareness, an idle core, if available, will be selected before a hyperthreaded sibling CPU, avoiding the slowdown that comes from sharing a core's processing capacity between tasks on the same core. When two tasks are executing on a hyperthreaded core, the lower-priority task will have its time limited to allow the higher-priority task to get more CPU time. Also included is a feature called "SMT Nice", which effectively proportions CPU time on hyperthreaded siblings to reflect their nice priorities.

BFS (and subsequently MuQSS) introduced two new scheduling policies called SCHED_ISO and SCHED_IDLEPRIO. SCHED_ISO, or isochronous scheduling, is a policy that allows unprivileged processes to gain quasi-realtime behavior. A SCHED_ISO task will preempt SCHED_NORMAL tasks as long as it does not exceed a CPU-usage threshold of 70% by default (this is a tunable value) over a rolling five-second average. If more than one SCHED_ISO task is running at the same time, they execute round-robin; this is to prevent other tasks from starving. If a SCHED_ISO task exceeds the 70% threshold, it is temporarily demoted to SCHED_NORMAL and scheduled accordingly until enough time has elapsed that its average CPU usage dips below 70% again. Since the 70% threshold is calculated across all available CPUs on the system, it is possible on SMP machines for SCHED_ISO tasks to entirely consume one of the CPUs. For example, a single SCHED_ISO task on a dual-core machine can, at most, consume 50% of the total CPU capacity.

SCHED_IDLEPRIO is the opposite of SCHED_ISO in that it forces tasks to be ultra-low priority. Although it is similar to SCHED_IDLE in the mainline scheduler, there is a subtle difference: the mainline scheduler will still eventually run SCHED_IDLE tasks even if other, higher-priority tasks are running at the same time, but SCHED_IDLEPRIO tasks will only run when absolutely nothing else is demanding the CPU's attention. This is useful for batch tasks that scavenge for idle CPUs, such as SETI@Home. To avoid deadlocks, a SCHED_IDLEPRIO task that is holding a shared resource, such as a mutex or semaphore, will be promoted temporarily to SCHED_NORMAL, allowing it to run even if there are other higher-priority processes in the queue.
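
For the curious, requesting one of these policies from user space is an ordinary sched_setscheduler() call; note that the policy constants below are assumptions for illustration only, since the real definitions come from a BFS/MuQSS-patched kernel's headers rather than stock glibc:

    #include <sched.h>
    #include <stdio.h>
    #include <string.h>

    /* Assumed values; take the real definitions from a patched kernel. */
    #ifndef SCHED_ISO
    #define SCHED_ISO      4
    #endif
    #ifndef SCHED_IDLEPRIO
    #define SCHED_IDLEPRIO 5
    #endif

    int main(void)
    {
        struct sched_param param;

        memset(&param, 0, sizeof(param));
        param.sched_priority = 0;    /* these policies ignore static priority */

        /* Ask for quasi-realtime behavior; no privileges needed under MuQSS. */
        if (sched_setscheduler(0, SCHED_ISO, &param) != 0)
            perror("sched_setscheduler(SCHED_ISO)");

        /* A background scavenger would instead request SCHED_IDLEPRIO. */
        return 0;
    }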

There is a limited set of tunables that control the behavior of the scheduler (a sketch of reading them follows the list):

  • rr_interval, the CPU time quantum, which defaults to 6ms.
  • interactive, a tunable to toggle the deadline behavior. If disabled, searching for the next task to run is done independently on each CPU, instead of across all CPUs.
  • iso_cpu, which determines the percentage of CPU time, across a rolling five-second average, that isochronous tasks are allowed to use.
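
The small sketch below reads these tunables; the /proc/sys/kernel/ paths are assumptions inferred from the tunable names, so verify them against a patched kernel:

    #include <stdio.h>

    /* Returns the tunable's value, or -1 if it could not be read. */
    static long read_tunable(const char *path)
    {
        FILE *f = fopen(path, "r");
        long value = -1;

        if (f) {
            if (fscanf(f, "%ld", &value) != 1)
                value = -1;
            fclose(f);
        }
        return value;
    }

    int main(void)
    {
        printf("rr_interval = %ld ms\n",
               read_tunable("/proc/sys/kernel/rr_interval"));
        printf("interactive = %ld\n",
               read_tunable("/proc/sys/kernel/interactive"));
        printf("iso_cpu     = %ld%%\n",
               read_tunable("/proc/sys/kernel/iso_cpu"));
        return 0;
    }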

Comparison to the mainline kernel

Kolivas has not tried to merge his scheduler into the mainline kernel, and it is unlikely that he will, since his scheduler patches were written with his own desktop use case in mind. However, his patch set has a following among some Linux desktop users who report a "smoother" desktop experience from using it. Previous throughput benchmarks comparing BFS to the mainline CFS did not conclusively demonstrate an improvement from Kolivas's patch sets one way or the other. MuQSS reportedly scores better than CFS on the Interbench benchmark. However, ultimately, it is hard to quantify "smoothness" and "responsiveness" and turn them into an automated benchmark, so the best way for interested users to evaluate MuQSS is to try it out themselves.

[I would like to thank Con Kolivas for his help in answering questions about MuQSS.]


The MuQSS CPU scheduler

Posted Apr 20, 2017 18:34 UTC (Thu) by flussence (guest, #85566) [Link] (3 responses)

>it is hard to quantify "smoothness" and "responsiveness" and turn them into an automated benchmark
That's true, it's a pain to quantify things like microstutter in multimedia - you need extra hardware to measure that. MuQSS doesn't make Linux magically perform like BeOS all the time.

But my favourite number to bring up is throughput: I used to run Folding@Home (entirely CPU-bound, MPI-heavy, scientific number-crunching), and took note of every little tweak available at the time. Transparent huge pages was something like 2-3% speedup. Going from CFS to BFS was 25%.

The MuQSS CPU scheduler

Posted Apr 21, 2017 5:08 UTC (Fri) by Otus (subscriber, #67685) [Link]

>> it is hard to quantify "smoothness" and "responsiveness" and turn them into an automated benchmark
> That's true, it's a pain to quantify things like microstutter in multimedia - you need extra hardware to measure that.

Oughtn't it be measurable using something similar to the frame-time analysis they nowadays do for game benchmarking?

The MuQSS CPU scheduler

Posted Apr 21, 2017 13:31 UTC (Fri) by Sesse (subscriber, #53779) [Link] (1 responses)

If you can actually measure 25% speedup from CFS to BFS on a repeatable benchmark, I'm sure the CFS people would love to take a bug report. Anything objective and quantifiable is great news.

The MuQSS CPU scheduler

Posted Apr 21, 2017 17:39 UTC (Fri) by flussence (guest, #85566) [Link]

I believe this paper had a similar number: http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf

That's a bit more thorough than any bug report I could ever write.

Why not mainline it?

Posted Apr 21, 2017 1:02 UTC (Fri) by droundy (subscriber, #4559) [Link] (9 responses)

I can understand Kolivas not being interested in the effort to do so, but it sounds like a scheduler that would be of wide interest. Is there any other reason not to mainline this?

Why not mainline it?

Posted Apr 21, 2017 2:18 UTC (Fri) by conman (guest, #14830) [Link] (2 responses)

Linus has said in the past that he absolutely detests specialisation and doesn't want more than one scheduler in mainline. It's the same reason the "plugsched" pluggable CPU scheduler framework, which went along with the original alternative schedulers (staircase, RSDL, and staircase deadline), was abandoned. Therefore it would have to replace the mainline scheduler en bloc. For a hobbyist coder working on a scheduler in their spare time, there would be zero chance of creating something that meets all the criteria of trumping every mainline performance benchmark and feature in order to replace CFS.

Why not mainline it?

Posted Apr 21, 2017 4:57 UTC (Fri) by liam (subscriber, #84133) [Link]

Your comment brought this to mind

https://lists.linuxfoundation.org/pipermail/ksummit-discu...

It's an unfortunate thing that success can have downsides.

Lots of block schedulers though.

Posted Apr 30, 2017 5:11 UTC (Sun) by gmatht (subscriber, #58961) [Link]

I find it ironic that the LWN story next to this was "Two new block I/O schedulers for 4.12".

Any reason why? I guess there is a big single jump from rotational to SSD block devices that justifies at least two specialised block schedulers.

Why not mainline it?

Posted Apr 21, 2017 6:24 UTC (Fri) by mtaht (guest, #11087) [Link]

Ironically, I have been looking into similar methods for better handling of packet scheduling across multiple queues and pieces of hardware, to eliminate microbursts and the like. I'd settled on skip lists, a similar runqueue, and a similar deadline scheduler... long before reading this article. Also key is trying to push a few difficult things back into hardware where they belong so things can scale better past 10GigE.

That is a hobby project just now, I'm not trying to change the world, just take a fresh look at the design space.

Why not mainline it?

Posted Apr 21, 2017 13:29 UTC (Fri) by Sesse (subscriber, #53779) [Link] (4 responses)

The primary reason is that there are no actual tests indicating it's any better, short of hearsay. When BFS first came out, the Android team did blind user studies, and it didn't live up to the hype.

Why not mainline it?

Posted Apr 21, 2017 14:04 UTC (Fri) by epa (subscriber, #39769) [Link] (3 responses)

This sounds like the biggest reason to have pluggable schedulers: so the system can randomly switch between schedulers every few minutes and not tell you which one is running. Then you could report your subjective experience of smoothness by pressing a button or something, and have a proper blind trial.

Why not mainline it?

Posted Apr 21, 2017 17:25 UTC (Fri) by iabervon (subscriber, #722) [Link] (2 responses)

Pluggable schedulers would be a much harder problem, because most schedulers look at what processes did in the recent past in order to decide about policy. So, at best, switching schedulers will detune your performance when it happens.

That sort of comparison would probably just tell you which scheduler benefits least from tracking, since you probably won't be able to notice the difference in smoothness between two schedulers at steady state, as compared to the glitch when you switch to one that needs its information on what you're watching.

Why not mainline it?

Posted Apr 23, 2017 17:17 UTC (Sun) by epa (subscriber, #39769) [Link] (1 responses)

Right, so the period of switching between the two should be reasonably long (though surely ten minutes is enough?) and any user reports soon after switching to a new scheduler would have to be disregarded (again, surely one minute is enough for the scheduler to be warmed up again?).

Are you saying that, on a typical interactive workload, a scheduler tunes its decisions using more than just the last few seconds of activity?

Why not mainline it?

Posted Apr 23, 2017 21:22 UTC (Sun) by iabervon (subscriber, #722) [Link]

I'd expect, with a typical interactive workload, there are: (1) some things that are going to do work when some external, invisible to the user, trigger occurs (web page loads/updates); (2) some things that happen continuously (video); and (3) some things that are going to do work which will be visible to the user, but don't have anything to do until the user does something (rendering text the user types). Prioritizing (3) when it has something to do over (1) requires paying attention when the user does something, which could be arbitrarily long ago.

The MuQSS thesis is that that kind of tracking isn't really beneficial (i.e., you can get (3) enough time despite (1) based on behavior at the time), but if that's not true, you won't be able to see any benefits of that tracking if you weren't running CFS the last time you interacted with the type (3) program.

The MuQSS CPU scheduler

Posted Apr 21, 2017 15:01 UTC (Fri) by excors (subscriber, #95769) [Link] (3 responses)

> However, ultimately, it is hard to quantify "smoothness" and "responsiveness" and turn them into an automated benchmark, so the best way for interested users to evaluate MuQSS is to try it out themselves.

I'm not sure that's a sensible conclusion; people are generally terrible at subjective evaluation. It's a bit like saying "it's hard to accurately determine whether homeopathy is more effective than a placebo for treating X, so the best way to evaluate it is to try it out yourself". You'll just end up with a ton of statistically meaningless anecdotal evidence, and a lot of people forming strong beliefs based on their and their friends' anecdotes, and it will be very hard to convince those people they're wrong once someone does manage to do a high-quality evaluation, and meanwhile a lot of time and energy will have been wasted that could have been spent on a more effective evidence-based approach.

What interested users should perhaps do is try to develop narrow but quantifiable benchmarks for their specific use case, which is hopefully much easier than a comprehensive general-purpose benchmark. Record latency from when the kernel receives an input event until the application finishes rendering its updated display while running BitTorrent in the background, or record the number of times a game misses a vsync, or whatever. It won't show that one scheduler is strictly better than another, and it won't tell whether the thing you're measuring is really correlated with a subjectively better user experience, and it might not be a very good measurement at all, but it's going to be much better than simply looking at your computer and trying to say how much snappier it feels.

The MuQSS CPU scheduler

Posted Apr 22, 2017 5:00 UTC (Sat) by conman (guest, #14830) [Link] (2 responses)

People do occasionally give relatively more concrete examples of behavioural improvements. On the ck-hack blog a recent comment said:

"- primusrun and nvidia-xrun with intel_cpufreq schedutil makes all Valve games I've played open several seconds faster and leaves me with unbelievably low mouse latency on an Optimus system compared to mainline and Windows.

- I/O detection for my external keyboard and mouse is really fast and never fails to register compared to the few times that happened on mainline.

- Dota 2 on CFS caps at 30 FPS after reaching a specific load from multiple unit selection (even though it can run well above this on Pause). MuQSS does not have this issue.

- TTY switching is noticeably faster."

Reference:
http://ck-hack.blogspot.com/2017/02/linux-410-ck1-muqss-v...

The MuQSS CPU scheduler

Posted Apr 22, 2017 6:02 UTC (Sat) by drag (guest, #31333) [Link] (1 responses)

Unless the guy actually had a way to measure all that stuff it's still subjective.

That's not to say that he is wrong. It's just without actual repeatable measurements it's subjective...

The MuQSS CPU scheduler

Posted Apr 22, 2017 6:13 UTC (Sat) by conman (guest, #14830) [Link]

I did say "relatively", however frame rate is not subjective.

Is this way of thinking old-fashioned?

Posted Apr 24, 2017 15:20 UTC (Mon) by runekock (subscriber, #50229) [Link] (4 responses)

The way we think about a scheduler is to divide a fixed amount of processor-power among the various jobs.

But it's an antiquated idea that the amount of processor-power is fixed. We increasingly have more cores than we can allow to run at full speed simultaneously. The power budget is the important limit, not the physical number of cores. Having a hotplug to power cores up/down, a governor to control their speed, and a scheduler to then distribute the result, seems unable to efficiently solve the problem in the future.

We need to be able to say e.g. that a low priority task is only allowed to run on a core running at low frequency (because the same amount of work uses more power when run at a high frequency).

Maybe start by determining the best distribution of tasks on to the number of cores that we choose to have running. Then the best speed for each core. And finally a simple scheduler on each core. In other words: let the distribution become the job of the hotplug, not the scheduler.

Is this way of thinking old-fashioned?

Posted Apr 25, 2017 2:15 UTC (Tue) by epa (subscriber, #39769) [Link] (3 responses)

> the same amount of work uses more power when run at a high frequency
Is this really true?

Is this way of thinking old-fashioned?

Posted Apr 25, 2017 10:10 UTC (Tue) by runekock (subscriber, #50229) [Link]

I don't think anything in power management is true all the time :). But often.

Long story:
A logic gate uses power for switching and for leakage current. The power to perform one switch is proportional to the voltage, while leakage grows a lot faster than linearly with voltage. As voltage usually needs to be increased to allow the gate to work at a higher frequency, its power usage increases more than its work. The same applies to an entire core.

The opposing argument "race to sleep" is that the sooner the work gets done, the sooner we can power down everything. You can be pretty sure that things don't work out that way if you look at a core in isolation, but if more power hungry hardware is kept waiting, that's obviously a bad trade-off.

Is this way of thinking old-fashioned?

Posted Apr 25, 2017 11:02 UTC (Tue) by excors (subscriber, #95769) [Link] (1 responses)

As I understand it, it takes a roughly constant amount of energy to switch logic gates regardless of frequency. Increasing frequency increases power but decreases the time taken to complete the work, so the total energy stays the same. If some other parts of your system don't support frequency scaling, but do support a low-power sleep state, it's best to run the logic as fast as possible so you can put the rest of it to sleep sooner.

Except that you always want to run your logic at the lowest possible voltage that doesn't cause instability, and that voltage depends on frequency, and power scales with voltage squared, so high frequencies get really expensive. (That's why people have crazy liquid cooling systems to let them overclock their desktop PC CPUs by ~50%.)

At some point there's an optimal tradeoff between those factors. You can draw graphs of efficiency vs performance (source) - it seems most of those chips are most efficient at nearly but not quite their lowest supported frequency, but it varies a lot between different chip designs, and will depend on the characteristics of your specific workload.

From a broader point of view, the power/performance curve of a task isn't even continuous. If a game takes 17msec to compute each frame and therefore runs at 30fps (with vsync), and you run it slightly faster so it takes 16msec and can go at 60fps, now it's suddenly doing twice as much work per second and using twice as much power, though the user might be happier since the game is smoother (until their phone gets too hot to hold). That sounds hard for a scheduler to predict, without some signal from the application that it would be able to make good use of more performance.

(Of course in modern systems the OS doesn't really have any direct control over any of this stuff, it just has a protocol for negotiating with a power management chip running some proprietary firmware and who knows what that's going to do.)

Is this way of thinking old-fashioned?

Posted May 8, 2017 16:27 UTC (Mon) by s009988776655 (guest, #109620) [Link]

somewhat confusing arguments.
"Increasing frequency increases power but decreases the time taken to complete the work, so the total energy stays the same".
You need more power for higher frequency. Then later you write about the tradeoff. So yes, Intel and others choose the maximum frequency of their chips where approximately 1% more power means 1% more computing power. But you still use a bit more energy, because part of the nonlinearity comes from the fact that the coils and MOSFETs on the mainboard/GPU waste a lot of energy when you leave this sweet spot (someone estimated that an Nvidia Titan X Pascal consumes 60 watts in the coils, MOSFETs, and capacitors used for voltage regulation alone).
But on the other hand, the PSU runs more efficiently when there is some load (80+ Gold, Platinum, etc.). Still, the greatest energy savers were activating the C8 power states and C1E, which got my system's idle from ~60 to 30 watts including the graphics card, and reducing the voltage just the right amount to get linear performance: I got 4.2GHz at 1.2V instead of 4.0, or 4.0GHz at 1.168V. Because, as you said at the end, there is no way the OS scheduler or hardware frequency firmware can predict the best strategy for every workload, those parameters have the greatest impact, I guesstimate.

So, vsync 30 vs. vsync 60: traditional vsync syncs frames to your monitor's refresh rate, so it should compute the same number of frames but only display the ones matching the 60Hz of an LCD screen. A frame limiter should produce what you described, but "double the frames means double the power" from 30 to 60 is probably not true in general. I use a frame limiter because one game I play produces "coil whine" when the load is not high enough; the power load on my Titan X is almost the same at 120fps as at 60fps (limited).

This whole scheduler/benchmarking/power-consumption discussion screams for optimization/machine learning/supervised learning. You could change the scheduler and/or its parameters every day, gather whether the user liked the experience or not, and collect power consumption from the hardware monitors for the CPU and GPU and from an external wall power meter. At least we could find the "personally optimal" scheduler with reduced power consumption, and use unusual patterns for malware/rootkit detection (to build a case for lightweight kernel power monitoring).

I know YouTube does its scheduling, i.e. how load balancers request videos from the servers, with respect to power consumption. It makes no difference for an individual if you need 1% more watts on average, but the effect across all Linux users means fewer nuclear power plants.

I can't complain about my Ubuntu Unity desktop experience, but who knows whether GNOME, Wayland, etc. will annoy me. I still like the idea of power saving :)
So some students should find out. I would have liked the project: gather data, play with different window managers, play games, and learn about machine learning and scheduler algorithms =)


Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds