LWN: Comments on "The MuQSS CPU scheduler" https://lwn.net/Articles/720227/ This is a special feed containing comments posted to the individual LWN article titled "The MuQSS CPU scheduler". en-us Thu, 05 Sep 2024 03:09:06 +0000 Is this way of thinking old-fashioned? https://lwn.net/Articles/722139/ https://lwn.net/Articles/722139/ s009988776655 <div class="FormattedComment"> Somewhat confusing arguments.<br> "Increasing frequency increases power but decreases the time taken to complete the work, so the total energy stays the same".<br> You need more power for higher frequency. Then later you write about the tradeoff. So yes, Intel and others choose the max frequency of their chips where approx. 1% more power means 1% more computing power. But you still use a bit more energy, because part of the nonlinearity comes from the fact that the coils and MOSFETs on the mainboard/GPU waste a lot of energy when you leave this sweet spot (someone estimated an Nvidia Titan X Pascal consumes 60 watts in the coils, MOSFETs and capacitors used for voltage regulation alone).<br> On the other hand, the PSU runs more efficiently when there is some load (80+ Gold, Platinum, etc.). But still, the greatest energy savers are activating the C8 power states and C1E, which got my system's idle from ~60 to 30 watts including the graphics card, and reducing the voltage just the right amount to get linear performance. I got 4.2GHz at 1.2 volts instead of 4.0. Or 4.0GHz at 1.168V. Because, as you said at the end, there is no way the OS scheduler or hardware frequency firmware can predict the best strategy for every workload, those parameters have the greatest impact, I guesstimate.<br> <p> So vsync 30 vs vsync 60. Traditional vsync syncs frames to your monitor's refresh rate, so it should compute the same number of frames but only display the ones matching the 60Hz of an LCD screen. A frame limiter should produce what you described. But "double frames = double power" from 30 to 60 is probably not true in general. I use a frame limiter because a game I play produces coil whine when the load is not high enough. The power load on my Titan X is almost the same at 120fps or 60fps (limited).<br> <p> This whole scheduler/benchmarking/power-consumption discussion screams for optimization/machine learning/supervised learning. You could change the scheduler and/or parameters every day and just gather whether the user liked the experience or not, and collect power consumption from the hardware monitors for the CPU and GPU and from an external wall power meter. At least we could find the "personally optimal" scheduler with reduced power consumption. And use unusual patterns for malware/rootkit detection (to build a case for lightweight kernel power monitoring).<br> <p> I know YouTube does its scheduling, i.e. how load balancers request videos from the servers, with respect to power consumption. It makes no difference for an individual if you need 1% more watts on average, but the effect across all Linux users means fewer nuclear power plants.<br> <p> I can't complain about my Ubuntu Unity desktop experience. But who knows whether GNOME, Wayland, etc. will annoy me. I still like the idea of power saving :)<br> So some students should find out. I would have liked the project: gather data, play with different window managers, play games, and learn about machine learning and scheduler algorithms =)<br> </div> Mon, 08 May 2017 16:27:16 +0000 Lots of block schedulers though.
https://lwn.net/Articles/721414/ https://lwn.net/Articles/721414/ gmatht <div class="FormattedComment"> I find it ironic that the LWN story next to this was "Two new block I/O schedulers for 4.12".<br> <p> Any reason why? I guess there is one big jump, from rotational to SSD block devices, that justifies at least two specialised block schedulers.<br> </div> Sun, 30 Apr 2017 05:11:10 +0000 Is this way of thinking old-fashioned? https://lwn.net/Articles/720837/ https://lwn.net/Articles/720837/ excors <p>As I understand it, it takes a roughly constant amount of energy to switch logic gates regardless of frequency. Increasing frequency increases power but decreases the time taken to complete the work, so the total energy stays the same. If some other parts of your system don't support frequency scaling, but do support a low-power sleep state, it's best to run the logic as fast as possible so you can put the rest of it to sleep sooner.</p> <p>Except that you always want to run your logic at the lowest possible voltage that doesn't cause instability, and that voltage depends on frequency, and power scales with voltage squared, so high frequencies get really expensive. (That's why people have crazy liquid cooling systems to let them overclock their desktop PC CPUs by ~50%.)</p> <p>At some point there's an optimal tradeoff between those factors. You can draw <a href="http://images.anandtech.com/doci/9878/perfw.png">graphs of efficiency vs performance</a> (<a href="http://www.anandtech.com/show/9878/the-huawei-mate-8-review/3">source</a>) - it seems most of those chips are most efficient at nearly but not quite their lowest supported frequency, but it varies a lot between different chip designs, and will depend on the characteristics of your specific workload.</p> <p>From a broader point of view, the power/performance curve of a task isn't even continuous. If a game takes 17msec to compute each frame and therefore runs at 30fps (with vsync), and you run it slightly faster so it takes 16msec and can go at 60fps, now it's suddenly doing twice as much work per second and using twice as much power, though the user might be happier since the game is smoother (until their phone gets too hot to hold). That sounds hard for a scheduler to predict, without some signal from the application that it would be able to make good use of more performance.</p> <p>(Of course in modern systems the OS doesn't really have any direct control over any of this stuff, it just has a protocol for negotiating with a power management chip running some proprietary firmware and who knows what that's going to do.)</p> Tue, 25 Apr 2017 11:02:08 +0000 Is this way of thinking old-fashioned? https://lwn.net/Articles/720832/ https://lwn.net/Articles/720832/ runekock <div class="FormattedComment"> I don't think anything in power management is true all the time :). But often.<br> <p> Long story:<br> A logic gate uses power for switching and for leakage current. The energy to perform one switch is proportional to the square of the voltage, while leakage grows a lot faster than linearly with voltage. As voltage usually needs to be increased to allow the gate to work at a higher frequency, its power usage increases more than its work. The same applies to an entire core.<br> <p> The opposing argument, "race to sleep", is that the sooner the work gets done, the sooner we can power down everything.
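To put rough numbers on the trade-off described in the comments above: dynamic switching energy for a fixed job scales roughly with C·V² per cycle, leakage is paid for as long as the core stays awake, and the voltage has to rise with the frequency. The sketch below is a minimal illustration in Python; every constant in it (the effective capacitance, the two operating points, the V³ leakage model, the 3 W of "other hardware") is invented for illustration and does not describe any real chip.

<pre>
# Rough, invented numbers only: the point is the shape of the trade-off,
# not the behaviour of any real chip.

def job_energy(cycles, freq_hz, volts, c_eff=1e-9, leak_coeff=0.5):
    """Energy (J) and runtime (s) for a fixed job at one operating point.

    Dynamic energy per cycle ~ C * V^2; leakage (modelled here, arbitrarily,
    as ~V^3 watts) is paid for as long as the core stays awake.
    """
    runtime = cycles / freq_hz
    dynamic = c_eff * volts ** 2 * cycles
    leakage = leak_coeff * volts ** 3 * runtime
    return dynamic + leakage, runtime

work = 2e9  # cycles needed for the job

slow = job_energy(work, 1.0e9, 0.8)  # low frequency, low voltage
fast = job_energy(work, 2.0e9, 1.1)  # higher frequency forces higher voltage

print("slow: %.2f J in %.2f s" % slow)   # ~1.79 J in 2.00 s
print("fast: %.2f J in %.2f s" % fast)   # ~3.09 J in 1.00 s

# "Race to sleep" only pays off if finishing sooner lets something else
# power down; say the rest of the platform burns 3 W while the job runs.
other_watts = 3.0
print("slow + platform: %.2f J" % (slow[0] + other_watts * slow[1]))  # ~7.79 J
print("fast + platform: %.2f J" % (fast[0] + other_watts * fast[1]))  # ~6.09 J
</pre>

With these made-up numbers the faster operating point burns more energy if the core is considered in isolation, but comes out ahead once the rest of the platform can sleep a second earlier, which is the distinction the comment goes on to make.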
You can be pretty sure that things don't work out that way if you look at a core in isolation, but if more power-hungry hardware is kept waiting, then running the core slowly is obviously a bad trade-off. <br> </div> Tue, 25 Apr 2017 10:10:22 +0000 Is this way of thinking old-fashioned? https://lwn.net/Articles/720817/ https://lwn.net/Articles/720817/ epa <blockquote>the same amount of work uses more power when run at a high frequency</blockquote>Is this really true? Tue, 25 Apr 2017 02:15:08 +0000 Is this way of thinking old-fashioned? https://lwn.net/Articles/720745/ https://lwn.net/Articles/720745/ runekock <div class="FormattedComment"> The way we think about a scheduler is to divide a fixed amount of processor power among the various jobs.<br> <p> But it's an antiquated idea that the amount of processor power is fixed. We increasingly have more cores than we can allow to run at full speed simultaneously. The power budget is the important limit, not the physical number of cores. Having hotplug to power cores up and down, a governor to control their speed, and a scheduler to then distribute the result seems unlikely to solve the problem efficiently in the future. <br> <p> We need to be able to say, e.g., that a low-priority task is only allowed to run on a core running at low frequency (because the same amount of work uses more power when run at a high frequency).<br> <p> Maybe start by determining the best distribution of tasks onto the number of cores that we choose to have running. Then the best speed for each core. And finally a simple scheduler on each core. In other words: let the distribution become the job of the hotplug, not the scheduler. <br> </div> Mon, 24 Apr 2017 15:20:31 +0000 Why not mainline it? https://lwn.net/Articles/720722/ https://lwn.net/Articles/720722/ iabervon <div class="FormattedComment"> I'd expect, with a typical interactive workload, there are: (1) some things that are going to do work when some external trigger, invisible to the user, occurs (web page loads/updates); (2) some things that happen continuously (video); and (3) some things that are going to do work which will be visible to the user, but don't have anything to do until the user does something (rendering text the user types). Prioritizing (3), when it has something to do, over (1) requires having paid attention when the user did something, which could have been arbitrarily long ago.<br> <p> The MuQSS thesis is that that kind of tracking isn't really beneficial (i.e., you can give (3) enough time despite (1) based on behavior at the time), but if that's not true, you won't be able to see any benefits of that tracking if you weren't running CFS the last time you interacted with the type (3) program.<br> </div> Sun, 23 Apr 2017 21:22:12 +0000 Why not mainline it? https://lwn.net/Articles/720721/ https://lwn.net/Articles/720721/ epa <div class="FormattedComment"> Right, so the period of switching between the two should be reasonably long (though surely ten minutes is enough?)
and any user reports soon after switching to a new scheduler would have to be disregarded (again, surely one minute is enough for the scheduler to be warmed up again?).<br> <p> Are you saying that, on a typical interactive workload, a scheduler tunes its decisions using more than just the last few seconds of activity?<br> </div> Sun, 23 Apr 2017 17:17:59 +0000 The MuQSS CPU scheduler https://lwn.net/Articles/720691/ https://lwn.net/Articles/720691/ conman <div class="FormattedComment"> I did say "relatively"; however, frame rate is not subjective.<br> </div> Sat, 22 Apr 2017 06:13:41 +0000 The MuQSS CPU scheduler https://lwn.net/Articles/720690/ https://lwn.net/Articles/720690/ drag <div class="FormattedComment"> Unless the guy actually had a way to measure all that stuff, it's still subjective. <br> <p> That's not to say that he is wrong. It's just that without actual repeatable measurements it's subjective...<br> </div> Sat, 22 Apr 2017 06:02:09 +0000 The MuQSS CPU scheduler https://lwn.net/Articles/720688/ https://lwn.net/Articles/720688/ conman <div class="FormattedComment"> People do occasionally give relatively more concrete examples of behavioural improvements. On the ck-hack blog a recent comment said:<br> <p> "- primusrun and nvidia-xrun with intel_cpufreq schedutil makes all Valve games I've played open several seconds faster and leaves me with unbelievably low mouse latency on an Optimus system compared to mainline and Windows.<br> <p> - I/O detection for my external keyboard and mouse is really fast and never fails to register compared to the few times that happened on mainline.<br> <p> - Dota 2 on CFS caps at 30 FPS after reaching a specific load from multiple unit selection (even though it can run well above this on Pause). MuQSS does not have this issue.<br> <p> - TTY switching is noticeably faster."<br> <p> Reference:<br> <a rel="nofollow" href="http://ck-hack.blogspot.com/2017/02/linux-410-ck1-muqss-version-0152-for.html?showComment=1491945224634#c800486910894738928">http://ck-hack.blogspot.com/2017/02/linux-410-ck1-muqss-v...</a><br> </div> Sat, 22 Apr 2017 05:00:28 +0000 The MuQSS CPU scheduler https://lwn.net/Articles/720666/ https://lwn.net/Articles/720666/ flussence <div class="FormattedComment"> I believe this paper had a similar number: <a href="http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf">http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf</a><br> <p> That's a bit more thorough than any bug report I could ever write.<br> </div> Fri, 21 Apr 2017 17:39:20 +0000 Why not mainline it? https://lwn.net/Articles/720663/ https://lwn.net/Articles/720663/ iabervon <div class="FormattedComment"> Pluggable schedulers would be a much harder problem, because most schedulers look at what processes did in the recent past in order to decide about policy.
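The "recent past" in question is typically some per-task, decayed record of CPU use. The sketch below is a generic, hypothetical illustration of that kind of bookkeeping, not the actual accounting done by CFS or MuQSS; the decay constant and the update rule are invented. Its point is only that this state is accumulated over time and is simply absent the moment a different scheduler takes over.

<pre>
# Generic, hypothetical illustration of per-task history; the decay constant
# and update rule are invented, not CFS's or MuQSS's actual accounting.

class TaskHistory:
    """Exponentially decayed estimate of how much CPU a task has been using."""

    DECAY = 0.9  # arbitrary: how much of the old estimate survives each tick

    def __init__(self):
        self.recent_cpu = 0.0  # 0.0 = always sleeping, 1.0 = always running

    def tick(self, ran_this_tick):
        """Blend this tick's behaviour into the running estimate."""
        sample = 1.0 if ran_this_tick else 0.0
        self.recent_cpu = self.DECAY * self.recent_cpu + (1.0 - self.DECAY) * sample


editor, compiler = TaskHistory(), TaskHistory()
for _ in range(100):
    editor.tick(ran_this_tick=False)   # mostly idle, waiting for keystrokes
    compiler.tick(ran_this_tick=True)  # saturating a CPU the whole time

print("editor   recent_cpu:", round(editor.recent_cpu, 3))    # ~0.0
print("compiler recent_cpu:", round(compiler.recent_cpu, 3))  # ~1.0

# A policy built on this state can favour the editor the instant the user
# types.  Switch schedulers and the state is gone: every task starts over
# at 0.0 and the new scheduler has to relearn who is interactive.
</pre>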
So, at best, switching schedulers will detune your performance when it happens.<br> <p> That sort of comparison would probably just tell you which scheduler benefits least from tracking, since you probably won't be able to notice the difference in smoothness between two schedulers at steady state, as compared to the glitch when you switch to one that needs its information on what you're watching.<br> </div> Fri, 21 Apr 2017 17:25:21 +0000 The MuQSS CPU scheduler https://lwn.net/Articles/720653/ https://lwn.net/Articles/720653/ excors <div class="FormattedComment"> <font class="QuotedText">&gt; However, ultimately, it is hard to quantify "smoothness" and "responsiveness" and turn them into an automated benchmark, so the best way for interested users to evaluate MuQSS is to try it out themselves.</font><br> <p> I'm not sure that's a sensible conclusion; people are generally terrible at subjective evaluation. It's a bit like saying "it's hard to accurately determine whether homeopathy is more effective than a placebo for treating X, so the best way to evaluate it is to try it out yourself". You'll just end up with a ton of statistically meaningless anecdotal evidence, and a lot of people forming strong beliefs based on their and their friends' anecdotes, and it will be very hard to convince those people they're wrong once someone does manage to do a high-quality evaluation, and meanwhile a lot of time and energy will have been wasted that could have been spent on a more effective evidence-based approach.<br> <p> What interested users should perhaps do is try to develop narrow but quantifiable benchmarks for their specific use case, which is hopefully much easier than a comprehensive general-purpose benchmark. Record latency from when the kernel receives an input event until the application finishes rendering its updated display while running BitTorrent in the background, or record the number of times a game misses a vsync, or whatever. It won't show that one scheduler is strictly better than another, and it won't tell whether the thing you're measuring is really correlated with a subjectively better user experience, and it might not be a very good measurement at all, but it's going to be much better than simply looking at your computer and trying to say how much snappier it feels.<br> </div> Fri, 21 Apr 2017 15:01:22 +0000 Why not mainline it? https://lwn.net/Articles/720644/ https://lwn.net/Articles/720644/ epa <div class="FormattedComment"> This sounds like the biggest reason to have pluggable schedulers: so the system can randomly switch between schedulers every few minutes and not tell you which one is running. Then you could report your subjective experience of smoothness by pressing a button or something, and have a proper blind trial.<br> </div> Fri, 21 Apr 2017 14:04:11 +0000 The MuQSS CPU scheduler https://lwn.net/Articles/720637/ https://lwn.net/Articles/720637/ Sesse <div class="FormattedComment"> If you can actually measure 25% speedup from CFS to BFS on a repeatable benchmark, I'm sure the CFS people would love to take a bug report. Anything objective and quantifiable is great news.<br> </div> Fri, 21 Apr 2017 13:31:37 +0000 Why not mainline it? https://lwn.net/Articles/720636/ https://lwn.net/Articles/720636/ Sesse <div class="FormattedComment"> The primary reason is that there are no actual tests indicating it's any better, short of hearsay. 
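epa's blinded-comparison idea from earlier in the thread does not strictly require pluggable schedulers: booting a randomly chosen kernel per session and logging blinded ratings gets most of the way there. The sketch below is a hypothetical harness, not an existing tool; the log path, kernel names, labels and rating scale are all made up, and the bootloader-switching step is deliberately left out because it is distro-specific.

<pre>
# Hypothetical harness for a blinded scheduler comparison; nothing like
# this ships with the kernel or with MuQSS.
import json
import random
import time

LOGFILE = "sched-trial.jsonl"                         # invented path
KERNELS = {"A": "vmlinuz-cfs", "B": "vmlinuz-muqss"}  # labels stay blinded

def pick_next_session():
    """Pick the kernel for the next session; the user only ever sees the letter."""
    label = random.choice(sorted(KERNELS))
    # Pointing the bootloader at KERNELS[label] is left out here: it is
    # distro-specific (grub-reboot, kernelstub, ...).
    return label

def record_rating(label, rating):
    """Append a 1-5 'how smooth did this session feel' rating for a blinded label."""
    with open(LOGFILE, "a") as f:
        f.write(json.dumps({"when": time.time(), "label": label, "rating": rating}) + "\n")

def summarise():
    """After enough sessions, compare the two rating distributions, then unblind."""
    ratings = {"A": [], "B": []}
    with open(LOGFILE) as f:
        for line in f:
            entry = json.loads(line)
            ratings[entry["label"]].append(entry["rating"])
    for label, rs in sorted(ratings.items()):
        if rs:
            print(label, "mean rating %.2f over %d sessions" % (sum(rs) / len(rs), len(rs)))
</pre>

Because the letters are only mapped back to schedulers after the ratings are collected, this is at least the same shape of test as the blind user studies mentioned next.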
When BFS first came out, the Android team did blind user studies, and it didn't live up to the hype.<br> </div> Fri, 21 Apr 2017 13:29:51 +0000 Why not mainline it? https://lwn.net/Articles/720583/ https://lwn.net/Articles/720583/ mtaht <div class="FormattedComment"> Ironically, I have been looking into similar methods for better handling of packet scheduling across multiple queues and pieces of hardware, to eliminate microbursts and the like. I'd settled on skip lists, a similar runqueue, and a similar deadline scheduler... long before reading this article. Also key is trying to push a few difficult things back into hardware where they belong so things can scale better past 10GigE.<br> <p> That is a hobby project just now; I'm not trying to change the world, just take a fresh look at the design space. <br> </div> Fri, 21 Apr 2017 06:24:21 +0000 The MuQSS CPU scheduler https://lwn.net/Articles/720580/ https://lwn.net/Articles/720580/ Otus <div class="FormattedComment"> <font class="QuotedText">&gt;&gt; it is hard to quantify "smoothness" and "responsiveness" and turn them into an automated benchmark</font><br> <font class="QuotedText">&gt; That's true, it's a pain to quantify things like microstutter in multimedia - you need extra hardware to measure that.</font><br> <p> Oughtn't it be measurable using something similar to the frame-time analysis they nowadays do for game benchmarking?<br> </div> Fri, 21 Apr 2017 05:08:40 +0000 Why not mainline it? https://lwn.net/Articles/720579/ https://lwn.net/Articles/720579/ liam <div class="FormattedComment"> Your comment brought this to mind:<br> <p> <a href="https://lists.linuxfoundation.org/pipermail/ksummit-discuss/2016-August/003585.html">https://lists.linuxfoundation.org/pipermail/ksummit-discu...</a><br> <p> It's an unfortunate thing that success can have downsides.<br> </div> Fri, 21 Apr 2017 04:57:48 +0000 Why not mainline it? https://lwn.net/Articles/720578/ https://lwn.net/Articles/720578/ conman <div class="FormattedComment"> Linus has said in the past that he absolutely detests specialisation and doesn't want more than one scheduler in mainline. It's the same reason the "plugsched" pluggable CPU scheduler framework, which went along with the original alternative schedulers (staircase, RSDL and staircase deadline), was abandoned. Therefore it would have to replace the mainline scheduler en bloc. For an amateur coder working on a scheduler in their spare time as a hobby, there would be zero chance of creating something that trumps every mainline performance benchmark and feature, as would be required to replace CFS.<br> </div> Fri, 21 Apr 2017 02:18:22 +0000 Why not mainline it? https://lwn.net/Articles/720577/ https://lwn.net/Articles/720577/ droundy <div class="FormattedComment"> I can understand Kolivas not being interested in the effort to do so, but it sounds like a scheduler that would be of wide interest. Is there any other reason not to mainline this?<br> </div> Fri, 21 Apr 2017 01:02:18 +0000 The MuQSS CPU scheduler https://lwn.net/Articles/720559/ https://lwn.net/Articles/720559/ flussence <div class="FormattedComment"> <font class="QuotedText">&gt;it is hard to quantify "smoothness" and "responsiveness" and turn them into an automated benchmark</font><br> That's true, it's a pain to quantify things like microstutter in multimedia - you need extra hardware to measure that.
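The frame-time analysis Otus mentions does not necessarily need extra hardware if the game or compositor can log a timestamp for each presented frame: percentile frame times and a missed-vsync count expose microstutter that an average-fps figure hides. The sketch below is a minimal post-processing example; the trace at the bottom is invented, and the 1.5x-refresh threshold for calling a frame "missed" is an arbitrary choice.

<pre>
# Minimal frame-time analysis of a per-frame timestamp log (milliseconds).
# The trace below is invented; real data would come from the game, the
# compositor, or a tool that can log frame presentation times.

def frame_time_report(timestamps_ms, refresh_ms=16.67):
    """Summarise frame pacing: median/99th-percentile deltas and missed vsyncs."""
    deltas = sorted(b - a for a, b in zip(timestamps_ms, timestamps_ms[1:]))

    def pct(p):
        # p in whole percent; clamp to the last element
        return deltas[min(len(deltas) - 1, p * len(deltas) // 100)]

    missed = sum(1 for d in deltas if d > 1.5 * refresh_ms)  # arbitrary threshold
    return {"frames": len(deltas), "median_ms": pct(50),
            "p99_ms": pct(99), "missed_vsyncs": missed}

# Invented trace: steady 16.7 ms frames with an occasional 50 ms stutter,
# which barely moves the average but shows up clearly in the p99 figure
# and the missed-vsync count.
trace = [0.0]
for i in range(300):
    trace.append(trace[-1] + (50.0 if i % 50 == 25 else 16.7))
print(frame_time_report(trace))
</pre>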
MuQSS doesn't make Linux magically perform like BeOS all the time.<br> <p> But my favourite number to bring up is throughput: I used to run Folding@Home (entirely CPU-bound, MPI-heavy, scientific number-crunching), and took note of every little tweak available at the time. Transparent huge pages gave something like a 2-3% speedup. Going from CFS to BFS gave 25%.<br> </div> Thu, 20 Apr 2017 18:34:48 +0000