Announcing a major update for BFS for linux-4.7 based kernels.
BFS by itself:
4.7-sched-bfs-480.patch
-ck branded linux-4.7-ck2 patches:
linux-4.7-ck2
This is the largest BFS update in a long time. The various problems that had been accumulating forced me to spend a more extended period fixing BFS to work with the latest mainline changes and encouraged me to overhaul some areas that had long been needing it.
The changes are:
- Fixed the crash when SMT NICE is configured in on a CPU without SMT.
- Added my skiplist implementation.
- Converted BFS from its long-standing O(n) lookup to use skiplists.
- Fix crash when SMT NICE is enabled on some hardware
- Fix try_preempt missing the locality diff effect in non-interactive mode
- Ignore busy threads/caches when still on the same core
- Reworked the testing of idle threads and cores for less overhead and to correctly identify idle siblings
- Fix the CPU load that's passed to the cpu frequency governor, fixing a crash and non-working schedutil governor.
Actually incorporating the skiplists that I had experimented with a long time ago was decided on by the fact that I was able to trim the skiplist overhead further and maintain identical semantics for process selection (maintaining interactivity) whereas on the previous experiment I had never completed the work. Throughput testing shows virtually identical performance on normal workloads and theoretically would be helpful in extreme overload cases.
The original post regarding skip lists was here:
bfs-and-skip-lists.html
This now means that BFS is no longer O(n) lookup after O(1) insertion. It is now O(log(n)) insertion, O(1) lookup and O(k) removal where k <= 16, thereby tackling a long-standing criticism of the overall design.
I did not find a specific cause for peoples' inability to suspend to ram so I doubt this has been fixed despite the large code update.
The list of patches making up bfs480 is as follows:
bfs472-fix_set_task_cpu.patch skiplists.patch bfs472-skiplist.patch bfs-delay-smt-siblings.patch bfs-fix-noninteractive-try-preempt.patch bfs-ignore_local_busy.patch bfs-rework-idles.patch bfs-fix-schedutil.patch bfs-v480.patch
As always I'm giving this to you not long after I've finished coding it so all the usual warnings apply, especially with an update of this size.
EDIT: Uniprocessor build fix: bfs480-fix-upbuild.patch
EDIT2: Here is a test patch to try and improve cpufreq behaviour: bfs480-rework_cpufreq.patch
Enjoy!
お楽しみ下さい
-ck
What cpufreq governor are you using? Try switching them.
ReplyDeleteVery odd as that's the same governor I use... Did you have the same problem with ck1/bfs472?
ReplyDeleteAlso sounds governor related. Can you try pstate+performance temporarily on the i7 even just to see if the problem goes away?
ReplyDeleteThanks. There's an awful lot of changes and I'd like to find out what in particular causes your issue; the cpufrequency changes are the obvious culprit.
ReplyDeleteHi Con,
ReplyDeletethanks for new patch. On my systems i7 laptop (ondemand) and Phenom II (ondemand) kernel behaves ok. What I have done is - play native and wine (gallium 9) games, standard web browsing etc.
So far no crashes or weird behaviour.
Can it be that BFS 480 is a bit more power hungry than 472?
regards
Eduardo
The cpufrequency governors will respond more rapidly both up and down, so I guess it depends entirely on your workload though I expect it should be the same overall power use but snappier to respond. gourdcaptain above was using the pstate driver which has some different quirks in its behaviour to ondemand or schedutil (both of which should work fine on 480). You might want to try schedutil now instead of ondemand.
DeleteThanks for reply, I'll recompile the kernel with schedutil governor enabled.
DeleteBut before I do that and as this kernel will be meant to be used in a laptop where battery is needed, will it actually be beneficial to use BFS with 300Hz/500Hz? Sure 1000Hz is the best and I'll use it on desktop, but I'm afraid it'll drain battery fast on laptop.
With dynticks enabled, 1000Hz won't use any significant amount more battery than even 100Hz.
DeleteI compiled kernel with 1000Hz, schedutil and experimental cgroup stub patch.
DeleteWith schedutil governor CPU frequency seems quite high, for iddle system (boot to Unity desktop, in terminal open top -d 1 and i7z to monitor freq) it almost always stays at 2.0-2.1 GHz level, with odemand it stays approx at 1.3 GHz.
My CPU (i7-3612QM) standard freq ranges are 1.2 - 2.1 + turbo.
Other thing is that I have very reliable way to crash a kernel. See, my previous build (which do NOT crash) differs from this with: HZ (500 -> 1000), + schedutil + cgroup stub patch.
It's really odd, but when Ubuntu says "we have a problem, would you like to report it, so we fix it and everyone is happy", when I'm very helpful to say "report", it runs pkexec for authentication to run apport-gtk and as soon as I enter password and press enter the system freeze and caps lock is happily blinking, nothing in the logs, I even tried tail -f syslog, but it does not show anything. I can reproduce this 100% (I tried about 10 times).
br, Eduardo
Important detail: at the time of crash "ondemand" governor was active, so we can't blame "schedutil" yet.
Deletebr, Eduardo
in my haswell cpu with linux 4.7.2-1 the system freezes totally and i cant do nothing, but the reisub is not funcion, in the kernel 4.7.1-1 the system not freezes, the kernel i use is the kernel of gravinsky repo for haswell
ReplyDeletei use too the pstate driver of intel with powersave if it is important
DeleteThanks. I'm not so much interested in the reporting speeds as the behaviour. There is no such thing as an actual frequency with the pstate driver if you read the intel docs as they are continually fluctuating dynamically with microsecond precision changes within the CPU itself and the operating system can only guess what the current CPU speed is based on what the CPU reports back. The reported range will often show nonsense values. I see infinite sometimes as the CPU speed reported by i7z... Does it behave okay though?
ReplyDeleteThat's okay I understand and appreciate you testing it despite that concern. The latest changes shouldn't actually behave any different to previous BFS releases since the scheduling algorithm is identical just with a different way to look up the tasks. Only the load reported to cpu frequency has been radically changed. Thanks.
ReplyDeleteThere appears to be some performance regression with bfs480, that I noticed with xonotic. To test this, I played the same map against bots with standard 4.7.2 kernel, and both -ck1 and -ck2 patchsets, using different cpufreq governors, and kept an eye on fps meter:
ReplyDelete4.7.2-ck2
performance ~160-200fps
schedutil ~60-160fps
ondemand ~75-200fps
4.7.2-ARCH
performance ~175-200fps
schedutil ~90-170fps
ondemand ~150-200fps
With 4.7.2-ck1 I had pretty much same results as with 4.7.2-ARCH (although I think the game feels much smoother with -ck1).
I couldn't figure out how to run the official xonotic benchmark script, so I only kept an eye for max/min fps count, but the average fps is also certainly lower with -ck2.
I can run more accurate tests later if that seems useful.
in my laptop with amd and acpi cpufreq it is normal function, I hadn't any type of freeze, included with schedutil, but with my intel desktop.. I have to test with acpi cpufreq but with p-state powersave I have freezes randomly but much times.. it is impossible to do nothing at this way
ReplyDeleteThanks for the feedback guys. I'm pretty sure this is all cpufrequency related issues. I'll create a little patch when time permits that will revert the behaviour to the cpufrequency as used in ck1, though I'll need to do something specific for schedutil since that did not work at all in ck1.
ReplyDeleteI've posted a test patch to try and improve cpufrequency behaviour in the top post as edit2. Please give it a try.
ReplyDeletethanks, i will try when it gets update in the repos of gravisky
DeleteWhich bfs version should the patch be applied against?
DeleteI tried patching against 4.7-ck2, but it fails on bfs.c at hunk #6
It needs all the patches in pending/ applied
DeleteThanks, I didn't realize there were more pending patches than the uniprocessor build fix.
DeleteHowever I was unable to reproduce my previous issue with Xonotic with plain 4.7.2-ck2. It could've been pebkac error or something else on my system, or the recent mesa update fixed the issue.
Nevertheless, I run some benchmarks with graphicsmagick, both single thread (which should be a good indicator for cpufreq issues, as the thread keeps jumping from core to core and ramping the frequency up each time) and 1 thread per core. Results are here: http://pastebin.com/C7mnPKWt
schedutil governor doesn't seem to work optimally with -ck2, but pending and/or cpufreq rework patches brings it up to par with ondemand.
Also, kernel with cfs seems to be performing ~4% faster than bfs with this specific test. I also tried setting /proc/sys/kernel/interactive to 0, but that didn't have any major effect.
I'll report back if I notice any more strangeness, but for now everything seems to be in order.
Thanks Con for your continued work on bfs.
ReplyDeleteAfter reading the comments here and on Alfred Chen's blog about bfs performance issues on linux 4.7.2, I ran some tests on my i5-3210m.
I post the results in case you find it usefull.
The tests are :
build ffmpeg 2.8.7 with j2 and j4, compress the linux 4.4.11 sources with bz2 and xz (multi-threaded), transcode a song with lame and a video with x264.
Each test is run 3 times and realtime is measured with the time command.
The linux kernel is 4.7.2, configured with 300Hz timer, dynticks enabled, smt_nice disabled.
All patches in pending were applied for bfs.
The average time in seconds are:
cfs+pstates
make j2 290,192
make j4 246,975
bz2 59,685
xz 105,030
lame 12,011
x264 64,252
bfs+pstates
make j2 494,615
make j4 259,307
bz2 128,070
xz 145,811
lame 28,241
x264 66,334
bfs+ondemand
make j2 376,681
make j4 246,377
bz2 86,858
xz 104,323
lame 13,364
x264 64,868
bfs+performance
make j2 335,899
make j4 245,840
bz2 59,126
xz 104,875
lame 11,945
x264 63,967
bfs+pstates+bfs480-rework_cpufreq.patch
make j2 506,882
make j4 266,940
bz2 128,982
xz 131,345
lame 27,671
x264 66,298
Pedro
Oh I forgot. In case anyone wonder, the comma here is the decimal mark.
ReplyDeleteAnd by the way here are the results for linux 4.4+bfs that I use (same configuration as my previous post).
bfs 4.4+pstates
make j2 299,159
make j4 245,233
bz2 61,407
xz 104,823
lame 13,837
x264 64,039
Pedro
I have trouble posting this it seems...
ReplyDeleteHere I go again.
Thanks Con for your continued work on bfs.
After reading the comments here and on Alfred Chen's blog about bfs performance issues on linux 4.7.2, I ran some tests on my i5-3210m.
I post the results in case you find it usefull.
The tests are :
build ffmpeg 2.8.7 with j2 and j4, compress the linux 4.4.11 sources with bz2 and xz (multi-threaded), transcode a song with lame and a video with x264.
Each test is run 3 times and realtime is measured with the time command.
The linux kernel is 4.7.2, configured with 300Hz timer, dynticks enabled, smt_nice disabled.
All patches in pending were applied for bfs.
The average time in seconds are:
cfs+pstates
make j2 290,192
make j4 246,975
bz2 59,685
xz 105,030
lame 12,011
x264 64,252
bfs+pstates
make j2 494,615
make j4 259,307
bz2 128,070
xz 145,811
lame 28,241
x264 66,334
bfs+ondemand
make j2 376,681
make j4 246,377
bz2 86,858
xz 104,323
lame 13,364
x264 64,868
bfs+performance
make j2 335,899
make j4 245,840
bz2 59,126
xz 104,875
lame 11,945
x264 63,967
bfs+pstates+bfs480-rework_cpufreq.patch
make j2 506,882
make j4 266,940
bz2 128,982
xz 131,345
lame 27,671
x264 66,298
Pedro
I don't understand why my post keeps beeing deleted.
ReplyDeleteAnyway, thanks Con for your continued work on bfs.
After reading the comments here and on Alfred Chen's blog about bfs performance issues on linux 4.7.2, I ran some tests on my i5-3210m.
I post the results in case you find it usefull.
The tests are :
build ffmpeg 2.8.7 with j2 and j4, compress the linux 4.4.11 sources with bz2 and xz (multi-threaded), transcode a song with lame and a video with x264.
Each test is run 3 times and realtime is measured with the time command.
The linux kernel is 4.7.2, configured with 300Hz timer, dynticks enabled, smt_nice disabled.
All patches in pending were applied for bfs.
The average time in seconds are:
cfs+pstates
make j2 290,192
make j4 246,975
bz2 59,685
xz 105,030
lame 12,011
x264 64,252
bfs+pstates
make j2 494,615
make j4 259,307
bz2 128,070
xz 145,811
lame 28,241
x264 66,334
bfs+ondemand
make j2 376,681
make j4 246,377
bz2 86,858
xz 104,323
lame 13,364
x264 64,868
bfs+performance
make j2 335,899
make j4 245,840
bz2 59,126
xz 104,875
lame 11,945
x264 63,967
Pedro
And here are the results with the test patch
ReplyDeletebfs+pstates+bfs480-rework_cpufreq.patch
make j2 506,882
make j4 266,940
bz2 128,982
xz 131,345
lame 27,671
x264 66,298
Pedro
As my first post with all the results wasn't published (maybe it was too big), I'll just post the results with cfs.
ReplyDeleteMy cpu is a i5-3210m.
The tests are :
build ffmpeg 2.8.7 with j2 and j4, compress the linux 4.4.11 sources with bz2 and xz (multi-threaded), transcode a song with lame and a video with x264.
Each test is run 3 times and realtime is measured with the time command.
The linux kernel is 4.7.2, configured with 300Hz timer, dynticks enabled, smt_nice disabled.
The average time in seconds are:
cfs+pstates
make j2 290,192
make j4 246,975
bz2 59,685
xz 105,030
lame 12,011
x264 64,252
Thanks Con for your continued work on bfs.
Pedro
Thanks for those. I'm aware of the "performance" regressions as I consciously compromised throughput even further a while back when I added the interactive flag which you can enable and see how that affects throughput. However these are exacerbated by the poor signalling I'm giving cpu frequency which is why I've had to keep working on that component. To that end I have yet another patch to add on top of the rework cpufreq patch.
Deletehttp://ck.kolivas.org/patches/bfs/4.0/4.7/Testing/bfs480-more_cpufreq.patch
As the number of patches has grown a lot I will probably post another full BFS update since 480 had regressions anyway. So please try the extra patch on top of the rest and separately try disabling interactive mode
echo 0 > /proc/sys/kernel/interactive
Ok here are the results with bfs480-more_cpufreq.patch applied
Deletebfs+pstate+bfs480-rework_cpufreq.patch+bfs480-more_cpufreq.patch
make j2 522,045
make j4 278,736
bz2 129,263
xz 149,281
lame 24,166
x264 67,921
And the same with interactive=0
bfs+pstate+bfs480-rework_cpufreq.patch+bfs480-more_cpufreq.patch+interactive=0
make j2 488,732
make j4 283,691
bz2 130,898
xz 202,841
lame 30,567
x264 69,809
It seems bfs480-more_cpufreq.patch and interactive=0 didn't make any significant difference for the tests I run.
Let me know if you need more tests.
And take your time. My system is fine with linux 4.4+bfs :)
Pedro
Thanks. And one more one-liner patch to add on top please.
Deletehttp://ck.kolivas.org/patches/bfs/4.0/4.7/Testing/bfs480-revert_idle.patch
Ok, thanks.
DeleteI will try this one tomorrow.
I will also test linux 4.4+bfs with interactive=1.
I use interactive=0 on my system without feeling any difference compared to interactive=1, but I never did any benchmark.
Pedro
I was hoping that one day I'd improved things enough that I would be able to remove the interactive flag so maybe next release I'll try setting it to zero by default and wait till I get feedback from users.
DeleteI posted some benchmarks earlier, but apparently the post didn't make it here after all. Anyway, mysteriously I was unable my issues with Xonotic. Nevertheless, I run graphicsmagick benchmark, for single thread (which I'd imagine would be good for cpufreq testing, as the process jumps from core to core ramping up the frequency each time) and one thread per core. Results are here: http://pastebin.com/jBfA1NB3
ReplyDeleteNotably, schedutil governor has low performance with 4.7.2-ck2, and results fluctuated a lot between multiple runs. bfs480-rework_cpufreq.patch seems to bring schedutil up to par with ondemand, and benchmark results are more consistent between runs. However bfs480-more_cpufreq.patch seems to re-introduce the fluctuating test results, even though the overall performance seems pretty good.
CFS seems to perform better than BFS with this test, even with interactive mode disabled.
I posted some benchmarks earlier, but apparently the post didn't make it here after all. Anyway, mysteriously I was unable my issues with Xonotic. Nevertheless, I run graphicsmagick benchmark, for single thread (which I'd imagine would be good for cpufreq testing, as the process jumps from core to core ramping up the frequency each time) and one thread per core. Results are here: http://pastebin.com/jBfA1NB3
ReplyDeleteNotably, schedutil governor has low performance with 4.7.2-ck2, and results fluctuated a lot between multiple runs. bfs480-rework_cpufreq.patch seems to bring schedutil up to par with ondemand, and benchmark results are more consistent between runs. However bfs480-more_cpufreq.patch seems to re-introduce the fluctuating test results, even though the overall performance seems pretty good.
CFS seems to perform better than BFS with this test, even with interactive mode disabled.
Blogspot is filtering your posts as spam automatically. I've picked out your original comments and reinstated them now.
Delete