Friday, 2 September 2016

BFS 480 with skip lists, linux-4.7-ck2

Announcing a major update for BFS for linux-4.7 based kernels.

BFS by itself:
4.7-sched-bfs-480.patch

-ck branded linux-4.7-ck2 patches:
linux-4.7-ck2

This is the largest BFS update in a long time. The various problems that had been accumulating forced me to spend a more extended period fixing BFS to work with the latest mainline changes and encouraged me to overhaul some areas that had long been needing it.

The changes are:
  • Fixed the crash when SMT NICE is configured in on a CPU without SMT.
  • Added my skiplist implementation.
  • Converted BFS from its long-standing O(n) lookup to use skiplists.
  • Fix crash when SMT NICE is enabled on some hardware
  • Fix try_preempt missing the locality diff effect in non-interactive mode
  • Ignore busy threads/caches when still on the same core
  • Reworked the testing of idle threads and cores for less overhead and to correctly identify idle siblings
  • Fix the CPU load that's passed to the cpu frequency governor, fixing a crash and non-working schedutil governor.
The short summary is I've fixed a number of showstopper bugs on the last version, and improved throughput .

Actually incorporating the skiplists that I had experimented with a long time ago was decided on by the fact that I was able to trim the skiplist overhead further and maintain identical semantics for process selection (maintaining interactivity) whereas on the previous experiment I had never completed the work. Throughput testing shows virtually identical performance on normal workloads and theoretically would be helpful in extreme overload cases.

The original post regarding skip lists was here:
bfs-and-skip-lists.html


This now means that BFS is no longer O(n) lookup after O(1) insertion. It is now O(log(n)) insertion, O(1) lookup and O(k) removal where k <= 16, thereby tackling a long-standing criticism of the overall design.

I did not find a specific cause for peoples' inability to suspend to ram so I doubt this has been fixed despite the large code update.


The list of patches making up bfs480 is as follows:

bfs472-fix_set_task_cpu.patch
skiplists.patch
bfs472-skiplist.patch
bfs-delay-smt-siblings.patch
bfs-fix-noninteractive-try-preempt.patch
bfs-ignore_local_busy.patch
bfs-rework-idles.patch
bfs-fix-schedutil.patch
bfs-v480.patch

As always I'm giving this to you not long after I've finished coding it so all the usual warnings apply, especially with an update of this size.

EDIT: Uniprocessor build fix: bfs480-fix-upbuild.patch
EDIT2: Here is a test patch to try and improve cpufreq behaviour: bfs480-rework_cpufreq.patch

Enjoy!
お楽しみ下さい
-ck

36 comments:

  1. What cpufreq governor are you using? Try switching them.

    ReplyDelete
  2. Very odd as that's the same governor I use... Did you have the same problem with ck1/bfs472?

    ReplyDelete
  3. Also sounds governor related. Can you try pstate+performance temporarily on the i7 even just to see if the problem goes away?

    ReplyDelete
  4. Thanks. There's an awful lot of changes and I'd like to find out what in particular causes your issue; the cpufrequency changes are the obvious culprit.

    ReplyDelete
  5. Hi Con,

    thanks for new patch. On my systems i7 laptop (ondemand) and Phenom II (ondemand) kernel behaves ok. What I have done is - play native and wine (gallium 9) games, standard web browsing etc.
    So far no crashes or weird behaviour.

    Can it be that BFS 480 is a bit more power hungry than 472?

    regards
    Eduardo

    ReplyDelete
    Replies
    1. The cpufrequency governors will respond more rapidly both up and down, so I guess it depends entirely on your workload though I expect it should be the same overall power use but snappier to respond. gourdcaptain above was using the pstate driver which has some different quirks in its behaviour to ondemand or schedutil (both of which should work fine on 480). You might want to try schedutil now instead of ondemand.

      Delete
    2. Thanks for reply, I'll recompile the kernel with schedutil governor enabled.
      But before I do that and as this kernel will be meant to be used in a laptop where battery is needed, will it actually be beneficial to use BFS with 300Hz/500Hz? Sure 1000Hz is the best and I'll use it on desktop, but I'm afraid it'll drain battery fast on laptop.

      Delete
    3. With dynticks enabled, 1000Hz won't use any significant amount more battery than even 100Hz.

      Delete
    4. I compiled kernel with 1000Hz, schedutil and experimental cgroup stub patch.
      With schedutil governor CPU frequency seems quite high, for iddle system (boot to Unity desktop, in terminal open top -d 1 and i7z to monitor freq) it almost always stays at 2.0-2.1 GHz level, with odemand it stays approx at 1.3 GHz.
      My CPU (i7-3612QM) standard freq ranges are 1.2 - 2.1 + turbo.

      Other thing is that I have very reliable way to crash a kernel. See, my previous build (which do NOT crash) differs from this with: HZ (500 -> 1000), + schedutil + cgroup stub patch.
      It's really odd, but when Ubuntu says "we have a problem, would you like to report it, so we fix it and everyone is happy", when I'm very helpful to say "report", it runs pkexec for authentication to run apport-gtk and as soon as I enter password and press enter the system freeze and caps lock is happily blinking, nothing in the logs, I even tried tail -f syslog, but it does not show anything. I can reproduce this 100% (I tried about 10 times).

      br, Eduardo

      Delete
    5. Important detail: at the time of crash "ondemand" governor was active, so we can't blame "schedutil" yet.

      br, Eduardo

      Delete
  6. in my haswell cpu with linux 4.7.2-1 the system freezes totally and i cant do nothing, but the reisub is not funcion, in the kernel 4.7.1-1 the system not freezes, the kernel i use is the kernel of gravinsky repo for haswell

    ReplyDelete
    Replies
    1. i use too the pstate driver of intel with powersave if it is important

      Delete
  7. Thanks. I'm not so much interested in the reporting speeds as the behaviour. There is no such thing as an actual frequency with the pstate driver if you read the intel docs as they are continually fluctuating dynamically with microsecond precision changes within the CPU itself and the operating system can only guess what the current CPU speed is based on what the CPU reports back. The reported range will often show nonsense values. I see infinite sometimes as the CPU speed reported by i7z... Does it behave okay though?

    ReplyDelete
  8. That's okay I understand and appreciate you testing it despite that concern. The latest changes shouldn't actually behave any different to previous BFS releases since the scheduling algorithm is identical just with a different way to look up the tasks. Only the load reported to cpu frequency has been radically changed. Thanks.

    ReplyDelete
  9. There appears to be some performance regression with bfs480, that I noticed with xonotic. To test this, I played the same map against bots with standard 4.7.2 kernel, and both -ck1 and -ck2 patchsets, using different cpufreq governors, and kept an eye on fps meter:

    4.7.2-ck2
    performance ~160-200fps
    schedutil ~60-160fps
    ondemand ~75-200fps

    4.7.2-ARCH
    performance ~175-200fps
    schedutil ~90-170fps
    ondemand ~150-200fps

    With 4.7.2-ck1 I had pretty much same results as with 4.7.2-ARCH (although I think the game feels much smoother with -ck1).

    I couldn't figure out how to run the official xonotic benchmark script, so I only kept an eye for max/min fps count, but the average fps is also certainly lower with -ck2.

    I can run more accurate tests later if that seems useful.

    ReplyDelete
  10. in my laptop with amd and acpi cpufreq it is normal function, I hadn't any type of freeze, included with schedutil, but with my intel desktop.. I have to test with acpi cpufreq but with p-state powersave I have freezes randomly but much times.. it is impossible to do nothing at this way

    ReplyDelete
  11. Thanks for the feedback guys. I'm pretty sure this is all cpufrequency related issues. I'll create a little patch when time permits that will revert the behaviour to the cpufrequency as used in ck1, though I'll need to do something specific for schedutil since that did not work at all in ck1.

    ReplyDelete
  12. I've posted a test patch to try and improve cpufrequency behaviour in the top post as edit2. Please give it a try.

    ReplyDelete
    Replies
    1. thanks, i will try when it gets update in the repos of gravisky

      Delete
    2. Which bfs version should the patch be applied against?

      I tried patching against 4.7-ck2, but it fails on bfs.c at hunk #6

      Delete
    3. It needs all the patches in pending/ applied

      Delete
    4. Thanks, I didn't realize there were more pending patches than the uniprocessor build fix.

      However I was unable to reproduce my previous issue with Xonotic with plain 4.7.2-ck2. It could've been pebkac error or something else on my system, or the recent mesa update fixed the issue.

      Nevertheless, I run some benchmarks with graphicsmagick, both single thread (which should be a good indicator for cpufreq issues, as the thread keeps jumping from core to core and ramping the frequency up each time) and 1 thread per core. Results are here: http://pastebin.com/C7mnPKWt

      schedutil governor doesn't seem to work optimally with -ck2, but pending and/or cpufreq rework patches brings it up to par with ondemand.

      Also, kernel with cfs seems to be performing ~4% faster than bfs with this specific test. I also tried setting /proc/sys/kernel/interactive to 0, but that didn't have any major effect.

      I'll report back if I notice any more strangeness, but for now everything seems to be in order.

      Delete
  13. Thanks Con for your continued work on bfs.

    After reading the comments here and on Alfred Chen's blog about bfs performance issues on linux 4.7.2, I ran some tests on my i5-3210m.
    I post the results in case you find it usefull.

    The tests are :
    build ffmpeg 2.8.7 with j2 and j4, compress the linux 4.4.11 sources with bz2 and xz (multi-threaded), transcode a song with lame and a video with x264.
    Each test is run 3 times and realtime is measured with the time command.

    The linux kernel is 4.7.2, configured with 300Hz timer, dynticks enabled, smt_nice disabled.
    All patches in pending were applied for bfs.

    The average time in seconds are:

    cfs+pstates
    make j2   290,192
    make j4   246,975
    bz2           59,685
    xz            105,030
    lame          12,011
    x264          64,252

    bfs+pstates
    make j2   494,615
    make j4   259,307
    bz2          128,070
    xz            145,811
    lame          28,241
    x264          66,334

    bfs+ondemand
    make j2   376,681
    make j4   246,377
    bz2           86,858
    xz            104,323
    lame          13,364
    x264          64,868

    bfs+performance
    make j2   335,899
    make j4   245,840
    bz2           59,126
    xz            104,875
    lame          11,945
    x264          63,967

    bfs+pstates+bfs480-rework_cpufreq.patch
    make j2   506,882
    make j4   266,940
    bz2          128,982
    xz            131,345
    lame          27,671
    x264          66,298

    Pedro

    ReplyDelete
  14. Oh I forgot. In case anyone wonder, the comma here is the decimal mark.

    And by the way here are the results for linux 4.4+bfs that I use (same configuration as my previous post).

    bfs 4.4+pstates
    make j2   299,159
    make j4   245,233
    bz2           61,407
    xz            104,823
    lame          13,837
    x264          64,039

    Pedro

    ReplyDelete
  15. I have trouble posting this it seems...
    Here I go again.

    Thanks Con for your continued work on bfs.

    After reading the comments here and on Alfred Chen's blog about bfs performance issues on linux 4.7.2, I ran some tests on my i5-3210m.
    I post the results in case you find it usefull.

    The tests are :
    build ffmpeg 2.8.7 with j2 and j4, compress the linux 4.4.11 sources with bz2 and xz (multi-threaded), transcode a song with lame and a video with x264.
    Each test is run 3 times and realtime is measured with the time command.

    The linux kernel is 4.7.2, configured with 300Hz timer, dynticks enabled, smt_nice disabled.
    All patches in pending were applied for bfs.

    The average time in seconds are:

    cfs+pstates
    make j2   290,192
    make j4   246,975
    bz2           59,685
    xz            105,030
    lame          12,011
    x264          64,252

    bfs+pstates
    make j2   494,615
    make j4   259,307
    bz2          128,070
    xz            145,811
    lame          28,241
    x264          66,334

    bfs+ondemand
    make j2   376,681
    make j4   246,377
    bz2           86,858
    xz            104,323
    lame          13,364
    x264          64,868

    bfs+performance
    make j2   335,899
    make j4   245,840
    bz2           59,126
    xz            104,875
    lame          11,945
    x264          63,967

    bfs+pstates+bfs480-rework_cpufreq.patch
    make j2   506,882
    make j4   266,940
    bz2          128,982
    xz            131,345
    lame          27,671
    x264          66,298

    Pedro

    ReplyDelete
  16. I don't understand why my post keeps beeing deleted.
    Anyway, thanks Con for your continued work on bfs.

    After reading the comments here and on Alfred Chen's blog about bfs performance issues on linux 4.7.2, I ran some tests on my i5-3210m.
    I post the results in case you find it usefull.

    The tests are :
    build ffmpeg 2.8.7 with j2 and j4, compress the linux 4.4.11 sources with bz2 and xz (multi-threaded), transcode a song with lame and a video with x264.
    Each test is run 3 times and realtime is measured with the time command.

    The linux kernel is 4.7.2, configured with 300Hz timer, dynticks enabled, smt_nice disabled.
    All patches in pending were applied for bfs.

    The average time in seconds are:

    cfs+pstates
    make j2   290,192
    make j4   246,975
    bz2            59,685
    xz            105,030
    lame          12,011
    x264          64,252

    bfs+pstates
    make j2   494,615
    make j4   259,307
    bz2          128,070
    xz            145,811
    lame          28,241
    x264          66,334

    bfs+ondemand
    make j2   376,681
    make j4   246,377
    bz2            86,858
    xz            104,323
    lame          13,364
    x264          64,868

    bfs+performance
    make j2   335,899
    make j4   245,840
    bz2            59,126
    xz            104,875
    lame          11,945
    x264          63,967


    Pedro

    ReplyDelete
  17. And here are the results with the test patch

    bfs+pstates+bfs480-rework_cpufreq.patch
    make j2   506,882
    make j4   266,940
    bz2          128,982
    xz            131,345
    lame          27,671
    x264          66,298

    Pedro

    ReplyDelete
  18. As my first post with all the results wasn't published (maybe it was too big), I'll just post the results with cfs.

    My cpu is a i5-3210m.
    The tests are :
    build ffmpeg 2.8.7 with j2 and j4, compress the linux 4.4.11 sources with bz2 and xz (multi-threaded), transcode a song with lame and a video with x264.
    Each test is run 3 times and realtime is measured with the time command.

    The linux kernel is 4.7.2, configured with 300Hz timer, dynticks enabled, smt_nice disabled.

    The average time in seconds are:

    cfs+pstates
    make j2   290,192
    make j4   246,975
    bz2            59,685
    xz            105,030
    lame          12,011
    x264          64,252

    Thanks Con for your continued work on bfs.

    Pedro

    ReplyDelete
    Replies
    1. Thanks for those. I'm aware of the "performance" regressions as I consciously compromised throughput even further a while back when I added the interactive flag which you can enable and see how that affects throughput. However these are exacerbated by the poor signalling I'm giving cpu frequency which is why I've had to keep working on that component. To that end I have yet another patch to add on top of the rework cpufreq patch.
      http://ck.kolivas.org/patches/bfs/4.0/4.7/Testing/bfs480-more_cpufreq.patch
      As the number of patches has grown a lot I will probably post another full BFS update since 480 had regressions anyway. So please try the extra patch on top of the rest and separately try disabling interactive mode
      echo 0 > /proc/sys/kernel/interactive

      Delete
    2. Ok here are the results with bfs480-more_cpufreq.patch applied

      bfs+pstate+bfs480-rework_cpufreq.patch+bfs480-more_cpufreq.patch
      make j2   522,045
      make j4   278,736
      bz2          129,263
      xz            149,281
      lame          24,166
      x264          67,921

      And the same with interactive=0

      bfs+pstate+bfs480-rework_cpufreq.patch+bfs480-more_cpufreq.patch+interactive=0
      make j2   488,732
      make j4   283,691
      bz2          130,898
      xz            202,841
      lame          30,567
      x264          69,809

      It seems bfs480-more_cpufreq.patch and interactive=0 didn't make any significant difference for the tests I run.

      Let me know if you need more tests.
      And take your time. My system is fine with linux 4.4+bfs :)

      Pedro

      Delete
    3. Thanks. And one more one-liner patch to add on top please.
      http://ck.kolivas.org/patches/bfs/4.0/4.7/Testing/bfs480-revert_idle.patch

      Delete
    4. Ok, thanks.
      I will try this one tomorrow.
      I will also test linux 4.4+bfs with interactive=1.
      I use interactive=0 on my system without feeling any difference compared to interactive=1, but I never did any benchmark.

      Pedro

      Delete
    5. I was hoping that one day I'd improved things enough that I would be able to remove the interactive flag so maybe next release I'll try setting it to zero by default and wait till I get feedback from users.

      Delete
  19. I posted some benchmarks earlier, but apparently the post didn't make it here after all. Anyway, mysteriously I was unable my issues with Xonotic. Nevertheless, I run graphicsmagick benchmark, for single thread (which I'd imagine would be good for cpufreq testing, as the process jumps from core to core ramping up the frequency each time) and one thread per core. Results are here: http://pastebin.com/jBfA1NB3

    Notably, schedutil governor has low performance with 4.7.2-ck2, and results fluctuated a lot between multiple runs. bfs480-rework_cpufreq.patch seems to bring schedutil up to par with ondemand, and benchmark results are more consistent between runs. However bfs480-more_cpufreq.patch seems to re-introduce the fluctuating test results, even though the overall performance seems pretty good.

    CFS seems to perform better than BFS with this test, even with interactive mode disabled.

    ReplyDelete
  20. I posted some benchmarks earlier, but apparently the post didn't make it here after all. Anyway, mysteriously I was unable my issues with Xonotic. Nevertheless, I run graphicsmagick benchmark, for single thread (which I'd imagine would be good for cpufreq testing, as the process jumps from core to core ramping up the frequency each time) and one thread per core. Results are here: http://pastebin.com/jBfA1NB3

    Notably, schedutil governor has low performance with 4.7.2-ck2, and results fluctuated a lot between multiple runs. bfs480-rework_cpufreq.patch seems to bring schedutil up to par with ondemand, and benchmark results are more consistent between runs. However bfs480-more_cpufreq.patch seems to re-introduce the fluctuating test results, even though the overall performance seems pretty good.

    CFS seems to perform better than BFS with this test, even with interactive mode disabled.

    ReplyDelete
    Replies
    1. Blogspot is filtering your posts as spam automatically. I've picked out your original comments and reinstated them now.

      Delete