Sunday 20 July 2008
I had an idea this morning that I thought would make my Mandelbrot viewer Aptus run a little bit faster. The compute engine is written in C for speed, but with a Python progress callback function passed in to get updates on the state of the computation. The code is structured like this:
// Code A: .1 sec
for each scanline:
    compute the pixels
    call the progress function
The progress function is called once per scanline, so for the default 600×600-pixel view, it is called 600 times. Computing this default view takes .1 second.
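For those who haven't written this kind of extension: the C engine holds a reference to the Python callback and invokes it through the Python C API, along these lines. This is only a sketch with made-up names, not the actual Aptus code:

#include <Python.h>

/* Sketch only: report progress back to a Python callable from C.
   The "ds" format string builds a (float, str) argument tuple. */
static int
report_progress(PyObject *progress_fn, double fraction, const char *info)
{
    PyObject *result = PyObject_CallFunction(progress_fn, "ds", fraction, info);
    if (result == NULL) {
        return -1;      /* the Python callback raised an exception */
    }
    Py_DECREF(result);  /* discard whatever the callback returned */
    return 0;
}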
My idea was to invoke the progress function less often. There’s no need to invoke it as often as 600 times in a tenth of a second. Since the progress function is written in Python, I figured I could save some time by avoiding some of that Python execution.
As a quick test of my thesis, I commented out the progress call entirely:
// Code B: .06 sec
for each scanline:
    compute the pixels
    //call the progress function
Now the computation took .06 seconds, a significant improvement! It looks like 40% of our time is spent reporting progress through a Python function.
The basic unit of computation for the Mandelbrot set is an iteration, and I was already counting the total number of iterations. So I changed the code to call the progress function only if a minimum number of iterations had been calculated since the last progress call:
// Code C: .1 sec
for each scanline:
    compute the pixels
    if min_progress_delta (1M) exceeded:
        call the progress function
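In the real C code that check is just a counter comparison. Here's a sketch with hypothetical names (compute_scanline stands in for the per-scanline work and is assumed to return the number of iterations it used; report_progress is the callback helper sketched above):

/* Sketch only: call the progress function at most once per
   min_progress_delta iterations. */
static void
compute_all_scanlines(PyObject *progress_fn, int height)
{
    long long total_iters = 0;      /* iterations computed so far */
    long long last_progress = 0;    /* total at the last progress call */
    const long long min_progress_delta = 1000000;

    for (int y = 0; y < height; y++) {
        total_iters += compute_scanline(y);   /* hypothetical helper */
        if (total_iters - last_progress > min_progress_delta) {
            report_progress(progress_fn, (double)(y + 1) / height, "computing");
            last_progress = total_iters;
        }
    }
}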
With Code C in place, the computation still took the original .1 seconds. That's odd. The total iteration count in this case is 1.9 million, so we exceed min_progress_delta only once, and the progress function is called only once. How can this be? In Code A we invoke the progress function 600 times, and in Code C we invoke it once, yet they take the same amount of time. In Code B we don't invoke it at all, and the computation speeds up by 40%. How can the single call that separates Code B from Code C make such a difference?
Odder still, suppose we change min_progress_delta to two million, so that the progress function is never invoked at all:
// Code D: .1 sec
for each scanline:
    compute the pixels
    if min_progress_delta (2M) exceeded:
        call the progress function
It still takes .1 second! More experimentation: comment out the call to the uncalled progress function:
// Code E: .06 sec
for each scanline:
    compute the pixels
    if min_progress_delta (2M) exceeded:
        //call the progress function
Now it takes .06 seconds! How can commenting out code that is never called make a difference? We're starting to zero in on the issue here: we didn't simply comment out uncalled code, we commented out the entire body of the if clause. That let the C compiler eliminate the test of the if as well, since with an empty body there's nothing left for the condition to decide.
It’s beginning to look like the simple act of doing anything at the bottom of the loop is taking time. That’s the only explanation for the data we have so far.
I’m no expert in these matters, but I’ve read enough about pipelines and caches to know that this is entirely plausible. When the code is uncluttered with detours, it goes much faster than when the end of the loop pauses to consider whether to invoke the progress function. Ironically, actually calling the function and invoking all of the Python overhead is insignificant compared to the time lost to simply deciding (in C!) whether to call it.
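One standard trick, if the branch really is the culprit, is to tell the compiler that the branch is almost never taken, using GCC's __builtin_expect wrapped in the usual likely/unlikely macros. A sketch, reusing the names from the throttling sketch above; I haven't verified that it helps here:

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

    /* ... at the bottom of the scanline loop ... */
    if (unlikely(total_iters - last_progress > min_progress_delta)) {
        report_progress(progress_fn, (double)(y + 1) / height, "computing");
        last_progress = total_iters;
    }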
Or am I missing something here? Is there a way to invoke my callback function without putting hiccups in my pipeline?
Comments
After speculating for a bit, I stopped being lazy and downloaded the current Aptus source. Nice code.
I see that preparation for the progress-checking call involves multiple calls to sprintf. Ouch. That has to be a performance hit. There seem to be plenty of branches in here, on each iteration, so I wouldn't expect one mis-predicted branch to hurt so much. (Though I do notice that the progress function might be called more than once per scanline.)
Definitely something fishy going on.
Do you get similar results when you run with non-optimised and optimised compiles?
@Alan: after I posted this, I also wondered how any code executed only 600 times could have such a huge impact, especially since there are plenty of other conditionals in the inner loop (once per pixel), which should have a far larger effect. I guess there's still some mystery here...
At first blush, from the pseudocode I would guess that it is the function call which is causing the problem (not the conditional). The existence of a function call may cause the compiler not to unroll the loop, and also cause more code to be inlined (changing the inlining statistics for the enclosing function).
What you might be seeing is a difference between inlining the parent function and not. For our engine code base we do a fair amount of inline tweaking to deal with issues like this. That is the only thing I can think of which could account for the .04 sec.
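For readers wondering what that kind of inline tweaking looks like with GCC, it usually means function attributes, roughly like this sketch (the names are illustrative, not from Aptus):

/* Keep the rarely-taken progress path out of the hot loop's function: */
static __attribute__((noinline)) void
report_progress_slowpath(double fraction)
{
    /* ... the sprintf calls and the Python callback would go here ... */
    (void)fraction;
}

/* ...and force a tiny test helper to be inlined into the loop: */
static __attribute__((always_inline)) inline int
progress_due(long long total, long long last, long long delta)
{
    return total - last > delta;
}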
BTW: I tried likely and unlikely also, and looking at the assembler output, it looks like they have no effect. Again, I may not be generating the output properly.
First, the code just runs faster now, about .07 sec for a baseline. Second, the likely/unlikely macros now generate very different code, and the code runs about 4% faster with them than without them.
But we're not out of the woods yet. Code A now runs in .07 sec, but Code B runs in .09!! That is, comment out everything to do with the progress function, and the code runs slower. WTF?
My compiler-foo is weak but I looked at the source and nothing jumped out at me. The inner loop contains similar sprintf/python function call logic so unless that gets optimized out completely I would think that same logic in the outer loop wouldn't be 600 times slower.
unrelated: is the (s == 1 && self->trace_boundary) test repeated twice just for style? Not that it matters for speed, that's the kind of thing compilers optimize easily.
Would it be possible to make a version of Aptus where your c engine calculated the Mandelbrot set from left to right while at the same time a pypy engine calculated the Mandelbrot set from right to left? If c and pypy were exactly the same speed, then they would meet in the exact center of the image. If c were faster than pypy then they would meet somewhere right of center.
This would be a cool visual way to measure the speed improvements of pypy over time.
OK... Another proposal: C calculates top to bottom and pypy calculates bottom to top. Would that be a fair race?
My thought was that the two engines would calculate the pixel values but not actually draw the pixels. If this were the case then each engine could yield pixel values back to the main thread, which could draw the pixels into a common image. Something like a hybrid car with two engines under one hood.
Final proposal: (for extra credit) Four pixel value calculating engines.
- Divide the image in four quadrants.
- C engine calculates the topLeft quadrant going from top to bottom
- Pypy engine calculates the bottomLeft quadrant going from bottom to top
- Python 2 engine calculates the topRight quadrant going from top to bottom
- Python 3 engine calculates the bottomRight quadrant going from bottom to top
--- Assume that the two CPython engines will be the slowest so give them the easier right half to calculate.
--- Assume that the two CPython engines will be of equal speed but test them side-by-side to verify.
--- All engines would yield pixel values back to the main thread, which could draw the pixels into a common image.
--- The three Python engines run the same code to keep things equal.