Back to General and Gameplay Programming

Cache In A Multi-Core Environment

Published June 18, 2015 by Joshua Waring, posted by JoshuaWaring

Do you see issues with this article? Let us know.

Cache In A Multi-Core Environment

In my previous article I discussed the use of cache and some practices that can provide increased performance while also teaching you what cache is. I also stated that cache in a multicore environment is a whole other topic so I've written this article to cover the different considerations that come along with Multicore Programming.

Why does it matter if we're using two cores?

Cache comes in levels, typically 3 each with their own group of cores that can access it, L1 Cache is only visible to a single core with each core having it's own private cache and is the fastest of all caches. L2 Cache is usually visible to a group of cores, for instance the AMD 8150 shares L2 Cache between two cores and finally there's L3 Cache that is accessible to all cores and is the slowest of caches, but still much faster in comparison to RAM. Now that we know that there are different banks of cache for each core, what happens when two cores are accessing the same memory? If there was no system in place then both cores would cache the memory, then lets say one core wrote to that memory, This would be visible in memory, although the other core would still have it's cache of the old value. To solve this when a core writes to it's cached memory; Any other core that stores that cache line will be removed or updated, which is where our problem comes into play. Let's say you have 2 Integers on a single cache line and each core was writing to an integer each that are next to each other in an array, although they're not the same variables and it won't cause any unexpected results, because they're on the same cache line. Every time one core writes to that memory, the other core loses its cache. This is referred to as False Sharing and there's a simple solution to this, the hardest part is determining if you're having this problem.

False Sharing

False Sharing can hinder the performance for any program. For this example I'll go through the optimisations I did on a single producer single consumer queue and provide a few steps to solving most of your False Sharing problems. To test the queue I have two threads, one writing integers from 0 to 1 million and another reading them and checking if they're all in order. The queue doesn't undergo any resizing and is allocated with enough capacity for 1 million objects.


template
class alignas(64) Queue{
    T* data_;
    size_t push_position_;
    size_t pop_position_;
    std::atomic size_;
    size_t capacity_;
};

The problem with this code is that all the variables are packed together with no spacing together, the whole structure would fit on up to two cache lines. This is perfect if we're in a single core environment, although separate cores access pop_position and push_position therefor there's high contention between these cache lines in a multicore Environment. I break the problem down into a shared read section; shared write section and one section for each thread. A section may be larger than a single cache line and may require two cache lines to implement, it's for this reason I call them a section. Our first step would be to determine what memory belongs to what section. With data_ and capacity_ both being shared, but rarely written to, they therefor belong to the shared read cache line, size_ is the only variable that is a shared write and push and pop both belong to their own cache line as each thread uses one each. In this example that leaves us with 4 cache lines. This leaves us with


template
class alignas(64) Queue{
    // Producers C-Line
    size_t push_position_;
    char pad_p[64 - sizeof(size_t)];
    // Consumers C-Line
    size_t pop_position_;
    char pad_c[64 - sizeof(size_t)];
    // Shared Read C-Line
    T* data_;
    size_t capacity_;
    char pad_sr[64 - sizeof(size_t) - sizeof(T*)];
    // Shared Write C-Line
    std::atomic size_;
    char pad_sw[64 - sizeof(std::atomic)];
};

Notice that the alignas(n) this is a new keyword added in C++14. The keyword ensures that the structure is aligned to a multiple of n bytes in memory and therefor allows us to assume that our first variable will be placed at the start of a cache line, which is vital for our separation. Before accounting for False sharing, to push and pop 1 million Integers it took 60ms, but after accounting for False Sharing it's been reduced to 34ms on a Intel Core I5 3210M @ 2.5Ghz. The majority of the time comes from the atomic access, which we use to check to see if there's room to push and anything to pop. You could potentially optimise the atomic access out of most of the pushes by remembering how many objects can be pushed and popped until your next size check, this way we can lower the atomic access and dramatically improve performance again. Example Source While on the same subject of False Sharing, another example would occur when storing data within an array and having a number of threads access that array. Lets think about pool of threads which keep count how much work they've done and store it in an array. We need access to these variables to check how much work's been done while running.


	int work_done[n];

An easy mistake to make, but would result in a plethora of cache misses. As each core goes to increment it's work_done it would invalidate the other cores cache. A solution to this would be to turn the array into an array of pointers to store a pointer to a local variable inside the thread, this would require that we pass a pointer to work_done so we can populate that pointer with the address of the local variable. From a synthetic test where the worker thread is only iterating on work_done, we can see over 5 seconds of iteration across 4 cores we get a result of ~890M iterations per core while once we'd accounted for False Sharing and utilized local variables we get a result of ~1.8B iterations per core which is a ~2x improvement on the I5 3210M @ 2.5Ghz. The same test on an AMD 8150 @ 4.2Ghz reached 44M iterations with False Sharing, while without we reached 2.3B iterations which is a shocking ~52x improvement in speed, I had to double check this result because it's left me in disbelief**! In this case we use a local variable instead of padding between all the variables to save space, but both would work equally as well. Example Source

Summary

Keep classes with multicore access segmented by cache lines to eliminate False Sharing
Local variables are preferred over sharing data outside of the thread

Conclusion

False Sharing can be a problematic side affect of multicore programming which should be a consideration whenever two cores use data in proximity to one another. From these tests on an Intel 3210M we can see that by eliminating False Sharing we receive a ~2x performance boost, obviously this would differ on different hardware.

Notes

* AMD 8150 is tested on Windows 8.1 with Visual Studio 2013 and Intel 3210M is tested on OSX 10.10 LLVM 6.1.0. ** After seeing such a large difference, I went looking for a cause to such dramatic performance loss; I found that the L3 cache on the Bulldozer architecture is broken into 2MB per module (2 cores) that cannot be accessed by other modules [1]. Sharing would result in a cache miss all the way to RAM, while the I5 3210M shares it L3 Cache between all cores and would only need to go to the L3 cache in the case of a cache miss. This wouldn't be a problem if the operating system had placed the threads onto the two cores in a module. I kept running the tests with 2 threads until the result went from 44M to 1.2B per thread, assuming that in these cases the threads were placed on the same module and therefor shared L3 cache. This is a perfect example of the importance of testing your software on different hardware. [1] isscc.org/doc/2011/isscc2011.advanceprogrambooklet_abstracts.pdf pg. 40

0 Likes 8 Comments

Comments

Krypt0n

good article, but " Each core would be writing over each other’s result, because they never saw the other cores work. C++ has a solution to this assuming that you’ve prevented both cores from reading and writing at the exact same time. The volatile keyword tells the compiler ..." is not correct. the solution is called "atomics" (e.g. http://www.cplusplus.com/reference/atomic/ ) and not volatile. volatile just ensures that data is written into memory at all. ... well, it's off topic, you're right :), but that part a little bit hurts the high quality of the rest of your article, maybe you should research and correct it.

June 18, 2015 03:29 PM

JoshuaWaring

good article, but " Each core would be writing over each other’s result, because they never saw the other cores work. C++ has a solution to this assuming that you’ve prevented both cores from reading and writing at the exact same time. The volatile keyword tells the compiler ..." is not correct. the solution is called "atomics" (e.g. http://www.cplusplus.com/reference/atomic/ ) and not volatile. volatile just ensures that data is written into memory at all. ... well, it's off topic, you're right , but that part a little bit hurts the high quality of the rest of your article, maybe you should research and correct it.

I clarified the use of a volatile, what it means for the compiler and underlined the assumption that is required to use it instead of a atomic. It's a pretty big assumption I wouldn't think would apply to many things apart from Boolean flags.

June 18, 2015 03:57 PM

frob

Good article, but even with the cleanup you've done, it still needs more cleanup on the use of volatile.

If you see volatile in modern C++ code it is almost certainly wrong unless you are dealing with an unusual exotic hardware situation, like writing code for a microcontroller. The recommendation "Only use an Atomic when necessary, check if a volatile will meet your needs first" is quite incorrect.

C++'s volatile is very different from other languages like Java and C#. Their volatile is is almost equivalent to C++'s atomic<T>. In C++, the behavior of volatile is almost completely implementation defined, except that it should be left unoptimized. Some compilers add memory blocks, prevent reordering of memory access across the boundaries, or have other guarantees, but other compilers don't. There was a time when volatile was the preferred method for a few specific compilers for restricted access, but those days are past. Today the primary uses of C++ volatile are (1) unusual hardware values that change behind your back, like a sensor that operates by setting a memory value, (2) unusual hardware that modifies your values after write, such as hardware that automatically clamps values within a range, and (3) unusual memory that has multiple addresses for the same object, which was once common but modern flat memory models have nearly eliminated.

The correct tool here is atomic<T>, not volatile. If you have one of those exotic and unusual requirements then the tool is volatile atomic<T> unless your hardware compiler documentation tells you otherwise.

Otherwise, good article. It also suggests the need for a reference to profiling and cache analysis tools.

June 18, 2015 07:14 PM

tivolo

I'm with frob on this one. I have seen far too many abuses of volatile in the past, which horribly failed time and time again.

volatile in C++ was never intended for it to have anything to do with threads. You only ever need it to talk to hardware, e.g. a fixed piece of hardware sitting at a certain address that you're writing to, so you need to use volatile in order to ensure that the compiler won't optimize away certain write accesses. The write-gather pipe in the Wii is such an example.

Furthermore, when using MSVC as a compiler, wrong code often ends up working seemingly fine, because MSVC gives stronger guarantees about what volatile means - guarantees you won't get from other compilers.

Bruce Dawson has some good advice on this subject:

https://msdn.microsoft.com/en-us/library/windows/desktop/ee418650(v=vs.85).aspx

Check the chapter about "Volatile Variables and Reordering".

I also wrote about this in the past on my blog:

https://molecularmusings.wordpress.com/2012/03/05/volatile-thread-synchronization/

June 18, 2015 10:00 PM

JoshuaWaring

I've removed the paragraph about Volatiles, I was hesitant to put them in at first because they're not exactly cache related, but I felt it needed something more than just False Sharing.

For those interested I ran a benchmark and found that there's no real reason to use a Volatile as a flag. The test would read/write 1 million unsigned integers and take the high_resolution_clock time before and after.

Results:

Atomic Reads: 371776

Volatile Reads: 349741

Atomic Writes: 7860217

Volatile Writes: 345923

June 19, 2015 02:37 AM

frob

Maybe on that particular compiler it didn't make a difference, but since volatile's behavior is almost completely implementation defined, that is not always true. Atomic on some platforms has a higher cost where CPU synchronization must take place, and volatile on some compilers means avoiding major optimizations, such as being forced to take expensive and otherwise unnecessary load/store operations after every use.

Either way, the article looks better now without that section. It could still benefit from links to cache analysis tools and profiling, but that's less critical than the updates already done.

June 19, 2015 04:30 PM

JoshuaWaring

@frob I think you've just picked the topic of the next article.

June 20, 2015 01:03 AM

Brain

Today the primary uses of C++ volatile are (1) unusual hardware values that change behind your back, like a sensor that operates by setting a memory value, (2) unusual hardware that modifies your values after write, such as hardware that automatically clamps values within a range, and (3) unusual memory that has multiple addresses for the same object, which was once common but modern flat memory models have nearly eliminated.

One place where you will see volatile used in C++ is in operating systems development, most of the memory mapped registers and ranges used by drivers are volatile for this reason as the might read or write those values but hardware can set them too. This falls under your use case (1) about values changing behind your back, but i wouldn't class PC hardware as particularly unusual, just not a known use case for most application developers who deal in higher level abstractions.

June 22, 2015 11:58 AM

You must log in to join the conversation.

Don't have a GameDev.net account? Sign up!

Cache In A Multi-Core Environment

Cache In A Multi-Core Environment

Why does it matter if we're using two cores?

False Sharing

Summary

Conclusion

Notes

Comments

Recommended Tutorials

Other Tutorials by JoshuaWaring

Cache In A Multi-Core Environment

Cache In A Multi-Core Environment

Why does it matter if we're using two cores?

False Sharing

Summary

Conclusion

Notes

Comments

Recommended Tutorials

Other Tutorials by JoshuaWaring

Reticulating splines