CPU Cache And How To Work For It

Published June 16, 2015 by Joshua Waring, posted by JoshuaWaring

Why Cache Matters

Simply put, Cache is a small amount of memory located on your CPU that is used to reduce latencies to main memory (RAM). We have no direct control over the Cache, but programming and storing data in a cache-friendly manner is imperative for performance improvements and, in turn, a reduction in power consumption.

Without realising it, you may have already used a form of caching: when reading a file from a HDD, you load it into RAM, manipulate it there and then write the changes back, effectively caching the file in RAM. Although this example is slightly different and technically a buffer, the similarities are present.

How Cache Works

When your CPU requests a piece of memory, it goes out to RAM to fetch it. During that time the CPU waits while the request travels through the Memory Controller, off the CPU chip and out to RAM, is served and returned. This is wasted compute time and therefore a reduction in power efficiency.

The Cache sits on this line of communication and listens in on requests to RAM. If it already has the memory, it stops the request and responds with what it has; this is referred to as a Cache Hit. If it's unable to serve the request because it doesn't have the requested memory, it's a Cache Miss.

There are systems in place to ensure that when writing to memory, both the Cache and RAM are kept consistent, but I'll save those for another article, as they matter most when working with a multithreaded application.

The Part We Can Assist With

The Cache is always trying to guess what memory you'll need before you request it; this prediction is called Cache Prefetching. This is why, when working on an array, it's best to go through it in sequential order instead of jumping through it randomly: the Cache Prefetcher can guess what you're using and have it ready before you need it.

The Cache loads memory in fixed-size groups called Cache Lines. The size is CPU-dependent and can be checked in your CPU's specification under Cache Line size, although it's typically 64 bytes. This means that if you have an array of integers and you grab one of those integers, the Cache has also grabbed the entire Cache Line it sits on.

Grabbing the integer stored next to it will then be a Cache Hit and therefore extremely fast. Cache Lines are always aligned to a multiple of the Cache Line size, meaning that if you fetch memory at 0x00 (0), everything from 0x00 (0) up to 0x40 (64) is cached, and if you fetch something at 0x4F (79), you get everything from 0x40 (64) up to 0x80 (128).
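
As a quick illustration of that alignment rule, here is a minimal sketch assuming a 64-byte Cache Line; the CACHE_LINE_SIZE constant and cacheLineStart helper are purely for demonstration and not from the article's benchmarks.

#include <cstdint>
#include <iostream>

const std::uintptr_t CACHE_LINE_SIZE = 64; // assumed; check your CPU's specification

// Rounds an address down to the start of the Cache Line it sits on.
std::uintptr_t cacheLineStart(std::uintptr_t address){
	return address & ~(CACHE_LINE_SIZE - 1);
}

int main(){
	std::cout << cacheLineStart(0x00) << "\n"; // 0  - the line covering 0x00 to 0x40
	std::cout << cacheLineStart(0x4F) << "\n"; // 64 - the line covering 0x40 to 0x80
	return 0;
}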

Use Case

Arrays store their data sequentially, so they're ideal for cache-friendly access. In the first loop below, we walk down a column before moving to the next column; in the second loop, we walk across a row before moving to the next row. Each dimension of the array is 10,000 integers.

 for (unsigned x = 0; x < ARRAY_SIZE; ++x) 
 	for (unsigned y = 0; y < ARRAY_SIZE; ++y) 
 		total += array_[ y ][ x ]; 
 		
 for (unsigned x = 0; x < ARRAY_SIZE; ++x) 
 	for (unsigned y = 0; y < ARRAY_SIZE; ++y) 
 		total += array_[ x ][ y ]; 

The only difference between these loops is indexing with array_[y][x] versus array_[x][y], but the difference in results is extreme.

On my AMD 8150, the first loop finished in 1,501ms while the second took only 281ms, a roughly 5x speed increase, simply because of the Prefetcher-friendly access. Although the first loop may look sequential, its accesses are actually extremely fragmented.

A 2D array stores its data in one large block, with each row laid out immediately after the previous one (Row-Major order).
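
As a rough sketch of that layout (WIDTH and flatIndex are illustrative names, not part of the benchmark below):

#include <iostream>

const unsigned WIDTH = 10000;

// In Row-Major order, element (row, column) lives at offset row * WIDTH + column
// from the start of the block.
unsigned flatIndex(unsigned row, unsigned column){
	return row * WIDTH + column;
}

int main(){
	std::cout << flatIndex(0, 1) << "\n"; // 1     - the very next integer in memory
	std::cout << flatIndex(1, 0) << "\n"; // 10000 - a whole row away
	return 0;
}

Note that the benchmark below allocates each row with its own new[], so rows are contiguous internally but not guaranteed to sit next to one another; the effect of the access pattern within each row is the same.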

The diagram below the source shows the access patterns of the two loops for a small 3x3 array, with the elements numbered 1-9 in the order they sit in memory. The first loop accesses its data in a predictable way, but not one the Prefetcher can prefetch for, while the second loop accesses its memory sequentially, which allows the CPU to prefetch. Example source:

#include <chrono>
#include <iostream>
#include <thread>

const unsigned ARRAY_SIZE = 10000;

int main(){
	// Allocate two 10,000 x 10,000 arrays of unsigned integers,
	// one separately-allocated, zero-initialised row at a time.
	unsigned** array1_ = new unsigned*[ARRAY_SIZE];
	unsigned** array2_ = new unsigned*[ARRAY_SIZE];
	for (unsigned x = 0; x < ARRAY_SIZE; ++x){
		array1_[x] = new unsigned[ARRAY_SIZE]();
		array2_[x] = new unsigned[ARRAY_SIZE]();
	}
	unsigned long total = 0;

	// First loop: column-first access ([y][x]) - each step jumps a whole row ahead.
	auto start = std::chrono::high_resolution_clock::now();

	for (unsigned x = 0; x < ARRAY_SIZE; ++x)
		for (unsigned y = 0; y < ARRAY_SIZE; ++y)
			total += array1_[y][x];

	auto end = std::chrono::high_resolution_clock::now();

	// Second loop: row-first access ([x][y]) - walks memory sequentially.
	auto start2 = std::chrono::high_resolution_clock::now();

	for (unsigned x = 0; x < ARRAY_SIZE; ++x)
		for (unsigned y = 0; y < ARRAY_SIZE; ++y)
			total += array2_[x][y];

	auto end2 = std::chrono::high_resolution_clock::now();

	for (unsigned x = 0; x < ARRAY_SIZE; ++x){
		delete[] array1_[x];
		delete[] array2_[x];
	}
	delete[] array1_;
	delete[] array2_;

	std::cout << "First Loop duration : " << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << "ms \n";
	std::cout << "Second Loop duration : " << std::chrono::duration_cast<std::chrono::milliseconds>(end2 - start2).count() << "ms \n";
	std::cout << "Total : " << total << "\n"; // use the sum so the loops can't be optimised away
	std::this_thread::sleep_for(std::chrono::seconds(10)); // keep the console window open

	return 0;
}

(Access number)            1 2 3 4 5 6 7 8 9
(Loop 1 - element touched) 1 4 7 2 5 8 3 6 9
(Loop 2 - element touched) 1 2 3 4 5 6 7 8 9

One more example of an optimisation is the layout of a class. Don't expect the compiler to reorder members for you: C and C++ require members (within the same access section) to be laid out in declaration order, so the order you write is the order you get.

struct Foo
{
	bool value1;
	char data[64];
	bool value2;
};

Let's say we want to check value1 and value2 before accessing any data inside the array. Let's also assume that Foo is stored at 0x00 for simplicity.

When we access value1, we'll have a Cache Miss, as it's our first time accessing this piece of memory, but now we've got that line stored for us. We then check value2, which will also result in a Cache Miss, because data has polluted our Cache Line and pushed value2 onto the next line, meaning we now have to wait twice.
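
To make those offsets concrete, here is a small check that repeats the Foo definition from above; the exact numbers assume the typical case of no padding between these one-byte-aligned members.

#include <cstddef>
#include <iostream>

struct Foo
{
	bool value1;
	char data[64];
	bool value2;
};

int main(){
	// On typical compilers there is no padding here, so value1 sits at offset 0
	// and value2 at offset 65 - past the first 64-byte Cache Line
	// (assuming Foo starts at the beginning of a line).
	std::cout << "value1 offset: " << offsetof(Foo, value1) << "\n"; // 0
	std::cout << "value2 offset: " << offsetof(Foo, value2) << "\n"; // 65
	return 0;
}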

An alternative would be to store value1 and value2 next to each other so that we only have the first Cache Miss and a Cache Hit for the second. Assuming we don't access data[62] or beyond, we won't have to deal with another Cache Miss. Optimally, we could store value1 and value2 in the same byte, because we only need one bit to store a Boolean and could technically fit eight Booleans into a single byte.
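
As a minimal sketch of that packing idea (the PackedFlags name and bit positions are illustrative, not part of the benchmark below):

#include <cstdint>
#include <iostream>

// Up to eight Booleans packed into a single byte, one bit each.
struct PackedFlags{
	std::uint8_t bits;

	bool get(unsigned index) const { return (bits >> index) & 1; }
	void set(unsigned index, bool value){
		if (value) bits |= (1 << index);
		else       bits &= ~(1 << index);
	}
};

int main(){
	PackedFlags flags = { 0 };
	flags.set(0, true);  // value1
	flags.set(1, false); // value2
	std::cout << "value1: " << flags.get(0) << " value2: " << flags.get(1) << "\n";
	std::cout << "sizeof(PackedFlags): " << sizeof(PackedFlags) << "\n"; // 1 byte
	return 0;
}

C++ bit-fields (e.g. bool value1 : 1;) express the same idea and let the compiler generate the masking for you.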

If we performed this optimisation and used bitwise shifting to access each bit, we could squeeze another byte of the array into our Cache Line. Example source:

#include <chrono>
#include <iostream>
#include <thread>

const unsigned ARRAY_SIZE = 5000000;

// value2 sits after the 64-byte data block, 65 bytes from value1.
struct Foo{
	bool value1;
	char data[64];
	bool value2;
};

// value1 and value2 sit next to each other, ahead of the data block.
struct Foo2{
	bool value1;
	bool value2;
	char data[64];
};

int main(){
	// Value-initialise the arrays so the Booleans hold defined values.
	Foo* array1_ = new Foo[ARRAY_SIZE]();
	Foo2* array2_ = new Foo2[ARRAY_SIZE]();
	unsigned long total = 0;

	// First loop: value1 and value2 of each Foo are 65 bytes apart,
	// so they always land on different Cache Lines.
	auto start = std::chrono::high_resolution_clock::now();

	for (unsigned x = 0; x < ARRAY_SIZE / 2; ++x)
		total += array1_[x * 2].value1 & array1_[x * 2].value2;

	auto end = std::chrono::high_resolution_clock::now();

	// Second loop: both Booleans of each Foo2 are adjacent,
	// so they arrive on the same Cache Line.
	auto start2 = std::chrono::high_resolution_clock::now();

	for (unsigned x = 0; x < ARRAY_SIZE / 2; ++x)
		total += array2_[x * 2].value1 & array2_[x * 2].value2;

	auto end2 = std::chrono::high_resolution_clock::now();

	delete[] array1_;
	delete[] array2_;

	std::cout << "First Loop duration : " << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << "ms \n";
	std::cout << "Second Loop duration : " << std::chrono::duration_cast<std::chrono::milliseconds>(end2 - start2).count() << "ms \n";
	std::cout << "Total : " << total << "\n"; // use the sum so the loops can't be optimised away
	std::this_thread::sleep_for(std::chrono::seconds(10)); // keep the console window open

	return 0;
}

From this, on the AMD 8150, we received 47ms for the first loop and 31ms for the second, a 1.5x performance increase. Notice that we multiply the array index by 2 to break up sequential access, because otherwise there would be no performance difference between the two loops.

If we always fetch 64 bytes, then we're bound to fetch more than we need for value2 and therefore cache part of the next object in the array unintentionally. Another interesting behaviour, which shows the work of the Prefetcher: if we access the array sequentially, so the CPU can prefetch, it takes only 15ms to go through the entire array instead of 31ms to go through half the array 'randomly'.

Summary

  • Optimally, you want objects to be as small as possible so that as many as possible fit in a Cache Line. Two shorts fit in the space of one integer, so a 64-byte Cache Line holds 16 integers, or 32 shorts.
  • Consider a pointer: in the worst case, fetching the pointer itself is one Cache Miss, and then fetching the object it points to is a second.
  • Keep objects smaller than the Cache Line, or sized to a multiple of it, to prevent a single object spreading over two Cache Lines.
  • C++11 added the alignas keyword, which makes sure objects are aligned in memory to the value passed; this can be beneficial in stopping an object being split across two Cache Lines (see the sketch below).
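
As a quick sketch of alignas, assuming a 64-byte Cache Line (the Particle struct is just an illustrative example, not code from the article):

#include <iostream>

// alignas(64) places every Particle on a 64-byte boundary, so a single
// Particle never straddles two Cache Lines (assuming 64-byte lines).
struct alignas(64) Particle{
	float position[3];
	float velocity[3];
};

int main(){
	static_assert(alignof(Particle) == 64, "Particle should be aligned to a Cache Line");
	std::cout << "sizeof(Particle): " << sizeof(Particle) << "\n"; // padded up to 64
	return 0;
}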

Conclusion

Cache is more important than ever with the extreme speed of modern CPUs and relatively slow RAM. Great improvements in speed or power efficiency can be gained from utilising it correctly, which is always welcome in the world of game creation.

For a deeper look into how you can change your design practices for better Cache utilisation, I'd highly recommend watching this video: https://vimeo.com/97337258 - Scott Meyers - Norwegian Developers Conference

Article Update Log

23 March 2015: Initial release
7 April 2015 : Added source examples


Comments

alnite

This has potential, but I'd like to see more use-cases. Benchmarking screenshots of the first use-case (the loop) would help too, to show that you did get 1500ms vs 57ms.

What is 1 2 3 4 5 6 7 8 9? You need to say that ARRAY_SIZE = 3. A diagram of how the array would be laid out in memory would help people understand more about how the cache works. Why is the first form better compared to the second one?

The second example needs a more thorough explanation. Why does storing value1 and value2 next to each other result in a cache miss then a cache hit? Maybe an explanation of how the CPU cache works? Or perhaps a link to an AMD/Intel page that describes the technical details?

As previously noted, perhaps one or two more use-cases, a more thorough explanation of each use-case, and "before" vs "after" benchmarking results would tremendously help this.

April 06, 2015 06:00 PM
crazeejeeves

The topic is excellent and rarely touched upon, so thanks for posting this.

I agree with alnite that a few additional use cases would be good to further flesh this out. My own challenge in understanding the difficulties of managing cache hits/misses pertains to more complex scenarios. Defining small data structures and controlling loops seem simple enough, but most products I've worked on rarely fit the synthetic examples that are commonly included in discussions on this topic, yours included.

Pointers especially introduce this headache. How then can the cache be managed in something like a component-based architecture that has entities defined by a collection of shared but non-contiguous elements in memory? I understand that the performance gains are proportional to the number of elements accessed per job unit, but what if a complex data structure like that is actually being accessed regularly per frame? What are my options?

April 07, 2015 10:54 AM
JoshuaWaring


I'll add this to the article, but a good practice for Component-Based design for cache is to store all components of a single type in a single sequential block of memory, instead of allocating room for each component separately on the heap. Somewhere there's an example of someone who did this and saw major improvements, yet again simply from storing components next to each other.

Edit: (http://gameprogrammingpatterns.com/data-locality.html) This isn't the original page where I read this, and after skimming I don't think it actually states how much faster it was, but from memory I believe it was at least 20x.
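
A rough sketch of that idea (Position and PositionSystem are illustrative names, not taken from the linked page):

#include <cstddef>
#include <vector>

struct Position { float x, y, z; };

// All components of a single type stored in one contiguous block: iterating
// over them is sequential, prefetcher-friendly access, instead of chasing a
// separate heap allocation (and a likely Cache Miss) per entity.
struct PositionSystem {
	std::vector<Position> positions;

	void update(float dx) {
		for (std::size_t i = 0; i < positions.size(); ++i)
			positions[i].x += dx;
	}
};

int main() {
	PositionSystem system;
	system.positions.resize(1000);
	system.update(0.1f);
	return 0;
}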

April 07, 2015 11:03 AM
crazeejeeves

Edit: (http://gameprogrammingpatterns.com/data-locality.html) This isn't the original page where I read this, and after skimming I don't think it actually states how much faster it was, but from memory I believe it was at least 20x.

That's a gem of a link :) Thanks.

April 08, 2015 03:23 AM
Brain
I would like to see further use cases too, maybe some stuff on alignment declarations for structs and classes, and also how alignment is important for various types of special values such as SSE intrinsics (XMMATRIX and friends). It would also be good to see a section on where it is important to not align structs and how to do it, e.g. #pragma pack for network-transmitted and disk-stored structs. Apart from that, a good article with promise. Looking forward to future revisions!
April 08, 2015 06:02 PM
ChrisChiu

This essentially falls under the term of "data oriented programming" (or "cache aware programming").

Excellent article series about this:

https://molecularmusings.wordpress.com/2011/11/03/adventures-in-data-oriented-design-part-1-mesh-data-3/

April 09, 2015 07:36 AM
SeanMiddleditch

One more example of an optimisation is the layout of a class, assuming that your compiler doesn’t optimise this for you.


Your compiler will never optimize the given example because it's illegal to do so. C mandates that compilers not move aggregate members around and also requires that the implementation provide consistent rules about padding and alignment and then always apply those (sans any non-standard extensions like #pragma pack).

C++ does allow a limited form of aggregate member reordering only for members in different member access sections. No compiler I've ever seen takes advantage of this, though.
June 29, 2015 07:24 AM

