
data locality design pattern

Started by July 24, 2021 02:33 PM
18 comments, last by frob 3 years, 4 months ago

hi.

Recently I became familiar with the data locality design pattern. The idea is that it's better to process one kind of data at a time instead of jumping between many objects at once, because it costs time and CPU cycles to fetch each piece of data into the CPU, and that you should program in a way that lets you work out of the CPU cache instead of regular memory.

It's understandable that you should program in a way that doesn't touch many scattered data addresses at the same time, but I don't know what the right way is to become more dependent on the cache. Also, engines like Unity use C#, which is completely managed, so what are all the aspects of using this design pattern?

The aspect which bugs me here is that programming languages have no features to address this, preventing optimization which should be easy but isn't. So I'm curious what practices people use.
Let's make an example of a particle system. A simple particle data structure could look like this:

struct Particle
{
    vec3 position;
    vec3 velocity;
    vec3 color;
    float age;
};

Then we implement our stuff, and it works. Could look like this:

Simulate (particles);
Render (particles);
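
(A rough sketch of what Simulate might do, assuming vec3 has the usual operators; names like dt and count are made up for the example:)

// Hypothetical AoS simulation loop: every iteration drags the whole Particle
// through the cache, even though this pass never needs color.
void Simulate(Particle* particles, int count, float dt)
{
    for (int i = 0; i < count; ++i)
    {
        particles[i].position += particles[i].velocity * dt;
        particles[i].age += dt;
    }
}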

And we see for example simulation takes longer than render, and we ask: Should we optimize our data for better access?

And we try something like this:

struct ParticlePhysicsComponents
{
    vec3 position;
    vec3 velocity;
};
struct ParticleOtherComponents
{
    vec3 color;
    float age;
};

We decided to keep age away from the most important data, because we access it only once, while we read pos and vel multiple times. So we want to optimize just for that. And it is faster so we are happy.
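
(Sketch of the split version, same made-up names as above; the point is that the hot loop only streams the data it needs:)

// Simulation now touches only position and velocity; color and age stay
// out of the cache entirely during this pass.
void Simulate(ParticlePhysicsComponents* phys, int count, float dt)
{
    for (int i = 0; i < count; ++i)
        phys[i].position += phys[i].velocity * dt;
}

// Aging becomes its own cheap pass over the cold data.
void Age(ParticleOtherComponents* other, int count, float dt)
{
    for (int i = 0; i < count; ++i)
        other[i].age += dt;
}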

But it's ugly. We already have issues with naming things, it's harder to maintain, and in real life we might want to split into 3 or 4 parts, and we may want to profile a lot of options on data layout, and eventually make different decisions later.
This sucks, so mostly we never make it to this point, and keep the first inefficient but practical approach instead.
Or we try to solve it with some nice abstraction:

class Particles
{
    struct A
    {
        vec3 position;
        vec3 velocity;
    } *a;

    struct B
    {
        vec3 color;
        float age;
    } *b;

public:
    vec3 GetPosition(int index) {return a[index].position;}
    vec3 GetVelocity(int index) {return a[index].velocity;}
    vec3 GetColor(int index) {return b[index].color;}
    // ... add getters and setters for all our data
};

After that, we can change data layout in memory without breaking code which processes it.
So we can find the best compromise between SoA and AoS, but at the cost of some extra work and complexity.
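
(E.g. an integration step written purely against the accessors, assuming we also added a SetPosition setter next to the getters; it keeps working no matter how the class stores its data internally:)

void Integrate(Particles& p, int count, float dt)
{
    for (int i = 0; i < count; ++i)
        p.SetPosition(i, p.GetPosition(i) + p.GetVelocity(i) * dt);
}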

Now I wonder: is there a better way to handle this (speaking of C++), and do we have a name for exactly this problem?
Is this also something ECS tries to handle? (I don't think so, as my example is meant to be about something very low level - only about hardware issues, not related to any ‘software design and architecture’.)

And finally, I hope I'm on topic here at all. ; )


Without knowing a specific thing you're studying, the general advice is to work in batches and keep data linear.

The CPU itself is extremely fast. It is quite hard to be compute bound for CPU work. Far more often performance is bound by data access patterns and algorithmic complexity.

“Do less better” is a mantra I've heard. Don't use 500 calls when 10 will do. Don't use 10 calls when 1 will do. Don't use a complex algorithm when a simple one gets the job done. Every bit of work you do from calling a function to manipulating data takes time, and if you don't need to do it, then don't.

Keeping data linear is all about leveraging your cache. Many algorithms ignore cache effects, and mostly the details are learned through experience and measurement. For example, when searching through a list, algorithmically the worst thing you can do is a linear search, touching every entry one at a time until you find your goal; it takes less algorithmic work to use a binary search, hopping around until you find the data, or a hash container, where you do a potentially expensive hash followed by a fast series of lookups. However, due to cache effects and the incredible speed of linear access, it is often faster to do linear searches. For example, when searching a linear list of integer IDs with simple integer comparisons, the cutoff right now is about 5000 entries. Below that, the fastest approach is a linear scan of the data. Above that, you're likely better off with a hash if you're accessing randomly, or possibly a binary search if you're likely to have hot spots in repeated searches.
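
If you want to see it for yourself, a quick and dirty test along these lines will do; the exact crossover point depends on your CPU, compiler, and data:

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <vector>

int main()
{
    std::vector<int> ids(5000);
    std::iota(ids.begin(), ids.end(), 0);   // sorted integer IDs 0..4999

    long long hits = 0;                     // printed at the end so the loops aren't optimized away
    auto t0 = std::chrono::steady_clock::now();
    for (int n = 0; n < 100000; ++n)        // linear scan, cache friendly
        hits += std::find(ids.begin(), ids.end(), n % 5000) != ids.end();
    auto t1 = std::chrono::steady_clock::now();
    for (int n = 0; n < 100000; ++n)        // binary search, fewer comparisons but jumps around
        hits += std::binary_search(ids.begin(), ids.end(), n % 5000);
    auto t2 = std::chrono::steady_clock::now();

    using us = std::chrono::microseconds;
    std::printf("hits=%lld linear=%lldus binary=%lldus\n", hits,
        (long long)std::chrono::duration_cast<us>(t1 - t0).count(),
        (long long)std::chrono::duration_cast<us>(t2 - t1).count());
}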

And ALWAYS use your profiler and other tools. Never guess. Measure the actual performance before you make changes, because it may not be as bad as you think. Then measure when you're done to verify you've actually improved things. Some profilers can detect common issues and suggest improvements for you. Data profilers can monitor cache usage and similarly help identify and correct issues. Leverage those tools, because without them you're only guessing.

@JoeJ I think https://www.youtube.com/watch?v=tGmnZdY5Y-E seems similar to what you are mentioning? It's about ECS data locality in Unity as far as I understand it (which I mostly don't).


@frob What you say sounds very interesting. I have actually never heard that a linear search would be faster than a binary search at such a high cutoff. It seems a bit counter-intuitive then that most developers use more advanced algorithms (the ones implemented in standard language libraries). Do you know of any books or talks that go into the theory of this? Or some google-terms?

perry_blueberry said:
It's about ECS data locality in Unity as far as I understand it (which I mostly don't).

I'm not ready yet to adopt the ECS religion : ) I know it is related to the problem, but it's still more a higher level programming paradigm, promising to fix issues from ‘bad OOP’ and using cache efficiency just as a bonus on top. But I wonder if modern C++ has some features for such dynamic data structures that I might have missed. (I'm only slowly catching up to C++11 and avoiding newer things for now.)

perry_blueberry said:
I have actually never heard that a linear search would be faster than a binary search at such a high cutoff. It seems a bit counter-intuitive then that most developers use more advanced algorithms (the ones implemented in standard language libraries).

It's actually a similar example, where we want to make adjustments based on hardware. To me this often means: still use algorithms like trees, but eventually have larger sets of content in the nodes, and eventually use trees with larger branching factors as well.
It's also complicated by multi threading, in my experience. E.g. you profile and decide each node can have the mentioned 5000 entries, because this was fastest in our single threaded test. But then, with more threads, we could have the problem that one core already eats almost all of the memory bandwidth by iterating 5000 entries, and it turns out that having more cache misses with binary search but using less bandwidth is a net win again.
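
(To illustrate, the kind of node layout I have in mind; the numbers are made up and would come from profiling:)

// A wide tree node: the keys of one node sit contiguously, so searching inside
// a node is a fast scan over a few cache lines, and only hopping between nodes
// costs a "real" cache miss.
struct WideNode
{
    static const int kFanout = 64;      // made-up branching factor
    int       count;                    // keys actually in use
    int       keys[kFanout];            // contiguous, scanned linearly or with binary search
    WideNode* children[kFanout + 1];    // followed only after the in-node scan
};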


perry_blueberry said:
@frob What you say sounds very interesting. I have actually never heard that a linear search would be faster than a binary search at such a high cutoff. It seems a bit counter-intuitive then that most developers use more advanced algorithms (the ones implemented in standard language libraries). Do you know of any books or talks that go into the theory of this? Or some google-terms?

There are plenty of books and research papers, but they aren't the kind most people want to read. Intel has an enormous online library about the details of chips. The last one I read completely was the Pentium 4's, split into four volumes and about 3000 pages of technical details. In the decades since, I've looked things up as needed. If you want, start at https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html

People seriously underestimate just how fast a CPU is when it is kept fed with data efficiently. For simple integer operations it can reach about 12 operations per cycle. (Some of the execution ports in the system can do multiple OOO core operations per cycle.) For simple data access within a cache line you pay by the line: use all of it and you can do 8 or 16 accesses in the same time you could do 1. Obvious linear access is easily detected, and the prefetcher completely eliminates memory access time beyond the first load. In an ideal (and rarely encountered) scenario you can run through megabytes in microseconds. Or fail at it and need to pull everything from main memory by jumping around, completely missing the L1, L2, and L3 caches, and you can only process dozens of bytes in microseconds.
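
You can demonstrate the gap with a toy loop like this; the exact numbers vary by machine, but the linear pass should win by a very large factor:

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
    const size_t N = 1 << 24;                  // 16M ints, about 64 MB, far larger than L3
    std::vector<int> data(N, 1);

    long long sum = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < N; ++i)             // linear: the prefetcher hides nearly all latency
        sum += data[i];
    auto t1 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < 16; ++i)            // strided: one int per cache line per pass,
        for (size_t j = i; j < N; j += 16)     // same total work, ~16x the cache-line traffic
            sum += data[j];
    auto t2 = std::chrono::steady_clock::now();

    using us = std::chrono::microseconds;
    std::printf("sum=%lld linear=%lldus strided=%lldus\n", sum,
        (long long)std::chrono::duration_cast<us>(t1 - t0).count(),
        (long long)std::chrono::duration_cast<us>(t2 - t1).count());
}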


JoeJ said:

I'm not ready yet to adopt the ECS religion : ) I know it is related to the problem, but it's still more a higher level programming paradigm, promising to fix issues from ‘bad OOP’ and using cache efficiency just as a bonus on top. But I wonder if modern C++ has some features for such dynamic data structures that I might have missed. (I'm only slowly catching up to C++11 and avoiding newer things for now.)

To me ECS looks (from that presentation) very similar to functional and structural programming. But from my experience trying to write both structural and functional code in an OOP language, I've realized that it's best to work with the tools and not against them. So I don't really understand why they are going for this paradigm when they have an OOP language. It would obviously be very time consuming, but I think if they want to actually go for ECS in Unity they should re-write it in a language that is better suited for it.
I would say you should definitely use the modern features of C++. If I didn't have those available I would never look at that language. :p I think they bring the language closer to a language like Rust, which is such a treat to program in.

It's actually a similar example, where we want to make adjustments based on hardware. To me this often means: still use algorithms like trees, but eventually have larger sets of content in the nodes, and eventually use trees with larger branching factors as well.
It's also complicated by multi threading, in my experience. E.g. you profile and decide each node can have the mentioned 5000 entries, because this was fastest in our single threaded test. But then, with more threads, we could have the problem that one core already eats almost all of the memory bandwidth by iterating 5000 entries, and it turns out that having more cache misses with binary search but using less bandwidth is a net win again.

Not sure what you mean by eventually… I think it's weird that CS degrees cover big-O complexity, which doesn't care about the branching factor, while in practice it can make a big difference.

From CS theory it seems parallel processing should be limited by Amdahl's law and not by cache misses. Functional languages seem to have an advantage here, since threads don't have to worry about whether other threads are changing data, as it's all immutable. I know from experiments in a course that for most problems the overhead of starting lots of threads is quite high, so I believe a thread-pool will in most cases bring a boost in performance. So I would be a proponent of using async/await if possible. But it's just hard to see how that paradigm, let's say in C++, would work with data locality issues.
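
Roughly what I imagine in C++ (just a sketch; std::async stands in for a real thread pool, and the flat array of ages is made up for the example):

#include <algorithm>
#include <future>
#include <vector>

// Each task gets a contiguous chunk, so every thread streams its own
// cache-friendly range instead of fighting over shared, scattered data.
void UpdateAges(std::vector<float>& ages, float dt, int numTasks)
{
    std::vector<std::future<void>> tasks;
    const size_t chunk = (ages.size() + numTasks - 1) / numTasks;
    for (int t = 0; t < numTasks; ++t)
    {
        const size_t begin = t * chunk;
        const size_t end = std::min(ages.size(), begin + chunk);
        tasks.push_back(std::async(std::launch::async, [&ages, begin, end, dt]
        {
            for (size_t i = begin; i < end; ++i)
                ages[i] += dt;
        }));
    }
    for (auto& f : tasks)
        f.get();   // join all tasks
}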

frob said:

There are plenty of books and research papers, but they aren't the kind most people want to read. Intel has an enormous online library about the details of chips. The last one I read completely was the Pentium 4's, split into four volumes and about 3000 pages of technical details. In the decades since, I've looked things up as needed. If you want, start at https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html

People seriously underestimate just how fast a CPU is when it is kept fed with data efficiently. For simple integer operations it can reach about 12 operations per cycle. (Some of the execution ports in the system can do multiple OOO core operations per cycle.) For simple data access within a cache line you pay by the line: use all of it and you can do 8 or 16 accesses in the same time you could do 1. Obvious linear access is easily detected, and the prefetcher completely eliminates memory access time beyond the first load. In an ideal (and rarely encountered) scenario you can run through megabytes in microseconds. Or fail at it and need to pull everything from main memory by jumping around, completely missing the L1, L2, and L3 caches, and you can only process dozens of bytes in microseconds.

Yeah, reading 3000 pages sounds a bit hefty :o
To me, when I've listened to hardware discussions (mostly from Jim Keller on YouTube), it seems that some of these things (probably things like linear access) have been trivially solved for a long time. Like they got 90% of the way very quickly. I just wonder if more modern processors shouldn't be able to handle more advanced access patterns in the cache. It seems there should have been a lot of progress in the algorithms they use to detect these things. But I'm certainly no expert at hardware… It just seems unnecessary that programmers should have to be more-or-less experts on parts of the hardware when writing something as high-level as a game.

However, I will definitely try to read up more about the cache with the things you mentioned!

Also, don't underestimate the compiler in optimizing your data. As an example, when I wrote a matrix multiplication function for our engine, the “nice and clean” way of doing it in a double loop was very slow. The “not so nice” way of doing it in just a single loop over vec4 instances brought a massive speed improvement. The trick was to help the compiler and optimizer understand that all data is accessible in a linear fashion. From there, the optimizer was clever enough to use SIMD intrinsics implicitly in order to batch process on the CPU.
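
Roughly what that looks like (simplified; the real engine types differ, vec4/mat4 here are minimal stand-ins just to make the idea concrete):

// Minimal stand-ins, just so the sketch is complete.
struct vec4 { float x, y, z, w; };
static vec4 operator*(const vec4& v, float s) { return { v.x * s, v.y * s, v.z * s, v.w * s }; }
static vec4 operator+(const vec4& a, const vec4& b) { return { a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w }; }

struct mat4 { vec4 col[4]; };

// One loop over result columns; each column of a*b is a linear combination of
// a's columns, expressed as whole-vec4 multiplies and adds that the optimizer
// can map straight to SIMD instructions.
mat4 Mul(const mat4& a, const mat4& b)
{
    mat4 r;
    for (int j = 0; j < 4; ++j)
    {
        r.col[j] = a.col[0] * b.col[j].x
                 + a.col[1] * b.col[j].y
                 + a.col[2] * b.col[j].z
                 + a.col[3] * b.col[j].w;
    }
    return r;
}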

This led me to provide data in batches whenever possible. I'm for example using a linear hash container instead of the usual ones.
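
I.e. something in the spirit of open addressing, where all slots live in one flat array. A stripped-down sketch of the idea (assumes non-negative keys and that the table never fills up):

#include <vector>

struct FlatIntSet
{
    std::vector<int> slots;                         // -1 marks an empty slot
    explicit FlatIntSet(size_t capacity) : slots(capacity, -1) {}

    void Insert(int key)
    {
        size_t i = (size_t)key % slots.size();
        while (slots[i] != -1 && slots[i] != key)   // linear probing: neighbouring slots share cache lines
            i = (i + 1) % slots.size();
        slots[i] = key;
    }

    bool Contains(int key) const
    {
        size_t i = (size_t)key % slots.size();
        while (slots[i] != -1)
        {
            if (slots[i] == key)
                return true;
            i = (i + 1) % slots.size();
        }
        return false;
    }
};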

The 101 of CPU-efficient data layout also contains the chapter on memory alignment. Most CPUs prefer aligned memory; unaligned data, even when provided as a batch, can otherwise cause a lot of extra cache accesses.
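
In C++ that mostly comes down to alignas, e.g.:

// Pin hot data to cache-line boundaries so a batch never straddles two lines,
// and keep per-thread data on separate lines to avoid false sharing.
struct alignas(64) PerThreadCounter
{
    long long value;                     // alignas(64) also rounds sizeof up to 64
};

alignas(64) float positions[4096];       // cache-line and SIMD friendly base address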

JoeJ said:
But I wonder if modern C++ has some features for such dynamic data structures that I might have missed.

Rules for caching are allowed to vary between different architectures, so it would be misplaced to have this as a feature of the language; however, your compiler should have intrinsics such as __builtin_prefetch() that you can leverage.
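
For example (GCC/Clang; the Node type and the work inside the loop are made up):

#include <cstddef>

struct Node { int key; int payload[15]; };       // one node fills a 64-byte cache line

void ProcessAll(Node* items, size_t count)
{
    for (size_t i = 0; i < count; ++i)
    {
        if (i + 8 < count)
            __builtin_prefetch(&items[i + 8]);   // hint: start loading a future node now
        items[i].payload[0] += items[i].key;     // stand-in for the real per-node work
    }
}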


L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

This topic is closed to new replies.
