
data locality design pattern

Started July 24, 2021 02:33 PM
18 comments, last by frob 3 years, 4 months ago

L. Spiro said:
__builtin_prefetch()

Thanks, but ugh, too low level for me. I wonder if such practices are still commonly used on consoles for example.

Reminds me of some paper where the researchers used virtual memory paging tricks to implement a sparse grid for fluid sim, replacing a tree traversal with HW-accelerated resolving of indirections. If I got that right.
Geeky, but I don't dare to go down there :D

As far as I know from ‘modern’ consoles (Sony PS4, semi-modern I guess), they have by now embedded everything useful into their SDK, from compiler atomics and intrinsics up to fine-grained thread management via fibers.


Jumping back to JoeJ's post, I think it's a bit unfortunate that it used particles as an example, since that skips over the primary sources of the evil cache misses in most CES (component entity systems like Unity). To use a very simple example of the difference, your particles might be stored as “vector<Particle>”, while a game entity is likely to be one indirection away: “vector<EntityPtr>”. I.e. vector “of” particles versus vector of “pointers” to entities. This is a core issue in many game engines which causes the inability to handle high entity counts. So, why wouldn't the entities be stored directly in the vector? There are several reasons; just a few I can name quickly: stable pointers between entities, entities being organized in hierarchies where the vector is just the “root” of a tree, entity memory usually coming from a pool to avoid new/delete and memory fragmentation, and lots of others.
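
To make that concrete, here is a minimal sketch (types and names are illustrative, not from any particular engine). Iterating the first container streams through contiguous memory; iterating the second chases a pointer per element:

#include <memory>
#include <vector>

struct Particle { float x, y, z, vx, vy, vz; };

struct Entity { float x, y, z; /* components, children, ... */ };
using EntityPtr = std::shared_ptr<Entity>;

// Contiguous storage: the next particle is always the next bytes in
// memory, so the hardware prefetcher can stream the data in.
void integrate(std::vector<Particle>& particles, float dt) {
  for (auto& p : particles) {
    p.x += p.vx * dt; p.y += p.vy * dt; p.z += p.vz * dt;
  }
}

// Pointer storage: the vector only holds pointers, and each dereference
// can land anywhere in the heap -- a potential cache miss per entity.
void apply_gravity(std::vector<EntityPtr>& entities, float dt) {
  for (auto& e : entities) {
    e->y -= 9.8f * dt;
  }
}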

Anyway, iterating through a vector of pointers to entities is the prime cause of cache issues in a game engine trying to push large entity counts. Making things worse, in the CES style each entity will also own a list of components. So, your update loop looks something like the following:

void update_entities(vector<EntityPtr>& entities) {
  for (auto& entity : entities) {
    // recurse into the children (one more level of pointer chasing)
    update_entities(entity->children());

    // update each component (another indirection per component)
    for (auto& component : entity->components()) {
      component->update();
    }
  }
}

This is where a lot of folks would start; hopefully they will optimize it, but the point here is that the memory access is random at many levels and thus problematic. As has been pointed out, CPUs are fast, really really fast; the ongoing problem is that the gap between CPU performance and memory performance continually increases (the classic processor/memory gap chart). At this point in time, memory access coherency is almost surely the #1 limiting factor in engines that simulate a lot of entities. Add more cores and things degrade rapidly due to cache line invalidation across cores/chiplets/whatever.

Anyway, thinking in terms of AOS versus SOA at this point isn't going to help much; you have work to do before that comes into play. The first thing is to remove at least some of the random access from this update system. A common starting point is to “not” update the components within the above loop. That loop would simply run through the entity containers, deleting any dead entities, inserting newly spawned entities, handling enable/disable flag changes, etc. After that completes, the engine updates components per type in tight loops. Because the components are now updated per type, they can be stored in linear memory and iterated without the big cache misses. This puts some limitations on what can be done in components and when data is available, but it's nice and consistent, so folks quickly get used to it.
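
A hedged sketch of that second phase (the component types and names here are made up): each component type lives in its own contiguous array, owned by the engine rather than by the entities, and gets updated in a tight loop:

#include <vector>

struct PhysicsComponent   { float x, y, vx, vy; };
struct AnimationComponent { float time; };

// One flat array per component type; entities hold indices into these
// arrays instead of owning the components themselves.
std::vector<PhysicsComponent> physics;
std::vector<AnimationComponent> animations;

void update_components(float dt) {
  // Tight loops over linear memory -- no per-entity pointer chasing.
  for (auto& c : physics) {
    c.x += c.vx * dt;
    c.y += c.vy * dt;
  }
  for (auto& c : animations) {
    c.time += dt;
  }
}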

If you keep reducing random access like this by flattening the update, you will eventually end up with something similar to an ECS. I literally made the step-by-step transition myself, removing more random access here and there until the resulting architecture was basically an ECS. Completing the transition would mean converting the structures from AOS to SOA organization. To be clear, most games that use a CES-style setup like the above would probably be fine without doing anything more than moving the components out to flat arrays. Most work is done in components, and having them optimized is probably going to be enough for almost any type of game that is not trying to run thousands of entities. You can even multi-thread this to a degree, either with shadow writes or by controlling which component updates run in parallel based on what data they read/write. Going to a full ECS thus may not be of any great benefit for most people.

The place where fixing the update loop breaks down, and the transition to SOA becomes more important, is when you start adding AI to the game. Most entity AI requires the ability to reason about its surroundings, i.e. looking at other entities. The simple case is for an enemy to look at the entities around it and figure out which one is closest. It needs to look at the positions of all the surrounding entities. Unfortunately, spatial coherence and memory coherence rarely have anything to do with each other, and as such entity reasoning becomes the next big challenge for memory problems. Moving the entity data to SOA is not a great help for this, but it does at least increase the chances of several entities of interest being in the same cache lines. There are other solutions that build on this, but at that point it doesn't really matter if you are using an unoptimized CES or a full-blown ECS; it's a completely different problem, and the organizing architecture is of little help here.
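
For illustration, a sketch of what that buys you (the layout here is hypothetical): with positions pulled out into parallel arrays, a “find the closest entity” scan touches only position data, so each 64-byte cache line serves 16 floats instead of one fat entity:

#include <cstddef>
#include <limits>
#include <vector>

// SOA layout: positions live apart from the rest of the entity data.
struct Positions {
  std::vector<float> x, y, z;
};

// Scans only the three position arrays; no other entity state gets
// dragged through the cache.
std::size_t closest_to(const Positions& pos, float qx, float qy, float qz) {
  std::size_t best = 0;
  float bestDistSq = std::numeric_limits<float>::max();
  for (std::size_t i = 0; i < pos.x.size(); ++i) {
    const float dx = pos.x[i] - qx;
    const float dy = pos.y[i] - qy;
    const float dz = pos.z[i] - qz;
    const float d = dx * dx + dy * dy + dz * dz;
    if (d < bestDistSq) { bestDistSq = d; best = i; }
  }
  return best;
}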

Hopefully the pedantic breakdown starting at a higher level is helpful.

perry_blueberry said:
So I believe a thread pool will in most cases bring performance boosts, and I would be a proponent of using async/await if possible. But it's just hard to see how that paradigm, let's say in C++, would work with data locality issues.

Hmmm, maybe that's one point explaining why a guy like me is a bit slow at catching up on modern language features.

I know creating threads is expensive, and I have my old-school job system which avoids this and just works.
Then I skim new language features like std::future and async stuff. But I just assume this creates new threads much more often than needed. I assume it's convenient but also slow, and I ignore the topic. And I keep writing annoying interface callbacks for my job system. Now that you mention async/await might be good in this context, I conclude I might have missed some things.

Then there were cases of compilers / platform SDKs not supporting modern features or lacking the STL. I shied away because of this too, and only years later did I assume I could finally use C++11 features without worries about portability.

Well, in recent years I worked on tools and adopted many modern features. And yes, it's really helpful, and I should have spent more time learning this earlier.

perry_blueberry said:
Not sure what you mean by eventually… I think it's weird that CS degrees cover big-O complexity, which doesn't care what the branching factor is, while it can make a big difference in practice.

I kind of agree with such CS degrees (I don't have one myself). Time complexity is what matters most. The exceptions in practice are specific to HW details, and we just pay more attention than others because we have this realtime constraint in games.
The situation would change if the technology changed, e.g. with progress on high-bandwidth, low-latency on-chip memory. I don't think the need to be so careful with memory access will go away, but who knows.

perry_blueberry said:
From CS theory it seems parallel processing should be limited by Amdahl's law and not by cache misses.

That's actually still the case in my experience. Mostly we can find a good compromise between work reduction and good memory access patterns. And those compromises then do not change time complexity, so they do not justify a preference for brute force over complexity to achieve a reduction of work. That's what I meant. E.g., if we see that iterating 5000 items is faster than a binary search, and we have 2 million items, we would just use binary search down to groups of 5000 items. And increasing the branching factor from 2 to some larger number should help even more if children are sequential in memory.
Both memory access issues and GPU brute-force power seemingly lure us into the belief that brute force is generally the way to go for games, but that's wrong imo.
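
A sketch of that hybrid (the 5000 cutoff is the made-up number from above; in practice you tune it by measurement): binary search narrows the range while it is large, then a linear scan over contiguous memory finishes the job:

#include <algorithm>
#include <cstddef>
#include <vector>

// Hybrid lookup in a sorted vector: log2(N / threshold) binary steps,
// then a prefetcher-friendly linear scan of the final block.
const int* hybrid_find(const std::vector<int>& sorted, int key) {
  const int* lo = sorted.data();
  const int* hi = lo + sorted.size();
  const std::ptrdiff_t kLinearThreshold = 5000;  // tune by profiling
  while (hi - lo > kLinearThreshold) {  // binary phase
    const int* mid = lo + (hi - lo) / 2;
    if (*mid < key) lo = mid + 1; else hi = mid;
  }
  lo = std::find(lo, hi, key);          // linear phase
  return lo != hi ? lo : nullptr;
}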

All8Up said:
I.e. vector “of” particles versus vector of “pointers” to entities. This is a core issue in many game engines which causes the inability to handle high entity counts.

Which is exactly why I have a hard time adopting the ECS religion ; )

But I did not intend to touch that with my particle example. Particles are likely implemented within their own system, and won't use a higher-level ECS thing which eases gameplay logic and overall design.

But even when working on such a lower-level particle system (let's extend to an actual fluid simulation), there are the same problems:
Particles may interact with each other or a background grid. So, to iterate sequentially, we need to bin them to such a grid and reorder them in memory in each update.
After that, we still have the problem of skipping data in our struct, and the optimal solution depends on which members the bottleneck algorithms access the most. So the optimal data structures are known only after we are done with all the implementation, which might still change over time.
Besides, I can not iterate millions of particles to find close ones, and I can not use a simple regular grid either, because it would take too much memory. I have to use some sparse grid, which needs some form of traversal and pointer chasing. Doing such a traversal per particle would be slow, so in my case I do it once for each neighbor of the current grid cell and cache those pointers; then hundreds of particles in the current cell can reuse them (a sketch follows below).
Finally, because particles move in memory over frames, I can not use their index to address them efficiently. Luckily that's no issue in this example, but it often is.

So, those particles are not a bad example. But no matter what we do, we always have those same problems of data locality and resolving indirections. And it always needs specific solutions and extra work to deal with it. Although the general ideas of those solutions are always similar, so far I have failed to build much reusable code to help with it. It's a bit annoying, but a necessary evil.
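
Roughly, the neighbor-pointer caching mentioned above looks like this (all names hypothetical; resolve_cell() and interact() are stand-ins for the actual sparse-grid traversal and interaction kernel):

#include <array>
#include <vector>

struct Particle { float x, y, z; };
struct Cell { std::vector<Particle> particles; };

// Stand-ins: the real resolve_cell() chases pointers through the
// sparse grid; the real interact() is e.g. a density/pressure kernel.
Cell* resolve_cell(int, int, int) { return nullptr; }
void interact(Particle&, const Particle&) {}

void process_cell(int cx, int cy, int cz) {
  // Do the expensive traversal once: cache pointers to the 3x3x3
  // neighborhood up front...
  std::array<Cell*, 27> neighbors;
  int n = 0;
  for (int dz = -1; dz <= 1; ++dz)
    for (int dy = -1; dy <= 1; ++dy)
      for (int dx = -1; dx <= 1; ++dx)
        neighbors[n++] = resolve_cell(cx + dx, cy + dy, cz + dz);

  // ...then the hundreds of particles in the center cell reuse the
  // cached pointers instead of traversing the grid per particle.
  Cell* center = neighbors[13];
  if (!center) return;
  for (Particle& p : center->particles)
    for (Cell* c : neighbors)
      if (c)
        for (const Particle& q : c->particles)
          interact(p, q);
}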

JoeJ said:

I know creating threads is expensive, and I have my old-school job system which avoids this and just works.
Then I skim new language features like std::future and async stuff. But I just assume this creates new threads much more often than needed. I assume it's convenient but also slow, and I ignore the topic. And I keep writing annoying interface callbacks for my job system. Now that you mention async/await might be good in this context, I conclude I might have missed some things.

I haven't looked that much into the details of the C++ async support, but I know the API looks very similar to the one in Rust, so I'm assuming they work somewhat similarly. What I know from the Rust async system is that there is a thread pool with x amount of threads, where you can manually set x. Every time you spawn some work, the library will look at its available threads and decide if it wants to put the computation on a freely available thread or interleave it on the same thread. If C++ async works similarly, it would not create new threads but stick with the thread pool, but I don't know the details, so I can't say. Without using an async framework, what I think happens is that a developer eventually develops their own in-app framework without realizing it, by starting long-running threads for blocking behaviors. One thing I think is cool about async is that it is more general than a thread model. The standard use case is to use it for blocking IO, but there is also the possibility to spawn a CPU-bound computation within the framework.
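
For what it's worth, a C++ sketch of std::async: the standard does not promise a thread pool, and as far as I know whether std::launch::async reuses threads is implementation-defined (MSVC runs these on a pool, while libstdc++/libc++ have historically spawned a thread per call), which is part of why people keep writing their own job systems:

#include <future>
#include <numeric>
#include <vector>

// Split a sum across two threads. std::launch::async guarantees the
// task runs "as if" on a new thread, but whether that thread comes
// from a pool is up to the implementation.
long long parallel_sum(const std::vector<int>& v) {
  const auto mid = v.begin() + v.size() / 2;
  std::future<long long> upper = std::async(std::launch::async, [&] {
    return std::accumulate(mid, v.end(), 0LL);
  });
  const long long lower = std::accumulate(v.begin(), mid, 0LL);
  return lower + upper.get();  // get() blocks until the task finishes
}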

JoeJ said:

That's actually still the case in my experience. Mostly we can find a good compromise between work reduction and good memory access patterns. And those compromises then do not change time complexity, so they do not justify a preference for brute force over complexity to achieve a reduction of work. That's what I meant. E.g., if we see that iterating 5000 items is faster than a binary search, and we have 2 million items, we would just use binary search down to groups of 5000 items. And increasing the branching factor from 2 to some larger number should help even more if children are sequential in memory.
Both memory access issues and GPU brute-force power seemingly lure us into the belief that brute force is generally the way to go for games, but that's wrong imo.

Ok, I misunderstood. So it's like some sort of bucket sort, where there is a threshold below which it's cheaper to go through the data linearly than via another structure. Thanks for clarifying.


perry_blueberry said:
What I know from the Rust async system is that there is a thread pool with x amount of threads, where you can manually set x.

Then I think Rust is ahead, but I'm not sure. I did read that C++ aims to add a job system, maybe in C++23 or a later standard.

But somebody more ‘up to date’ would need to help out… : )

JoeJ said:

Then I think Rust is ahead, but I'm not sure. I did read that C++ aims to add a job system, maybe in C++23 or a later standard.

But somebody more ‘up to date’ would need to help out… : )

For Rust, the interfaces are specified in the standard library, but the runtimes are implemented in user libraries, and as far as I know the APIs only became stable recently. So C++ is probably not that far behind if it's being worked on currently.

perry_blueberry said:
Yeah that sounds a bit hefty to read 3000 pages :o

If you want to get into serious optimizations, you need to pay your dues and learn exactly how the hardware does things. There is no shortcut.

In order to get into serious optimizations you need to know many things. You need to know what is going on under the hood. You need to know both the bad patterns and the good patterns, and you need to know how to rewrite code from the bad patterns into the good ones. It takes an awful lot of knowledge. Very often you'll be looking at the disassembled code, so being able to read assembly language is an essential skill.

If you follow that path, you'll end up collecting and reading many shelves full of books on the topic, plus many collections of PDFs and documents, in addition to untold numbers of technical websites.

perry_blueberry said:
I just wonder if more modern processors shouldn't be able to handle more advanced access patterns in the cache. It seems there should have been a lot of progress in the algorithms they use to detect these things.

There are many, and they're already in place. Your compiler has many decades of development by experts in the field behind it. The processors, too, have had incremental improvements in what they detect and have been improving for decades.

This is why I wrote what I did about using your profiling tools to actually measure what the performance is. Never guess:

And ALWAYS use your profiler and other tools. Never guess. Measure the actual performance before you make changes, because it may not be as bad as you think. Then measure when you're done to verify you've actually improved things.

You don't know if the systems have already identified the pattern you are using and taken steps to speed it up behind your back.

ALWAYS measure, both before and after, compare the results, and be certain you're actually improving the code in meaningful ways.
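
A minimal sketch of that measure-before-and-after workflow, using nothing but std::chrono (a real profiler will tell you far more, like cache miss counts):

#include <chrono>
#include <cstdio>

// Time a callable and return milliseconds. Run the workload before and
// after a change and compare; don't trust intuition about the cache.
template <class F>
double time_ms(F&& f) {
  const auto t0 = std::chrono::steady_clock::now();
  f();
  const auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
  const double before = time_ms([] { /* current code path */ });
  // ... apply the candidate optimization, then measure again ...
  const double after = time_ms([] { /* changed code path */ });
  std::printf("before: %.3f ms  after: %.3f ms\n", before, after);
}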

This topic is closed to new replies.
