Advertisement

Benchmark for Search Algorithms in Common Games Development Data

Started by May 01, 2019 11:12 AM
14 comments, last by lawnjelly 5 years, 6 months ago

I'm writing a benchmark for on data-oriented search algorithms and data structures in full details when there is a high need for efficient searching each frame and about what to choose depending on

  1. How cache friendly it can be
  2. Algorithm's order
  3. Insertion/Deletion frequency (how often data needs to change each frame)

I have these Data Structures in mind right now

  • Array (linear)
  • Binary search trees
  • B-trees
  • Red Black Trees
  • Hash Tables
  • Bloom Filters

I'm looking for common data used in games or graphics programming that is frequently modified and searched for to complete my benchmark.

Data-Oriented Book uses Simple Toy Animation Data to benchmark linear search and binary search but I believe there's some data more challenging.

 

What's the benefit you're looking for?

Are you looking for a personal benefit? Or are you looking to try to provide a resource for others?

If you're learning for your own benefit then go for it. But if you're trying to make a resource for others, unless you've got a pile of credentials that I'm not seeing, please don't clutter up the resources with non-rigorous publications.

All of those data structures are heavily studied in computer science, and except for bloom filters which are more recent, have been core CS topics since the 1970s or earlier. Cache effects have been studied since CPU caches were first introduced.  Many student researchers have major errors, such as not knowing how to write proper benchmarking utilities, failing to take into account the actual machine code, and failing to properly understand the hardware involved, but professional researches have published solid numbers and precise academic research for approaching five decades since caches were first introduced.

Cache effectiveness is highly dependent on specific implementation details of the hardware, and specific implementation details of the algorithm, and specific implementation details of the data structure, and specific implementation details of the compiler and optimizer.

Changing any of those highly specific implementation details can completely invalidate the results. A structure changing size, or changing content, or changing packing, or changing the instruction ordering, or hoisting an access outside a loop, or unrolling a loop, or changing the register coloring rules, all can cause a shuffle that invalidates cache efficiency results.

Two different CPUs have different cache invalidation rules, coherency rules, and consistency rules, they have different associativity rules, have different prediction and prefetch rules, and more. Data that is highly cacheable on one CPU may struggle to be cached at all on a different CPU. Even if you look at a single processor family, say the current Intel Coffee Lake series, switching from an Intel 8300 to an Intel 8400 can produce radically different cache effects.

 

 

Advertisement

Thank you very much @frob.

You're right and I know most of the things you pointed out.

It's mostly personal for me but I'll definitely talk to other more professional than me to and consult them about publishing it, if I wanted to.

You're right but there is a finite set of hardware I'm targeting and as always the most important aspect of any program is hardware since it's the actual platform.

In my mind the first try is MSVC and yes there are difference between compilers and optimizations they do.

I will definitely test GCC, Clang and other compilers and thanks for pointing it out.

Now I just want a challenging practical data to work with and not some toy data I come up with.

Thanks 

I guess what I'm trying to say is that a general benchmark is near-useless even if you pulled examples from live running games.

This isn't about what data set you use, it is because in the real world the actual situation is so complex and dependent on so many factors that anything general purpose is meaningless.

You could publish something like: this specific code on this specific CPU compiled with this specific compiler and this specific set of options and this specific data set, we saw this result. 

Even seemingly small compiler options make a difference. Changing the COMDAT structure merge compiler optimization, or toggling string pooling, or changing the generation of stack probes, or the presence/absence of frame pointers, changing any of them can give radically different cache use results.

Changing CPUs, even to a neighboring CPU in the CPU family can change it.  Your results on Intel 8300 chip are going to be different from the results on an Intel 8400 chips. Cache sizes are different, prediction algorithms are different, and cache coherency and consistency rules are different between CPUs.

Even changing how frequently you do operations makes a difference.  Perhaps one task if you do it every graphics frame you see different results from if you do the operation every other graphics frame.  There are many cache systems, and different things can evict or persist data in a cache. It isn't just L1, L2, and L3 cache sizes, but also the branch prediction cache and branch table buffers, the prefetch cache, and also data that happened to not be evicted yet from previous runs can all make a difference.

The theoretical maximums and typical ideals are already well-covered in literature. The specific benchmarks are so closely related to actual use patterns that they're meaningless to anyone else. I'm not sure what you are hoping to add.

Well now I know what I'm going to do thanks to you.

I'm hoping to practice and challenge my self to optimize algorithms on my CPU and try to understand the modern hardware better than before and try on different CPUs and Compilers to see the differences and learn as much as possible.

Thank you for setting my mindset.

And do you believe that in the range of today's modern CPUs, if an algorithm is better than the other, it also performs better on other similar hardwares?

Or do you believe it's possible the results can be reversed changing from 8300 to 8400 (same Cache Line size)?

 

 

Judging from what frob said so far, I would say, yes it's possible.

There is no algorithm that outperforms all others in every case, I don't see why such an algorithm would exist if you change hardware, given that pretty much any change in setup can have dramatically different results. Likely different algorithms are affected in a different way, so a bad algorithm at system A, may perform better at system B than the superior algorithm at system A, since the latter got hit hard by differences in hardware or setup.

I do hope you realize that exploiting hardware to the limits is generally not going to win the war right? You win wars by having effective algorithms doing the work.

 

Advertisement
2 hours ago, Alberth said:

There is no algorithm that outperforms all others in every case

100% Agreed. That's why I'm looking for a specific case and data to work with which the results can't prove anything for other cases but with what @frob said the problem is there are different compilers and hardware. My Question is does that make much difference in modern CPU's and not worth doing the benchmark?

Quote

I do hope you realize that exploiting hardware to the limits is generally not going to win the war right? You win wars by having effective algorithms doing the work.

0% Agreed Since our program runs on hardware It's definitely worth knowing modern CPU's and using them the best way we can.

Here is a quote from dod book (which I recommend reading if you haven't)

Quote

 

you can see it uses a linear search instead of a binary search, and yet it still manages to make the original binary search look slow by comparison, and we must assume, as with most things on modern machines, it is because the path the code is taking is using the resources better, rather than being better in a theoretical way, or using fewer instructions.



i5-4430 @ 3.00GHz
Average 13.71ms [Full anim key - linear search]
Average 11.13ms [Full anim key - binary search]
Average  8.23ms [Data only key - linear search]
Average  7.79ms [Data only key - binary search]
Average  1.63ms [Pre-indexed - binary search]
Average  1.45ms [Pre-indexed - linear search]

 



 

With this in mind, I know there is no general solution to everything that would be the opposite of my beliefs which is data-oriented, BUT there are some aspects of modern hardware that are very similar and we can use the best from It.

 

Honestly, with what @frob said so far which is really helpful. I'm in doubt to continue this project for my learning purposes and I don't think(doubt) it's useless and not worth looking at when it's done

 



 
4 hours ago, ahmadierfan99 said:

And do you believe that in the range of today's modern CPUs, if an algorithm is better than the other, it also performs better on other similar hardwares?

It gets complicated really quickly.

At the most basic level, start with the general timings every programmer should learn, although they're a little old. Intel and AMD publish timings for every now chip.

At a very rough and imprecise level, on a Coffee Lake 5GHz machine accessing a value in a register takes about 0.20 nanoseconds, hitting L1 cache takes about 1.2-2.1 ns, hitting L2 cache takes about 3.0-5.3 ns, hitting an unshared L3 cache line takes 12.0-21.4 ns, a shared clean L3 line is 19.5-34.8 ns, a shared dirty L3 line is 22.5-40.2 ns, and a remote L3 line is 30.0-160.7 ns.  Accessing out to DRAM depends on the memory chips, the motherboard, bios settings, and more, but reaching down to about 60 ns for low latency chips, around 100ns for common chips, and longer for the cheap stuff.  Branching in the CPU takes time, but navigating the CPU pipeline can get complicated, too.  Near the ideal you can hit about 4 CPU instructions per CPU cycle. The out of order core means you can potentially have multiple operations in flight at once, so even though there is memory latency you can have multiple in progress and have different limits.

Start with those, and do a logical exercise about prediction, caching, and access times.

Let's take an array of 10,000 integers.  That's very near the modern cut-off for several of those algorithms.

A linear search of simple integers will be perfectly predicted within just a few instructions, so it will quickly approach ideal. Assuming the CPU is perfectly fed with data in a tight loop, it can approach 0.1 nanosecond per integer. If you use SIMD instructions, it can reach even lower numbers, but we won't address them.  A linear search hits half on average, so the average time will be around 500 nanoseconds.  That will be about 500 nanoseconds no matter the configuration, warming the cache would be close to consistent across them all.

Now let's consider a binary search. The loop may be in the instruction cache, but for large data every item will trigger a cache miss. There will be about 14 comparisons, each is a cache miss to either L3 or main memory; in the best case on the highly clocked machine 14 L3 lookups is about 168 ns, if lines are shared perhaps 487 ns, if it has to go out to main memory perhaps 1400 ns.  So there is enormous variability, anywhere from about 170 ns to 1400 ns depending on the state of the machine.  

Further complicating it, between runs probably the first few runs will enter L1 cache since they're hit multiple times, but others will need lookup.

So without any context, that same binary search might be roughly 1/4 the time of a linear search, or it might be 3x the time of a linear search, or anywhere in between.

Change to a CPU with a bigger L1 or L2 cache that contains the entire collection, and the tradeoff changes again. If everything fits in the L2 cache instead of taking 160 to 1400 nanoseconds, it plummets to about 50 nanoseconds. Now it's 1/10 of the time of a linear search, much faster than the alternative.

Red-Black tree searches are a little more complex, the tree depth is shallower and they rely on multiple pieces of information depending on the color of the node. With a degree of luck hitting the value will also fetch the node color, and the cost is the same as a regular binary search. There is potentially a little more work, so we'll say it is at least the time of a binary search and potentially somewhat more. Still, in the range of a quarter the time of linear up through three times the time of a linear search, or much faster if it all fits in L2 cache, all depending on external factors.

Hash tables have the cost of computing the hash, then the cost of a memory lookup, then the check for the collision. There's the cost of fetching the item which we'll call 100 nanoseconds in main memory or 30 ns in L3 cache, plus the cost of computing the hash.   The time to compute the hash will probably be lower than 400 ns, so the two times added together will probably be lower than the 500 ns average time for a linear search.

You can do similar thought experiments with each data structure. Change the sizes and different values make sense. For an arrays under about 5000 items, a linear search is generally fasted on average.  Other algorithms become better as the list grows, with hash tables quickly emerging as the best alternative. 

Next, if you're doing something other than integers,  you change everything.  If comparison calls a function you have function call overhead for each comparison, making a linear search far more expensive; instead of virtually free inside the OOO Core at around 0.1 nanoseconds, they reach about 5-10 nanoseconds for the function call. So instead of testing 5000 items on average at 0.1 ns, you're testing 5000 items on average at around 7 nanoseconds. And the 14 lookups also add another 7 nanoseconds or so each for their function calls. Suddenly the balance point where one or the other is better for a binary search drops to around 500 items in the array.

 

And with all of that, searches generally aren't the big bottlenecks of modern games.  They contribute to the processing time, but they aren't the big bottlenecks. 

Quote

So without any context, that same binary search might be roughly 1/4 the time of a linear search, or it might be 3x the time of a linear search, or anywhere in between.

That's really amazing and I didn't think about it that way. Although in Games you know how big your data is going to be and you know the hardwares it's going to run on.

Quote

And with all of that, searches generally aren't the big bottlenecks of modern games.  They contribute to the processing time, but they aren't the big bottlenecks. 

That's why It's more than a hobby project for me to learn more.

There is a lot to learn from CPU's architectures and Assembly Language to Operating Systems and Compilers and a lot of other things.

Thank you, It's been really helpful knowing there is a lot more about what I don't know and need to know before starting these kinds of projects and tests.

Are there any other resources and practices you suggest to help me reason about the hardware better?

1 hour ago, ahmadierfan99 said:

Although in Games you know how big your data is going to be and you know the hardwares it's going to run on.

Neither of these are always (or even often) true. Not knowing the hardware is rather common now with multiplatform being the norm.

1 hour ago, ahmadierfan99 said:

Are there any other resources and practices you suggest to help me reason about the hardware better?

Hardware varies a lot and it is subject to change. Trying to make broad generalisations may have worked maybe 35 years ago but modern CPUs do all kinds of craziness, out of order pipelines etc, which are probably not public info. And what they do depends on what they are trying to optimize - speed, or power use etc.

On top of that the best choice of algorithm often depends on the data itself, which is why compression for example might use different techniques depending on the data in a block.

Generally the rule of thumb these days with any optimization is to do empirical testing, rather than come up with an a priori method. Computers are now complex systems, and hoping to plug in some values into a formula and come up with how well an algorithm will work on any particular hardware is often less practical than just trying the thing with some test data, and modifying the technique accordingly.

This topic is closed to new replies.

Advertisement