4 hours ago, ahmadierfan99 said:
And do you believe that in the range of today's modern CPUs, if an algorithm is better than another, it also performs better on other similar hardware?
It gets complicated really quickly.
At the most basic level, start with the general timings every programmer should learn (the classic "latency numbers every programmer should know" figures), although they're a little old. Intel and AMD publish timings for every new chip.
At a very rough and imprecise level, on a Coffee Lake machine running at 5 GHz:
- accessing a value in a register takes about 0.2 ns,
- an L1 cache hit takes about 1.2-2.1 ns,
- an L2 cache hit takes about 3.0-5.3 ns,
- an unshared L3 cache line takes about 12.0-21.4 ns,
- a shared clean L3 line takes about 19.5-34.8 ns,
- a shared dirty L3 line takes about 22.5-40.2 ns,
- and a remote L3 line takes about 30.0-160.7 ns.
Accessing out to DRAM depends on the memory chips, the motherboard, BIOS settings, and more, but figure roughly 60 ns for low-latency chips, around 100 ns for common chips, and longer for the cheap stuff. Branching in the CPU takes time, and navigating the CPU pipeline can get complicated, too. Near the ideal you can hit about 4 CPU instructions per CPU cycle, and the out-of-order core means you can potentially have multiple operations in flight at once, so even though there is memory latency you can have several requests in progress and run into different limits.
Start with those, and do a logical exercise about prediction, caching, and access times.
Let's take an array of 10,000 integers. That's very near the modern cut-off for several of those algorithms.
A linear search over simple integers will be perfectly predicted within just a few iterations, so it will quickly approach the ideal. Assuming the CPU is perfectly fed with data in a tight loop, it can approach 0.1 nanoseconds per integer. With SIMD instructions it can go even lower, but we won't address that here. A linear search examines half the elements on average, so the average search will be around 500 nanoseconds. That holds on just about any configuration; warming the cache costs roughly the same across all of them.
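For reference, the loop being described is just a plain scan: the branch is trivially predicted and the access pattern is sequential, which is exactly what the prefetcher likes. A minimal sketch (the function name is mine, nothing standard):

```cpp
#include <cstddef>

// Plain linear search: sequential access and a trivially predicted branch,
// so the prefetcher keeps the core fed and the cost approaches ~0.1 ns per element.
std::ptrdiff_t linearFind(const int* data, std::size_t count, int value)
{
    for (std::size_t i = 0; i < count; ++i) {
        if (data[i] == value)
            return static_cast<std::ptrdiff_t>(i);
    }
    return -1; // not found
}
```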
Now let's consider a binary search. The loop itself will probably sit in the instruction cache, but for large data nearly every probe will trigger a data cache miss. There will be about 14 comparisons (roughly log2 of 10,000), and each one is a dependent cache miss to either L3 or main memory. In the best case on that highly clocked machine, 14 unshared L3 lookups come to about 168 ns; if the lines are shared, perhaps 487 ns; if it has to go out to main memory, perhaps 1400 ns. So there is enormous variability, anywhere from about 170 ns to 1400 ns depending on the state of the machine.
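For contrast, here is the access pattern in question. Note that each probe's address depends on the previous comparison, so those ~14 loads cannot be overlapped the way the linear scan's sequential loads can. Again just a sketch:

```cpp
#include <cstddef>

// Classic binary search: ~log2(count) probes, but each probe's address depends
// on the previous comparison, so a cache miss on one must resolve before the
// next load can even be issued.
std::ptrdiff_t binaryFind(const int* data, std::size_t count, int value)
{
    std::size_t lo = 0, hi = count; // half-open range [lo, hi)
    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;
        if (data[mid] < value)
            lo = mid + 1;
        else if (value < data[mid])
            hi = mid;
        else
            return static_cast<std::ptrdiff_t>(mid);
    }
    return -1; // not found
}
```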
Further complicating things, across repeated searches the first few probes (the same midpoints every time) will likely stay in L1 cache since they're hit over and over, but the later probes will still need the full lookup.
So without any context, that same binary search might be roughly 1/4 the time of a linear search, or it might be 3x the time of a linear search, or anywhere in between.
Change to a CPU with a bigger L1 or L2 cache that holds the entire collection, and the tradeoff changes again. If everything fits in the L2 cache, then instead of taking 170 to 1400 nanoseconds the binary search plummets to about 50 nanoseconds (14 lookups at roughly 3-4 ns each). Now it's about 1/10 the time of a linear search, much faster than the alternative.
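If you want that arithmetic in one place, here is a minimal back-of-envelope sketch in C++. The constants are just the rough figures quoted above (illustration only, not measurements of any particular machine), and the names and labels are mine:

```cpp
#include <cstdio>

// Rough Coffee Lake-era figures from above, in nanoseconds.
constexpr double kL2HitNs            = 3.5;   // ~3.0-5.3 ns, call it 3.5
constexpr double kL3HitNs            = 12.0;  // ~12-21 ns; low end = best case
constexpr double kDramNs             = 100.0; // ~60-100+ ns depending on the parts
constexpr double kPredictedCompareNs = 0.1;   // tight, perfectly predicted loop

int main()
{
    const int n           = 10000; // array size from the example
    const int binarySteps = 14;    // roughly log2(10,000)

    std::printf("linear, average (n/2 tests):  %6.0f ns\n", (n / 2) * kPredictedCompareNs);
    std::printf("binary, all probes hit L3:    %6.0f ns\n", binarySteps * kL3HitNs);
    std::printf("binary, all probes hit DRAM:  %6.0f ns\n", binarySteps * kDramNs);
    std::printf("binary, everything in L2:     %6.0f ns\n", binarySteps * kL2HitNs);
    return 0;
}
```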
Red-black tree searches are a little more complex: the tree stays fairly shallow, and the nodes carry multiple pieces of information, including the color of the node. With a degree of luck, fetching the value also pulls in the node's color on the same cache line, and the cost is about the same as a regular binary search. There is potentially a little more work, so call it at least the time of a binary search and potentially somewhat more. Still, it lands in the range of a quarter of the time of a linear search up through three times the time of a linear search, or much faster if it all fits in L2 cache, all depending on external factors.
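In C++ terms this is roughly what a std::set or std::map lookup does; most standard library implementations use a red-black tree under the hood, although the standard doesn't require it. A sketch:

```cpp
#include <set>

// std::set is typically (though not guaranteed to be) a red-black tree.
// As with the binary search, every step chases a pointer, so every step is a
// dependent load, and the nodes themselves are scattered across the heap.
bool treeContains(const std::set<int>& values, int value)
{
    return values.find(value) != values.end();
}
```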
Hash tables have the cost of computing the hash, then the cost of a memory lookup, then the check for collisions. Call the fetch about 100 nanoseconds from main memory or about 30 ns from L3 cache. The time to compute the hash will probably be well under 400 ns, so the two added together will usually come in under the 500 ns average for a linear search.
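The same thing with std::unordered_set: the work is the hash (pure CPU, no memory stall), one jump to the bucket (usually the single cache miss), then equality checks to resolve any collisions. A sketch:

```cpp
#include <unordered_set>

// Hash lookup: compute the hash (pure CPU work, no memory stall), index
// straight into the bucket array (usually the one cache miss), then compare
// keys to resolve any collisions in that bucket.
bool hashContains(const std::unordered_set<int>& values, int value)
{
    return values.find(value) != values.end();
}
```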
You can do similar thought experiments with each data structure. Change the sizes and different tradeoffs make sense. For arrays under about 5000 items, a linear search is generally fastest on average. Other algorithms become better as the list grows, with hash tables quickly emerging as the best alternative.
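If you'd rather see where the crossover lands on your own hardware than on paper, a rough harness like the one below is enough to get a feel for it. It's only a sketch: a serious measurement would use a proper benchmark library, add warmup, and guard against the optimizer discarding the searches.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <vector>

// Very rough comparison of linear vs. binary search at a few sizes.
// A real measurement needs warmup, many repetitions, and care that the
// compiler doesn't optimize the searches away; this only shows the shape.
int main()
{
    for (std::size_t n : {1000, 10000, 100000}) {
        std::vector<int> data(n);
        std::iota(data.begin(), data.end(), 0); // sorted 0..n-1

        const int queries = 1000;
        long long checksum = 0; // keeps the results "used"

        const auto t0 = std::chrono::steady_clock::now();
        for (int q = 0; q < queries; ++q)
            checksum += std::find(data.begin(), data.end(), (q * 7919) % (int)n) - data.begin();
        const auto t1 = std::chrono::steady_clock::now();
        for (int q = 0; q < queries; ++q)
            checksum += std::lower_bound(data.begin(), data.end(), (q * 7919) % (int)n) - data.begin();
        const auto t2 = std::chrono::steady_clock::now();

        const auto perSearchNs = [queries](auto a, auto b) {
            using std::chrono::duration_cast;
            using std::chrono::nanoseconds;
            return (long long)(duration_cast<nanoseconds>(b - a).count() / queries);
        };
        std::printf("n=%zu  linear ~%lld ns/search  binary ~%lld ns/search  (checksum %lld)\n",
                    n, perSearchNs(t0, t1), perSearchNs(t1, t2), checksum);
    }
    return 0;
}
```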
Next, if you're searching something other than plain integers, everything changes. If the comparison calls a function, you pay function call overhead for each comparison, making a linear search far more expensive; instead of being virtually free inside the out-of-order core at around 0.1 nanoseconds, each test takes about 5-10 nanoseconds for the function call. So instead of testing 5000 items on average at 0.1 ns, you're testing 5000 items on average at around 7 nanoseconds. The binary search's 14 lookups also add another 7 nanoseconds or so each for their function calls. Suddenly the balance point where the binary search becomes the better choice drops to around 500 items in the array.
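That situation looks something like this; the Record type and the comparison function are hypothetical, but the point is that a compare routine the compiler can't inline (a function pointer here, or a virtual call, or a comparator in another translation unit) adds call overhead to every element tested:

```cpp
#include <cstddef>

// A hypothetical record type with an out-of-line comparison. Because the call
// goes through a function pointer the compiler can't inline it, so every
// element tested pays call overhead on top of the memory access.
struct Record {
    int key;
    // ... other fields ...
};

using Compare = int (*)(const Record&, const Record&);

int compareByKey(const Record& a, const Record& b)
{
    if (a.key < b.key) return -1;
    if (b.key < a.key) return 1;
    return 0;
}

std::ptrdiff_t linearFindWithCompare(const Record* data, std::size_t count,
                                     const Record& value, Compare cmp)
{
    for (std::size_t i = 0; i < count; ++i) {
        if (cmp(data[i], value) == 0) // ~5-10 ns per test instead of ~0.1 ns
            return static_cast<std::ptrdiff_t>(i);
    }
    return -1; // not found
}
```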
And with all of that, searches generally aren't the big bottlenecks of modern games. They contribute to the processing time, but they aren't where the big costs usually are.