Advertisement

no performance gains from morton code vs x + y * width?

Started by February 25, 2019 03:57 PM
7 comments, last by frob 5 years, 9 months ago

 

Hi, I've implemented Morton Code via the 'magic bits' method as mentioned in https://www.forceflow.be/2013/10/07/morton-encodingdecoding-through-bit-interleaving-implementations/  and the real-time collision detection book, but I've yet to see the performance gains that it promises over array[x + y * width] even when increasing sizes to 16k * 16k. Do I have to implement the LUT method? Compiler is MSVC, and the particular problem I've integrated it into is a marching squares opcode determination program. The morton encoding functions are inlined, i've checked.

Seems to me it might not be the best for known/fixed spaces. Has anyone reached similar conclusions?

Are you compiling and timing in release or debug mode?  What level of optimizations are you doing with your release build?  And if you're using debug do not trust your numbers in the least.

"Those who would give up essential liberty to purchase a little temporary safety deserve neither liberty nor safety." --Benjamin Franklin

Advertisement
3 hours ago, CrazyCdn said:

Are you compiling and timing in release or debug mode?  What level of optimizations are you doing with your release build?  And if you're using debug do not trust your numbers in the least.

it's release, /O2.

What are the data access patterns of the algorithm? How big is the data set that you're working on? Did you have data before showing that the cache hit ratio was lower than ideal? What's the new cache hit ratio after the change?

On 2/26/2019 at 5:22 PM, Hodgman said:

What are the data access patterns of the algorithm? How big is the data set that you're working on? Did you have data before showing that the cache hit ratio was lower than ideal? What's the new cache hit ratio after the change?

Much of it is opcode generation, where you arrange bits from (x, y), (x + 1, y), (x, y + 1), (x + 1, y + 1) to form a 4bit opcode. Then using this new grid of opcodes you iterate left to right, top to bottom and when find a contour you trace it. Looking at how Morton encoding works, there should be fewer cache misses when you move 'up' the y-axis compared to a grid that has a large width beyond a cache line.

Unfortunately I don't have a profiler that measures cache hits/misses on MSVC2017, if there's a free one that's easy to integrate do tell:)

 

If you're using Morton encoding, then you should never iterate over your grid left-to-right top-to-bottom.  Doing so completely defeats the point of using Morton encoding in the first place.  Instead, iterate in memory layout order.

Given your description of what you're doing with your array, only the actual contour tracing would actually benefit from Morton encoding, and it's unlikely that that's your bottleneck.  But if you're iterating in memory layout order, then at least the other bits won't be any slower than if you didn't use Morton encoding.

Advertisement
18 hours ago, a light breeze said:

If you're using Morton encoding, then you should never iterate over your grid left-to-right top-to-bottom.  Doing so completely defeats the point of using Morton encoding in the first place.  Instead, iterate in memory layout order.

Given your description of what you're doing with your array, only the actual contour tracing would actually benefit from Morton encoding, and it's unlikely that that's your bottleneck.  But if you're iterating in memory layout order, then at least the other bits won't be any slower than if you didn't use Morton encoding.

 

I corrected the iteration order and it's still not showing gains. Looks like morton coding won't be helping for this algorithm. Oh well!

2 hours ago, ddengster said:

Looks like morton coding won't be helping for this algorithm.

One of the important rules of optimization is to measure before you start, ensuring you have a problem that the optimization actually addresses.

Typical grid encoding is a full line at a time.  If you access the cells sequentially and only to the right or left, this is great on the cache.  If you access multiple rows as a set, again working sequentially, this is also great on the cache.  But if you're accessing the cells above, below, or diagonally, and not scanning across the row, the access pattern can be terrible for the CPU cache.

The encoding creates a more cache-friendly pattern for those up/down/diagonal access when not scanning in a row.

If that was not the problem you were having, then the coding probably won't help.  

The problem and the optimization have everything to do with data access patterns.

This topic is closed to new replies.

Advertisement