Vilem Otte said:
Of course you could just dump it all into a float4 buffer, or use 3 buffers - one for matrices, another for offsets and another for sizes.
Sometimes it's clear what's better, AoS or SoA.
If we are in a compute shader, and each thread reads from such structs in an ordered sequence, like so:
vec2 mOffset = bufferOffsets[x + localThreadIndex];
This SoA layout is faster, because adjacent threads read adjacent memory, so the accesses coalesce into few wide loads.
By contrast, if we used AoS like so:
vec2 mOffset = structuredBufferShadowTiles[x + localThreadIndex].mOffset;
It's slower, because the stride between accesses grows to the size of the whole struct. The data is no longer tightly and sequentially packed.
(My example assumes we currently only need mOffset to illustrate the difference.)
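To make the contrast concrete, here is a minimal GLSL sketch of both layouts side by side. The buffer names, bindings and struct fields are illustrative, matching the identifiers used in the snippets above; the exact struct contents are an assumption.

```glsl
// SoA: one tightly packed buffer per field.
layout(std430, binding = 0) buffer OffsetsSoA
{
    vec2 bufferOffsets[]; // consecutive threads read consecutive 8-byte vec2s
};

// AoS: one buffer of whole structs (hypothetical layout).
struct ShadowTile
{
    mat4 mTransform;
    vec2 mOffset;
    vec2 mSize;
};
layout(std430, binding = 1) buffer TilesAoS
{
    ShadowTile structuredBufferShadowTiles[];
};

void readExample(uint x, uint localThreadIndex)
{
    // SoA: stride between neighboring threads is sizeof(vec2) = 8 bytes.
    vec2 a = bufferOffsets[x + localThreadIndex];

    // AoS: stride is sizeof(ShadowTile) (80 bytes under std430 here),
    // even though each thread only wants 8 bytes of it.
    vec2 b = structuredBufferShadowTiles[x + localThreadIndex].mOffset;
}
```

With SoA a warp/wavefront touches one or two contiguous cache lines; with AoS the same reads scatter across ten times as many bytes, which is where the bandwidth goes.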
I have seen cases where the performance difference between AoS and SoA was a factor of ten (!), so memory access patterns really matter, and we should design memory layout carefully.
I remember GCN was fast for data types like float, vec2 and vec4. For larger types like a 4x4 matrix, the benefit shrinks quickly.
But ofc. this mostly applies to compute shaders in practice, because there we have precise control over the access pattern of all threads.
Assuming ray generation shaders are grouped in tiles, SoA might be a win here too. Ideally, though, we would know and replicate exactly how the GPU groups threads for ray generation shaders, so that access stays as sequential as possible across the whole SM / CU.
It also sucks in general that trying out different layouts is always a lot of work.
That said, even if there is a way that SSBO can win over local arrays, it may be hard to find, even for an experienced expert. And worse: it may differ across various chips.
Btw, besides AoS vs. SoA I also tried to measure a difference between SSBO and images, but that difference was zero in my case.