JoeJ said:
This makes clear why you want to use constant RAM if you can. A situation where this is often no longer possible is having too many lights or bone matrices, or using bindless rendering techniques.
Thanks, that makes things a lot clearer. Then the solution here is just to make the cbuffer as large as supported. The number of animated tiles is never going to become really large. I suppose I could use a shader switch to change the number of supported elements in increments. Do you happen to also know if there is an overhead in having a large cbuffer, let's say 4096 float4s, if only 8 of those are currently in use?
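The "shader switch in increments" idea could be sketched roughly like this in HLSL, compiling the same source with different capacity defines (the names `MAX_ANIM_TILES`, `AnimatedTiles`, and `tileData` are made up for illustration):

```hlsl
// Sketch: pick the cbuffer capacity at compile time, so the engine can
// compile a few variants (e.g. 256 / 1024 / 4096) and bind whichever
// one covers the current tile count.
#ifndef MAX_ANIM_TILES
#define MAX_ANIM_TILES 256
#endif

cbuffer AnimatedTiles : register(b1)
{
    uint   numAnimTiles;             // entries actually valid this frame
    float4 tileData[MAX_ANIM_TILES]; // only the first numAnimTiles are read
};
```

For reference, D3D11 caps a constant buffer at 4096 float4 elements (64 KB), so that is the hard upper bound for the largest variant; the shader only ever reads the first `numAnimTiles` entries regardless of the declared size.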
JoeJ said:
Further, I guess Load turns a texture memory access into the same kind of general memory access we see with StructuredBuffer. But yeah, not sure. VRAM memory access, the related pipelined execution, caching, etc., is where my knowledge is bad. Otherwise GPU performance is easier to predict and understand than CPU perf. to me, because there is no branch prediction, speculative execution and such black boxes.
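For readers following along, the two access paths being compared look like this in HLSL (resource names are illustrative, not from the original posts):

```hlsl
// Texture fetch vs. structured buffer read - the comparison JoeJ is
// guessing about. Whether Load really behaves like a plain buffer read
// is hardware-dependent; this just shows the two forms side by side.
Texture2D<float4>        tileTex : register(t0);
StructuredBuffer<float4> tileBuf : register(t1);

float4 FetchBoth(uint2 coord, uint index)
{
    // Load: unfiltered fetch by integer texel coordinate (mip 0 here);
    // it still goes through the texture unit and its addressing logic.
    float4 a = tileTex.Load(int3(coord, 0));

    // StructuredBuffer: a plain indexed memory read, no texture
    // addressing or filtering involved.
    float4 b = tileBuf[index];

    return a + b;
}
```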
Interesting, I find CPU performance much easier to understand, especially after learning ASM for my JIT-compiler, but even before :D True, there are systems in the background that you don't control directly, but the basics of what's faster seem much clearer to me: use less memory, access memory in a linear fashion, precompute results (as long as it doesn't violate the former two), cache expensive calls in local variables, etc… For GPUs, even if I know what the basically right thing is, I sometimes find it hard to execute - MAD instructions, as an example. The one class we had on (CUDA) compute shaders, where the guy explained how to optimize the performance of a shader by 128x just by changing the way that memory is accessed and how the batches are executed, I couldn't reproduce myself :D Maybe with a bit more experience - I never got around to implementing compute shaders in my engine so far; they're not really all that important for the 2D graphics I'm working on right now.
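The memory-access change from that kind of lecture is usually about coalescing: adjacent threads should read adjacent addresses. A minimal HLSL compute sketch of the contrast (the `STRIDE` value and resource names are hypothetical, and the real speedup factor depends entirely on the hardware):

```hlsl
// Illustrative compute shader: coalesced vs. strided reads.
StructuredBuffer<float>   src : register(t0);
RWStructuredBuffer<float> dst : register(u0);

#define STRIDE 1024 // hypothetical row length

[numthreads(64, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    // Coalesced: thread i reads element i, so a wave of threads pulls
    // one wide contiguous burst from memory.
    float good = src[id.x];

    // Strided: thread i reads element i * STRIDE, so each thread in the
    // wave touches a different cache line, multiplying the number of
    // memory transactions for the same amount of useful data.
    float bad = src[id.x * STRIDE];

    dst[id.x] = good; // keep only the coalesced result
}
```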