33 minutes ago, noodleBowl said:
I'm not sure if this is the problem. I might still be doing it wrong, but I went back and attempted to use the result, and I'm still getting 0ms :/
//Setup code for the data is here
QueryPerformanceCounter(&startTime);
for (int i = 0; i < INT_MAX; ++i)
{
for (int j = 0; j < 4; ++j)
{
//smMat (input matrix), smVec (input vector), smRes (output result)
simdMul(smMat, smVec, smRes);
}
}
QueryPerformanceCounter(&endTime);
std::cout << "New RAW SIMD Solution TIME: " + std::to_string((double)((endTime.QuadPart - startTime.QuadPart) * 1000000) / (double)frq.QuadPart) + "micro" << std::endl;
//Attempt to use the result
float rData[4];
_mm_store_ps(rData, smRes.data);
You're still not using the result.
Storing into rData is not enough, because nothing reads rData afterwards. printf() rData[0] through rData[3] so the result is actually used.
And source your input from something unknown, like argv from main or by reading from a file; otherwise the optimizer may evaluate the code at compile time and hardcode everything (since everything could otherwise be resolved at compile time rather than calculated at runtime).
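To make those two points concrete, here is a minimal sketch of a benchmark whose input is only known at runtime and whose output is printed. It is not your code: `mulMat4Vec4`, `parseSeed`, `timedMul` and `printResult` are hypothetical names, the multiply is a plain scalar stand-in for `simdMul`, and I swapped `QueryPerformanceCounter` for `std::chrono::steady_clock` only to keep the example portable.

```cpp
#include <chrono>
#include <cstdio>
#include <cstdlib>

// Hypothetical stand-in for simdMul: a plain 4x4 matrix * 4-vector product.
struct Vec4 { float v[4]; };
struct Mat4 { float m[4][4]; };

Vec4 mulMat4Vec4(const Mat4& mat, const Vec4& vec)
{
    Vec4 out{};
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            out.v[r] += mat.m[r][c] * vec.v[c];
    return out;
}

// Read the seed from the command line so its value is unknown at compile time.
float parseSeed(int argc, char** argv)
{
    return (argc > 1) ? static_cast<float>(std::atof(argv[1])) : 1.0f;
}

// Runs the multiply `iterations` times, returns the final vector and writes
// the elapsed time in microseconds to `outMicros`.
Vec4 timedMul(float seed, int iterations, double& outMicros)
{
    Mat4 mat{};
    for (int i = 0; i < 4; ++i)
        mat.m[i][i] = seed;            // diagonal matrix scaled by the runtime seed
    Vec4 acc{{1.0f, 2.0f, 3.0f, 4.0f}};

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        acc = mulMat4Vec4(mat, acc);   // each iteration depends on the previous one
    const auto end = std::chrono::steady_clock::now();

    outMicros = std::chrono::duration<double, std::micro>(end - start).count();
    return acc;
}

// Print every component so the result is observable and cannot be discarded.
void printResult(const Vec4& r, double micros)
{
    std::printf("%f %f %f %f (%.3f us)\n", r.v[0], r.v[1], r.v[2], r.v[3], micros);
}
```

From main you would call something like `double us; Vec4 r = timedMul(parseSeed(argc, argv), 1000000, us); printResult(r, us);`. Because each iteration feeds the next one and the final vector is printed, the compiler has far less room to throw the loop away, although a clever optimizer can still simplify more than you expect.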
33 minutes ago, noodleBowl said:
Can you explain what a wavefront, warp, and bank conflict is? I never really heard of these terms before
Sure. I can save you some time by telling you warp and wavefront are synonyms. A warp is what NVIDIA's marketing dept calls them, a wavefront is what AMD's marketing dept calls them.
You can also check my Where do I start Graphics Programming post for resources. Particularly the "Very technical about GPUs." section. You may find the one that says "Latency hiding in GCN" relevant.
As for bank conflicts... you got it quite right. Memory is subdivided into banks. If all threads in a wavefront access the same address in a bank, all is ok (the value is typically broadcast). If each thread accesses a different bank, all is ok. But if some of the threads access different addresses in the same bank, those accesses get serialized and things get slower.
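If it helps to see the arithmetic, here is a small CPU-side sketch that counts how badly a given access pattern conflicts. The layout is an assumption (32 banks, one 4-byte word per bank, consecutive words in consecutive banks); real hardware differs between vendors and generations. `conflictDegree` and `stridedAccess` are names I made up for the illustration, and identical addresses are treated as a broadcast rather than a conflict.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Assumed layout: 32 banks, each one 4-byte word wide (hardware dependent).
constexpr std::uint32_t kNumBanks = 32;

// Returns how many serialized passes the access needs: the largest number of
// distinct word addresses that land in the same bank. 1 means no conflict.
std::uint32_t conflictDegree(const std::vector<std::uint32_t>& wordAddresses)
{
    std::map<std::uint32_t, std::set<std::uint32_t>> addressesPerBank;
    for (std::uint32_t addr : wordAddresses)
        addressesPerBank[addr % kNumBanks].insert(addr);

    std::uint32_t worst = 0;
    for (const auto& entry : addressesPerBank)
        worst = std::max(worst, static_cast<std::uint32_t>(entry.second.size()));
    return worst;
}

// Builds the word addresses touched by `threads` threads reading with a fixed stride.
std::vector<std::uint32_t> stridedAccess(std::uint32_t threads, std::uint32_t stride)
{
    std::vector<std::uint32_t> addrs;
    for (std::uint32_t t = 0; t < threads; ++t)
        addrs.push_back(t * stride);
    return addrs;
}
```

Under these assumptions, 32 threads with stride 1 give a degree of 1 (every thread hits its own bank), stride 2 gives 2 (pairs of threads share a bank), stride 32 gives 32 (everyone hits bank 0 at different addresses), and stride 0 gives 1 again because all threads read the same address.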
33 minutes ago, noodleBowl said:
I'm not really sure how I feel about this, because a part of me says "good way better!" and the other part says "it could be better...". But I feel like in order to get it to be better I need to start exploring different methods such as the ones described by @Matias Goldberg and others. Also looking into using multithreading, since I'm definitely only using a single thread
I was just trying to point out there's always a more efficient way to do things.
But you're making a videogame. At some point you have to say "STOP!" to yourself, or else you'll end up in an endless spiral of constantly reworking things and never finishing your game. There's always a better way. You need to know when something is good enough.
I suggest you go for the vertex shader option I gave you (inPos * worldMatrix[vertexId / 6u]). If that gives you enough performance for what you need, move on.
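In case the indexing is not obvious, this is roughly what that expression does, emulated on the CPU. It is only a sketch: I'm assuming each sprite is drawn as 6 vertices (two triangles) and that `inPos * worldMatrix[...]` is a row vector times a matrix. `Float4`, `Float4x4`, `mulRowVec`, `spriteIndex`, `transformVertex` and `translation` are hypothetical names.

```cpp
#include <cstdint>
#include <vector>

struct Float4   { float x, y, z, w; };
struct Float4x4 { float m[4][4]; };

// Row vector times matrix, matching the order in `inPos * worldMatrix[...]`.
Float4 mulRowVec(const Float4& p, const Float4x4& a)
{
    const float in[4] = {p.x, p.y, p.z, p.w};
    float out[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (int c = 0; c < 4; ++c)
        for (int r = 0; r < 4; ++r)
            out[c] += in[r] * a.m[r][c];
    return {out[0], out[1], out[2], out[3]};
}

// Assumption: 6 vertices per sprite, so vertices 0..5 use matrix 0, 6..11 use matrix 1, etc.
std::uint32_t spriteIndex(std::uint32_t vertexId)
{
    return vertexId / 6u;
}

// CPU emulation of what the vertex shader computes for one vertex.
Float4 transformVertex(const Float4& inPos, std::uint32_t vertexId,
                       const std::vector<Float4x4>& worldMatrix)
{
    return mulRowVec(inPos, worldMatrix[spriteIndex(vertexId)]);
}

// Helper for the example: translation matrix in row-vector convention
// (the offset lives in the last row).
Float4x4 translation(float tx, float ty)
{
    Float4x4 t{};
    for (int i = 0; i < 4; ++i)
        t.m[i][i] = 1.0f;
    t.m[3][0] = tx;
    t.m[3][1] = ty;
    return t;
}
```

With `worldMatrix = { translation(10, 0), translation(0, 20) }`, vertex 7 belongs to sprite 1, so a position of (1, 1, 0, 1) ends up at (1, 21, 0, 1). On the GPU the integer division costs next to nothing and you upload one matrix per sprite instead of transforming every vertex on the CPU.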
Compute Shader approaches also restrict your target hardware (e.g. older GPUs won't be able to run them), which is a terrible idea for a 2D game.