I'm working on porting 5D noise to HLSL, and I have found many places where I would simply like to add up all the numbers in a vector, or all the numbers in one row of a matrix. Obviously I can do it one element at a time, or I guess I can do something like dot(MyVec,1.0), but of course that has the multiplications by 1. Does anyone know a better way to do this? If not, which way would be faster: one element at a time, or using the dot? I did scan through the documentation but I didn't see anything obvious. Perhaps I'm missing something, though.
HLSL question about vectors and matrixes.
Hi,
I think you will only know it for sure by benchmarking it, even though I don't know a good tool to benchmark GPUs.
I can't help you with HLSL, but I dealt a lot with CPU-side vectorization over the last year. Let's say the HLSL compiler is not super smart; then the dot product is probably slightly faster, since I guess this function is highly optimized. It probably does some vectorized operations like 'multiply - swizzle - add - swizzle - add - extract value' for a vector of length 4. That is six operations in total (not counting the creation of the vec(1,1,1,1) and storing intermediate results), where the swizzles are rather cheap operations. Loading a value from a vector is only cheap for the first vector element. --- However, I can only back up these efficiency claims for CPU vectorization. Things might be different on the GPU. I don't know.
On the other hand, if you write v.x + v.y + v.z + v.w, that is 4 extractions and 3 additions. Assuming the same operation costs as on the CPU, you have only one cheap extraction of the first vector value (v.x), then 3 expensive ones plus 3 expensive additions. So this solution is probably slower than the dot product. However, if the compiler is smart, it might vectorize the sum and then things are different again.
What might work for a semi-smart compiler is the following pseudo code:
vec4 v = vec4(1,2,3,4);
vec4 tmp = v + vec4(v.y, v.x, v.w, v.z);
vec4 rVec = tmp + vec4(tmp.z, tmp.w, tmp.x, tmp.y);
float result = rVec.x;
In this case, I hope for the compiler to turn vec4(v.y, v.x, v.w, v.z) and vec4(tmp.z, tmp.w, tmp.x, tmp.y) into cheap swizzle operations. This might be the fastest solution if the same rules as for the CPU apply. It would potentially result in the same instructions as the dot product without the multiplication. However, if the dot product has specific hardware support (I don't know), it might still be faster, which brings me back to my initial statement:
You will only know the fastest solution if you benchmark it. In addition, the result of the benchmark might also be hardware dependent.
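For what it's worth, HLSL lets you write the swizzles from the pseudo code directly. This is only a sketch (the function name SumPairwise is made up for illustration), and whether it actually beats the plain sum is something you would have to measure on your hardware:

```hlsl
// Hypothetical helper: pairwise (tree) reduction of a float4 using swizzles.
float SumPairwise(float4 v)
{
    float4 tmp  = v + v.yxwz;      // (x+y, x+y, z+w, z+w)
    float4 rVec = tmp + tmp.zwxy;  // every component now holds x+y+z+w
    return rVec.x;
}
```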
Greetings
It is perfectly fine to use the + operators themselves, in an explicitly defined function or inline, such as
float vectorsum(float4 v)
{
    return v.x + v.y + v.z + v.w;
}
As far as desktop GPUs go, all of the recent ones are scalar in terms of how they process the math for a single thread. So there's no dedicated dot product instruction or anything like that; instead the GPU will do a sequence of multiply-add operations. In this case the compilers are smart enough to optimize away a multiply by 1, so the hardware will just end up doing 3 adds. Being explicit is always better if you can help it, but I doubt it would matter in this case.
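To make the comparison concrete, here is a sketch of the dot-based variant (the name SumDot is just for illustration). Given the above, I would expect it to end up as the same handful of adds as the explicit sum on scalar hardware, but that is an expectation rather than something verified here:

```hlsl
// Hypothetical helper: sum the components via a dot product with ones.
float SumDot(float4 v)
{
    // Computes v.x*1 + v.y*1 + v.z*1 + v.w*1.
    return dot(v, float4(1.0, 1.0, 1.0, 1.0));
}
```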
14 hours ago, MJP said: In this case the compilers are smart enough to optimize away a multiply by 1
Since shaders are compiled on the client every time, I would never advise counting on it.
Thanks guys. For the last few days, I've been suffering so much trying to figure out constant buffer array packing that I stopped worrying about the dot products. I ended up just writing a vectorsum function as @JohnnyCode suggested. I also tried to pack stuff into a 4x4 matrix, do an inversion and then add the rows. I'm not really sure if I'm saving time or costing time, but I guess it works.
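For the matrix case, a minimal sketch of summing rows could look like the following. The function names are illustrative only, and it assumes the values already sit in a float4x4:

```hlsl
// Hypothetical helper: sum of one row. m[row] selects that row as a float4.
float RowSum(float4x4 m, int row)
{
    float4 r = m[row];
    return r.x + r.y + r.z + r.w;
}

// Hypothetical helper: sums of all four rows at once. Multiplying by a
// column vector of ones gives result[i] = dot(row i of m, ones).
float4 RowSums(float4x4 m)
{
    return mul(m, float4(1.0, 1.0, 1.0, 1.0));
}
```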
Some GPUs will have a "horizontal add" instruction, to quickly sum all elements of a vec4, others won't...
HLSL (and GLSL AFAIK) don't have any way to express this operation directly though, besides repeated addition or the "dot with 1.0" trick. You just have to hope that the graphics drivers recognise this sequence of HLSL bytecode instructions and can compile it into the appropriate GPU-specific instructions (such as a horizontal add, if it exists...).
AMD actually have some tools where you can provide compiled HLSL code as input, and it will display actual AMD GCN assembly as output, showing what their drivers will do when you load shaders at runtime. I usually wouldn't bother putting in that level of effort unless you're desperate for microseconds though.