I remember doing safe branching in SIMD with the Intel instructions that take an extra vector for the decision making. But it is only good for a few situations, and I don't remember exactly which situations I used it in.
Are there some memory limits to consider when writing a shader?
20 minutes ago, l0calh05t said:Many vector instruction sets offer masked operations nowadays. And with instructions like movemask you can make sure that only those branches that are in use are evaluated. So it really is more of a programming model thing than actual differences in hardware.
You could implement SIMT with this functionality, but it's still a level below SIMT. It is a difference in hardware, since a GPU handles branch divergence in hardware AFAIK. But then again I've never hacked GPU assembly so you never know, but I think I've read that GPUs automatically handle this for you.
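To illustrate what I mean, here's a minimal HLSL sketch (the buffer name is made up) of a divergent branch: when lanes of the same wave disagree on the condition, the hardware executes both sides and masks off the inactive lanes for you, with no explicit mask handling in the shader.

RWStructuredBuffer<float> gData : register(u0); // hypothetical buffer

[numthreads(64, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    float v = gData[dtid.x];

    // Lanes of the same wave may disagree on this condition.
    // The hardware runs both branches and masks inactive lanes;
    // the shader contains no movemask/blend code of its own.
    if (v > 0.0f)
        v = sqrt(v);
    else
        v = 0.0f;

    gData[dtid.x] = v;
}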
-potential energy is easily made kinetic-
On 22.9.2017 at 12:36 AM, NikiTo said:Would it be a problem to create in HLSL ~50 uninitialized arrays of ~300000 cells each and then use them for my algorithm?
On the GPU you have only 32 KB of fast LDS memory, and that's not enough for you, so you need to use global device memory. But if you launch e.g. 1000 threadgroups, each consisting of 64 threads (can be anywhere from 32 up to 1024), you would need to allocate 1000 * 64 * 300000 cells if you need a unique 300000 cells for each thread (because they may all run in parallel). That's the main limitation you need to think about.
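As a rough sketch of the indexing that implies (numbers taken straight from the example above, names are placeholders), each thread would own its private slice of one big UAV. Note that with the full 1000-group example the footprint is 1000 * 64 * 300000 floats ≈ 76.8 GB and the index overflows 32 bits, which is exactly the limitation being described; in practice you would have to shrink these numbers.

#define THREADS_PER_GROUP 64
#define CELLS_PER_THREAD 300000

// One big UAV sized numGroups * THREADS_PER_GROUP * CELLS_PER_THREAD elements.
RWStructuredBuffer<float> gCells : register(u0);

[numthreads(THREADS_PER_GROUP, 1, 1)]
void CSMain(uint3 gid : SV_GroupID, uint gi : SV_GroupIndex)
{
    // Each thread owns a private range of CELLS_PER_THREAD cells in device memory.
    // With the full example sizes this overflows a 32-bit index and far exceeds
    // device memory, so the constants would need to be much smaller in practice.
    uint base = (gid.x * THREADS_PER_GROUP + gi) * CELLS_PER_THREAD;

    for (uint i = 0; i < CELLS_PER_THREAD; ++i)
        gCells[base + i] = 0.0f; // the algorithm would read/write its cells here
}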
55 minutes ago, JoeJ said:so you need to use global device memory.
I knew I forgot something in my reply... the original question was about shaders, not the CPU implementation.
-potential energy is easily made kinetic-
My initial intention was to run millions of those threads on the GPU...
Thank you all! Now I have it much clearer.
On 9/23/2017 at 4:16 AM, Infinisearch said:You could implement SIMT with this functionality, but it's still a level below SIMT. It is a difference in hardware, since a GPU handles branch divergence in hardware AFAIK. But then again I've never hacked GPU assembly so you never know, but I think I've read that GPUs automatically handle this for you.
Sorry for bringing up an old topic, but I realized I was in error at the time of writing this. Even if you can lane-mask in a SIMD architecture, a vector register (let's say r5) refers to a different register per lane in a SIMT architecture, while in a SIMD architecture the r5 in an instruction refers to the same register for every lane. So in a SIMT architecture a vector register name refers to SIMT-width many different registers, while in SIMD it all refers to the same register. I suppose if you performed some sort of gather/scatter operation on a SIMD register it would be possible to simulate having a different register in each component of a SIMD register, but that seems like an inefficient way of doing things.
Edited for clarity.
-potential energy is easily made kinetic-
Sorry for bumping this again, but this time I think I'm wrong in my post above. Each element of a SIMD vector can hold the same register for a different thread. So vR5 + vR6 would operate on the same registers of n threads for n-wide SIMD. For some reason I got confused and thought the register layout had to be t1vr1, t1vr2, t1vr3... and forgot about the possibility of an organization like t1vr1, t2vr1, t3vr1. My mistake... sorry for any confusion.
-potential energy is easily made kinetic-
I am sorry for bumping this yet again, but I think it is better than starting a new topic.
I now have to organize my chunks of work for the compute engine.
My AMD GPU reports a wave lane count of 64 for both MIN and MAX.
I feel comfortable with numthreads(256, x, 1).
If I understand it correctly, the wave lane count of 64 means that I should try NOT to dispatch fewer than 64 threads in a group.
And if I understand it correctly, the max number of threads in a group depends on the 1024-thread limit of Shader Model 5 and on the amount of shared memory.
I mean, as long as all the threads in a group together don't use more memory than the available shared memory, I am OK.
It is very important for me to design the group dimensions now. Later I will try to use some AMD extensions in the shaders to communicate between threads.
Is that communication between threads limited to the wave width or to the number of threads in the group? I mean, can I broadcast a value to 1024 threads at once, or only to 64?
It would help me a lot to know these things beforehand.
One CU has 64 cores and they process larger groups in alternating order, so a (256,1,1) group processes an instruction in 4 steps. But you never notice that as a programmer, with the exception that if all 64 threads of a wavefront skip over a block of code within a larger group, you get a speedup. So it sometimes makes sense to compact work.
All threads in a group can communicate by using shared memory (LDS), which is quite fast. This works without broadcast extensions. Extensions are limited to the 64 threads of a wavefront even if the group is larger.
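For example, here is a minimal sketch of broadcasting one value to all 256 threads of a group through LDS, with no extensions needed (the value written by thread 0 is just a placeholder):

groupshared float gBroadcast; // one LDS slot shared by the whole group

[numthreads(256, 1, 1)]
void CSMain(uint gi : SV_GroupIndex)
{
    // One thread writes the value to LDS...
    if (gi == 0)
        gBroadcast = 42.0f; // placeholder for whatever thread 0 computed

    // ...everyone waits until the write is visible...
    GroupMemoryBarrierWithGroupSync();

    // ...and all 256 threads of the group can read it,
    // regardless of the 64-thread wavefront size.
    float value = gBroadcast;
    // use 'value' in the rest of the shader
}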
3 hours ago, NikiTo said:And if I understand it correctly, the max number of threads in a group depends on the 1024-thread limit of Shader Model 5 and on the amount of shared memory.
Yes, but there's more to it: shared memory and register count are also limits. The less you use, the more groups can be executed on a CU ('occupancy'). If one group gets stalled waiting on memory access, another group can be processed in the meantime (like hyperthreading). This is important to hide memory latency, so try to limit register and LDS usage, and use a profiling tool to show those numbers.
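A rough sketch of how LDS usage caps occupancy (using the 32 KB figure mentioned earlier; the exact limits depend on the hardware and on register usage too):

groupshared float gCache[1024]; // 1024 floats * 4 bytes = 4 KB of LDS per group

[numthreads(256, 1, 1)]
void CSMain(uint gi : SV_GroupIndex)
{
    // Each thread initializes 4 cells of the shared cache.
    for (uint i = 0; i < 4; ++i)
        gCache[gi * 4 + i] = 0.0f;
    GroupMemoryBarrierWithGroupSync();
    // ...rest of the shader works out of gCache...
}

// With roughly 32 KB of LDS available, at most 32 / 4 = 8 such groups can be
// resident at once. Declare gCache[4096] (16 KB) instead and only 2 groups fit,
// leaving the CU less other work to switch to while one group waits on memory.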
3 hours ago, NikiTo said:the wave lane count of 64 means that I should try NOT to dispatch fewer than 64 threads in a group.
Yes, the other threads would go idle. But NV / Intel use 32 / 8 threads per wave, so smaller groups can make sense there.