On 10/21/2018 at 11:16 PM, JoeJ said:
The 4 cycles per instruction should be correct, because one GCN SIMD is 16 lanes wide, and a wavefront is executed in 4 alternating steps. But AFAIK this has no effect on how we should program.
Indeed, we shouldn't think about this.
To understand the parallelism on the AMD GCN CU SIMDs, it helped me to realise that there's a fixed number of registers (say 256 32-bit, 64-wide VGPRs) and all wavefronts executing on one SIMD (the smallest unit) occupy a fixed portion of those. So if your shader needs 60 VGPRs, the GPU can schedule 256/60 = 4 64-thread wavefronts in parallel (out of the maximum of 10). That means WF0 will occupy registers 0-59, WF1 registers 60-119, WF2 registers 120-179 and WF3 registers 180-239. Registers 240-255 remain unused unless a different shader's wavefronts are ready to squeeze in in parallel. To each individual WF, the registers locally look like VGPR0-59 (each one worth 32 bits of storage, fitting one float or int, for example).
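The occupancy calculation above is just integer division clamped to the hardware limit. A minimal sketch (assuming the 256-VGPR register file and the 10-wavefront cap mentioned above; the function name is mine):

```python
# Sketch: VGPR-limited wavefront occupancy on one GCN SIMD.
# Assumes 256 VGPRs per SIMD and a hardware cap of 10 wavefronts.
def waves_per_simd(vgprs_per_wave, total_vgprs=256, max_waves=10):
    # Each scheduled wavefront occupies a fixed, contiguous block of
    # registers, so the count is the integer quotient, clamped to the cap.
    return min(total_vgprs // vgprs_per_wave, max_waves)

print(waves_per_simd(60))  # 4 wavefronts; registers 240-255 stay unused
print(waves_per_simd(24))  # 10, capped by the hardware limit
```

Note how dropping from 60 to 24 VGPRs more than doubles the number of resident wavefronts, which is exactly why compilers fight to reduce register pressure.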
An important observation is that the wavefronts don't get swapped in or out of the register array. At each clock step, only one of the 4 scheduled wavefronts (in our 60 VGPR case) actually executes instructions. Put differently, they don't really run in parallel on that SIMD. However, as soon as WF0 hits a memory instruction that's gonna take many hundreds of cycles, it gets paused and WF1 can start executing its instructions. That's how the latency hiding works. Other CUs' SIMDs, of course, run in parallel to this and have their own register arrays. On one of the consoles, for example, there are 2 shader engines, each SE has 9 compute units, each CU has 4 SIMDs, and each SIMD can schedule up to 10 WFs. So the GPU is actually executing only 2*9*4 = 72 instructions at each clock step (though each of those 72 instructions runs for 64 threads at once!), yet up to 720 wavefronts can be in flight, because many memory requests are also in flight in parallel. If I made an error with the numbers, please excuse me :)
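The console example boils down to a few multiplications, which can be sketched like this (the figures are the ones quoted above, not vendor-confirmed specs):

```python
# Sketch of the console example: instructions issued per clock
# vs. wavefronts that can be resident (in flight) at once.
shader_engines = 2
cus_per_se = 9
simds_per_cu = 4
max_waves_per_simd = 10
threads_per_wave = 64

simds = shader_engines * cus_per_se * simds_per_cu
print(simds)                        # 72: one instruction issued per SIMD per clock
print(simds * max_waves_per_simd)   # 720: wavefronts that can be in flight
print(simds * threads_per_wave)     # 4608: threads advanced by those 72 instructions
```

The gap between 72 issuing and 720 resident wavefronts is the budget the scheduler has for hiding memory latency.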
If you're bandwidth-bound, your overall speed will depend on the amount of data, and the ALU work is almost for free. But if you add a lot of expensive ALU work, the actual execution of the different WFs' instructions can get serialised (depending on the scheduler), because each WF has stuff to do instead of waiting for the memory.
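A rough way to reason about the bandwidth-bound vs. ALU-bound split is to compare a kernel's FLOPs-per-byte ratio against the machine balance (peak FLOPS divided by peak bandwidth). A back-of-envelope sketch, with illustrative numbers I'm assuming rather than quoting from any spec:

```python
# Rough heuristic: a kernel is bandwidth-bound when its arithmetic
# intensity (FLOPs per byte moved) is below the machine balance.
peak_flops = 4.2e12        # assumed peak: ~4.2 TFLOPS
peak_bandwidth = 176e9     # assumed peak: ~176 GB/s
machine_balance = peak_flops / peak_bandwidth   # ~24 FLOPs per byte

kernel_flops_per_byte = 2.0   # e.g. a simple streaming kernel
print(kernel_flops_per_byte < machine_balance)  # True -> bandwidth-bound
```

In the bandwidth-bound regime on the left of that threshold, adding more ALU instructions costs you little; once a kernel crosses it, ALU execution becomes the thing the wavefronts queue up for.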
The above concerns individual instructions. Scheduling of the individual wavefronts is handled by a dedicated circuit (the shader processor input, SPI), which has some maximum throughput and is shared between the CUs. That means it can't necessarily schedule a new WF every clock cycle, but I believe this usually isn't a bottleneck.
I wanted to post a PDF with the details but I can't find it :D