Hi guys, I'm doing some tests to better understand common bottlenecks on modern GPUs.
I'm taking a look at Sebastian Aaltonen's valuable performance test tool (https://github.com/sebbbi/perftest).
What I'm trying to understand is how its numbers relate to the theoretical texture fill rate of the GPU.
The first thing I'm having trouble with is how the nominal texture fill rate should be computed, since it is usually given as just:
Texture Rate = Clock Rate * TMUs.
However, taking the AMD Radeon RX 5700 XT as an example, since it is a well-documented GPU, this number doesn't match what I'm reading in the RDNA architecture whitepaper (https://www.amd.com/system/files/documents/rdna-whitepaper.pdf).
The paper states: "[...] the texture mapping unit, which can perform filtering for up to eight texture addresses per clock – again twice the throughput of the prior generation. For each address, the TMU will sample the four nearest neighbors, decompress the data, and perform interpolation. The final texel value is passed back to the SIMD via the response bus".
From this statement I would say that each TMU can perform up to 8 bilinear samples per clock.
So if we take the number of TMUs listed in the GPU specs (https://www.techpowerup.com/gpu-specs/radeon-rx-5700-xt.c3339), which is 160, it seems legitimate to compute the texture fill rate as follows:
Texture Rate = 1905 MHz x 160 TMUs x 8 (samples/clock) = 2.438 TTexel/s.
But this is clearly not the number AMD provides, which is just 304.8 GTexel/s (1905 MHz x 160 TMUs).
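To make the gap between the two figures explicit, here is a minimal sketch of both calculations. The clock and TMU count are the ones quoted above; the factor of 8 is only my reading of the whitepaper and may not apply per listed TMU, and the function name `texel_rate` is made up for illustration.

```python
# Figures quoted in the post for the RX 5700 XT.
CLOCK_HZ = 1905e6             # boost clock, 1905 MHz
TMUS = 160                    # TMU count from the spec listing
ADDRESSES_PER_TMU_CLOCK = 8   # ASSUMPTION: my reading of the whitepaper quote


def texel_rate(clock_hz, units, per_unit_per_clock=1):
    """Texels per second for `units` texture units at `clock_hz`."""
    return clock_hz * units * per_unit_per_clock


# Conventional formula: one texel per TMU per clock.
nominal = texel_rate(CLOCK_HZ, TMUS)
# Same formula scaled by the assumed 8 addresses per TMU per clock.
scaled = texel_rate(CLOCK_HZ, TMUS, ADDRESSES_PER_TMU_CLOCK)

print(f"nominal: {nominal / 1e9:.1f} GTexel/s")   # 304.8 GTexel/s
print(f"scaled:  {scaled / 1e12:.3f} TTexel/s")   # 2.438 TTexel/s
```

The two results differ by exactly the assumed factor of 8, so the whole question comes down to what the whitepaper and the spec listing each count as one "TMU".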
A second question is related to the performance test itself.
The performance test mentioned above runs 1024x1024 compute shader (CS) threads, each reading 256 elements from a buffer of the type under test. The whole operation is repeated 30 times, for a total of about 8 billion reads.
The total time of those reads is collected via GPU queries.
On the RX 5700 XT the random RGBA8 buffer read takes 12.617 ms, which is roughly 332 reads/clock (taking 8 billion reads at 1905 MHz).
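For reference, this is a small sketch of how the reads/clock figure falls out of the numbers above. It assumes the 12.617 ms covers all 30 repetitions and that the GPU holds 1905 MHz for the whole run; both are assumptions on my side, not something the tool reports.

```python
# Workload as described above.
THREADS = 1024 * 1024
READS_PER_THREAD = 256
REPEATS = 30

ELAPSED_S = 12.617e-3   # measured time for the random RGBA8 buffer read
CLOCK_HZ = 1905e6       # ASSUMPTION: GPU stays at boost clock for the whole run
TMUS = 160              # TMU count from the spec listing

total_reads = THREADS * READS_PER_THREAD * REPEATS   # 8,053,063,680
reads_per_second = total_reads / ELAPSED_S
reads_per_clock = reads_per_second / CLOCK_HZ

# Same calculation with the read count rounded to 8 billion.
reads_per_clock_rounded = 8e9 / ELAPSED_S / CLOCK_HZ

print(f"total reads:           {total_reads:,}")
print(f"reads/clock (exact):   {reads_per_clock:.1f}")
print(f"reads/clock (rounded): {reads_per_clock_rounded:.1f}")
print(f"reads/clock per TMU:   {reads_per_clock / TMUS:.2f}")
```

With the exact read count the result is about 335 reads/clock, and about 333 when the count is rounded to 8 billion. Either way that is a little over 2 reads per clock per listed TMU, which is the figure I can't reconcile with either fill-rate number.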
Does anyone have any idea how this number could be inferred from the nominal GPU specs and architecture?