
PCI Express Throughput

Started December 20, 2016 04:26 PM
16 comments, last by Infinisearch 7 years, 4 months ago

I can think of two things regarding the uneven transfer bandwidth.

1. The texture might be in Morton order or tiled in some fashion, and might have to be untiled before being transferred.

2. There is some sort of arbiter that deprioritizes read accesses from the CPU to video memory. But since you aren't doing anything else at the time, why would it limit bandwidth?

-potential energy is easily made kinetic-

Your benchmark is very suspect, since you're using CPU timers to attempt to record operations that are happening asynchronously on the GPU. I would suggest you do the following:

1. Use D3D12 or Vulkan so that you have more explicit control over where your memory is allocated from and what operations are being performed on the GPU.

2. Use a copy command list to copy data from a linear buffer allocated in device memory to another linear buffer allocated in CPU-accessible memory (or vice versa). A copy command list will use the DMA engines, which on dedicated GPUs are designed to maximize PCI-e throughput. A linear buffer will let you avoid any overhead from tiled/swizzled texture layouts.

3. Use GPU timestamps to measure the amount of time taken to perform the copy. This will let you measure just the copy time and not the cost of other operations. Alternatively, you can use a tool like GPUView or Nvidia Nsight to get GPU timing data.
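
For reference, point 3 doesn't even require switching APIs; D3D11 timestamp queries bracketing the copy will do. A minimal sketch, assuming the device, immediate context, source resource and staging resource already exist (error handling omitted):

#include <d3d11.h>

// Measures GPU time spent on one CopyResource using timestamp queries.
// 'staging' would be a D3D11_USAGE_STAGING resource with CPU read access.
double MeasureCopyMs(ID3D11Device* device, ID3D11DeviceContext* context,
                     ID3D11Resource* src, ID3D11Resource* staging)
{
    D3D11_QUERY_DESC qd = {};
    qd.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;
    ID3D11Query* disjoint = nullptr;
    device->CreateQuery(&qd, &disjoint);

    qd.Query = D3D11_QUERY_TIMESTAMP;
    ID3D11Query* tsBegin = nullptr;
    ID3D11Query* tsEnd   = nullptr;
    device->CreateQuery(&qd, &tsBegin);
    device->CreateQuery(&qd, &tsEnd);

    context->Begin(disjoint);
    context->End(tsBegin);                // timestamp right before the copy
    context->CopyResource(staging, src);
    context->End(tsEnd);                  // timestamp right after the copy
    context->End(disjoint);

    // Busy-wait until the GPU has produced the results (fine for a benchmark).
    D3D11_QUERY_DATA_TIMESTAMP_DISJOINT dj = {};
    while (context->GetData(disjoint, &dj, sizeof(dj), 0) != S_OK) {}

    UINT64 t0 = 0, t1 = 0;
    context->GetData(tsBegin, &t0, sizeof(t0), 0);
    context->GetData(tsEnd,   &t1, sizeof(t1), 0);

    tsEnd->Release();
    tsBegin->Release();
    disjoint->Release();

    if (dj.Disjoint)
        return -1.0;                      // GPU clock changed mid-measurement, discard the sample

    return double(t1 - t0) * 1000.0 / double(dj.Frequency);
}

Spinning on GetData stalls the CPU, of course, but for a throughput benchmark that's fine; the measured interval itself comes from the GPU clock rather than from CPU timers.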


The benchmark you posted is flawed and will stall. Period.

You need to give the GPU time between the call to CopyResource and your Map. I'd suggest using 3 staging buffers: call CopyResource( stagingBuffer[frameCount % 3], ... ) and then call Map( stagingBuffer[(frameCount + 1) % 3] );

That is, this frame you will be mapping the texture whose copy you started two frames ago.
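
A minimal sketch of that ring, assuming the staging textures were created with D3D11_USAGE_STAGING and D3D11_CPU_ACCESS_READ (gStaging and gFrameCount are placeholder names):

#include <d3d11.h>

// Hypothetical triple-buffered readback: start a copy into one staging texture,
// then map the one whose copy was issued two frames ago so Map() no longer stalls.
ID3D11Texture2D* gStaging[3];
unsigned gFrameCount = 0;

void ReadbackFrame(ID3D11DeviceContext* ctx, ID3D11Texture2D* gpuTex)
{
    // Kick off this frame's copy; the GPU executes it asynchronously.
    ctx->CopyResource(gStaging[gFrameCount % 3], gpuTex);

    // Map the buffer copied two frames ago: (gFrameCount + 1) % 3 is the same
    // slot as (gFrameCount - 2) % 3. Skip the first two frames, which have no data yet.
    ID3D11Texture2D* ready = gStaging[(gFrameCount + 1) % 3];
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    if (gFrameCount >= 2 &&
        SUCCEEDED(ctx->Map(ready, 0, D3D11_MAP_READ, 0, &mapped)))
    {
        // ... read mapped.pData / mapped.RowPitch here ...
        ctx->Unmap(ready, 0);
    }

    ++gFrameCount;
}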

What you are measuring right now is the sum of: how long it takes for the CPU to ask the GPU to begin the copy, plus whatever tasks the GPU has pending before the copy, plus the time it takes for the GPU to transfer the data to the CPU (your CopyResource call), plus the time it takes for the CPU to copy it to another region of CPU memory (your memcpy).

Here (https://en.wikipedia.org/wiki/PCI_Express) is a nice table outlining speeds for PCI Express. I have a GeForce GTX 660 on a motherboard with PCI Express 3.0 x16. I made a test by writing a simple D3D11 app that downloads a 1920x1080x32 (8 MB) image from GPU to CPU. The whole operation takes 8 ms, which works out to around 1 GB per second, almost exactly the bandwidth of a single PCI Express 3.0 lane. Is this how it is supposed to work? Is it like all CopyResource/Map data goes through one of the 16 lanes?

The bus is not the limiting factor, not even remotely close. First of all, there are maximum bandwidths of the CPU, GPU, and RAM. Second, there's the question of who is actually doing the transfer and when. Is it a DMA operation? Is the driver buffering or doing prep work? That sort of thing. Third, 8 MB is a very small copy size to try to benchmark that bus with, so I would not consider your timing to be valid in the first place. Fourth, because you're taking CPU times before initiating and after completing the transfer, you're also capturing extra work happening inside the driver that deals with correcting data formats and layouts. Fifth, who said the driver wants to give you maximum bandwidth in the first place? It has other things going on, including the entire WDDM to manage.

That you got a number comparable to one lane is pure coincidence.

Now my tests show upload (CPU->GPU) 8 GB/s and download (GPU->CPU) 3 GB/s.

Again, the bus has jack all to do with these speeds. What are the maximum bandwidths of the respective CPU and GPU memories? Both are DMA transfers, and the GPU may have much more capable DMA hardware than the CPU, especially since graphics memory bandwidth is so much higher than system memory bandwidth. Not to mention you're also capturing internal data format conversions.

SlimDX | Ventspace Blog | Twitter | Diverse teams make better games. I am currently hiring capable C++ engine developers in Baltimore, MD.

The bus is not the limiting factor, not even remotely close. First of all, there are maximum bandwidths of the CPU, GPU, and RAM. Second, there's the question of who is actually doing the transfer and when.

This!

Doing a transfer over PCIe is very much like reading data from disk. Once it actually happens, even a slow disk delivers over 100MB/s, but it takes some 8-10 milliseconds before the head has even moved to the correct track and the platter has spun far enough for the sector to be read.

Very similarly, the actual PCIe transfer happens with stunning speed, once it happens. But it may be an eternity before the GPU is switched from "render" to "transfer". Some GPUs can do both at the same time, but not all, and some have two controllers for simultaneous up/down transfers. Nvidia in particular did not support transfer during render prior to -- I believe -- Maxwell (could be wrong, could be Kepler?).
Note that PCIe uses the same lanes for data and control at the physical layer, so while a transfer (which is uninterruptible) is going on, it is impossible to even switch the GPU to something different. Plus, there is a non-trivial flow-control and transaction-layer protocol in place, which of course adds some latency.

So, it is very possible that a transfer operation does "nothing" for quite some time, and then suddenly happens blazingly fast, with a speed almost rivalling memcpy.

In addition to that, using GetTickCount for anything in the single-digit (or sub-) millisecond range is pretty much bound to fail anyway.
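
If you do want CPU-side timestamps, QueryPerformanceCounter is the usual replacement; a tiny sketch below (Sleep stands in for the work being timed). Keep in mind it still only measures when the API calls return, not when the GPU actually performs the copy:

#include <windows.h>
#include <cstdio>

int main()
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);   // ticks per second, fixed at boot

    QueryPerformanceCounter(&t0);
    Sleep(1);                           // stand-in for the work being timed
    QueryPerformanceCounter(&t1);

    double ms = double(t1.QuadPart - t0.QuadPart) * 1000.0 / double(freq.QuadPart);
    printf("%.3f ms\n", ms);
    return 0;
}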

Just wanted to let you know that I made a test with CUDA to measure memory transfer rate, and it peaked at around 12 GB/s.

Also, measuring CopyResource time with D3D11 queries results in very similar throughput.
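
For reference, that kind of CUDA test boils down to roughly the following (a sketch, not the exact test: pinned host memory plus cudaEvent timing, with an arbitrary 256 MB buffer):

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t bytes = 256u * 1024u * 1024u;  // large enough to hide launch overhead
    void* dev  = nullptr;
    void* host = nullptr;
    cudaMalloc(&dev, bytes);
    cudaMallocHost(&host, bytes);               // pinned memory, needed for peak PCIe rates

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Device->Host: %.2f GB/s\n", (bytes / 1.0e9) / (ms / 1000.0));

    cudaEventDestroy(stop);
    cudaEventDestroy(start);
    cudaFreeHost(host);
    cudaFree(dev);
    return 0;
}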


Potentially relevant to this topic:

https://forum.beyond3d.com/threads/efficient-memory-mapping-page-fault-performance-bugs.60401/#post-2003524

 

-potential energy is easily made kinetic-

