The bus is not the limiting factor, not even remotely close. First, the CPU, GPU, and RAM each have their own maximum bandwidths. Second, there's the question of who is actually doing the transfer, and when.
Doing a transfer over PCIe is very much like reading data from disk. Once the read actually happens, even a slow disk delivers over 100 MB/s, but it takes some 8-10 milliseconds before the head has even moved to the correct track and the platter has spun far enough for the sector to be read.
Very similarly, the actual PCIe transfer, once it does happen, happens with stunning speed. But it may be an eternity before the GPU is switched from "render" to "transfer". Some GPUs can do both at the same time, but not all, and some have two transfer controllers for simultaneous up/down transfers. Nvidia in particular did not support transfers during rendering prior to -- I believe -- Maxwell (could be wrong, could be Kepler?).
Note that PCIe uses the same lanes for data and control on the physical layer, so while a transfer (which is uninterruptible) is going on, it is impossible to even switch the GPU to something different. Plus, there is a non-trivial flow-control and transaction-layer protocol in place, which of course adds some latency.
So, it is very possible that a transfer operation does "nothing" for quite some time, and then suddenly happens blazingly fast, with a speed almost rivalling memcpy.
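This is exactly the behaviour that asynchronous readback APIs are built around. As a rough sketch, here is what that looks like with an OpenGL pixel buffer object (I'm assuming OpenGL here, plus a 3.2+ context with a loader such as glad already set up; the dimensions and the "render something else in between" part are placeholders): the call that starts the transfer returns almost immediately because it only queues the DMA, and whatever wait is left only shows up when you map the buffer.

```cpp
// Rough sketch: queue a readback into a pixel buffer object now, fetch the
// data later. Assumes a current OpenGL 3.2+ context and a loader (e.g. glad).
const int width = 1280, height = 720;   // placeholder dimensions

GLuint pbo;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glBufferData(GL_PIXEL_PACK_BUFFER, width * height * 4, nullptr, GL_STREAM_READ);

// With a pack PBO bound, glReadPixels takes a buffer offset instead of a
// pointer and returns almost immediately -- it only queues the transfer.
glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

// ... render a frame or two here so the transfer has time to complete ...

// Whatever wait is left gets paid here, not at the glReadPixels call above.
glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, GL_TIMEOUT_IGNORED);
glDeleteSync(fence);

void* pixels = glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0,
                                width * height * 4, GL_MAP_READ_BIT);
// ... use pixels ...
glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
```

With two such PBOs in flight (start the readback for frame N while mapping the one from frame N-1), the wait at the map usually disappears entirely.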
In addition to that, GetTickCount typically advances in 10-16 millisecond steps, so using it to measure something in the single-digit (or sub-) millisecond range is bound to fail anyway.
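If you need to time something at that scale on Windows, QueryPerformanceCounter is the usual tool. A minimal sketch (the call being measured is of course just a placeholder):

```cpp
#include <windows.h>
#include <cstdio>

static void do_the_transfer()
{
    // placeholder for whatever operation is being measured
}

int main()
{
    LARGE_INTEGER freq, start, end;
    QueryPerformanceFrequency(&freq);   // counter ticks per second, fixed at boot

    QueryPerformanceCounter(&start);
    do_the_transfer();
    QueryPerformanceCounter(&end);

    double ms = 1000.0 * double(end.QuadPart - start.QuadPart) / double(freq.QuadPart);
    std::printf("took %.3f ms\n", ms);  // microsecond-level resolution, unlike GetTickCount
    return 0;
}
```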