Calin said:
It doesn't confuse me, I think. It's useful to have a broad idea about what's hiding behind the `pipelining` word (introduction-level type of knowledge).
It's been about 7 years since I put it together, but this article should explain much of it.
From a high-level viewpoint you can treat the x86 processors of today exactly the same as they were treated in 1978. Conceptually the CPU reads one instruction at a time, steps to the next instruction, and completes them one after another.
What happens inside the box has advanced tremendously over the past 40 years. A single CPU core presents two faces, called Hyper-Threading or Simultaneous Multithreading depending on whether you use Intel's or AMD's branding, and can decode up to 8 instructions per cycle, four per interface. Internal pipelines have grown tremendously long, so an instruction can sit inside the box for ages.
Some minor updates from that article: internally, the current round of processors can start up to 6 micro-ops per clock cycle, shared between the two HT/SMT front ends, and retire up to 4 micro-ops per cycle. Sometimes instructions finish early, so in theory, after a bottleneck clears, it can appear as though up to 14 assembly instructions all complete in a single cycle.
From a black-box perspective, that means up to 4 instructions consumed in a single cycle and up to 14 instructions completed in a cycle. The black box always treats instructions sequentially, preserving the same observable ordering the instructions had four decades ago.
Clock cycle timings today mean something very different from what they meant 40 years ago, and the meaning has shifted gradually over time. Right now about 5GHz is the practical physical maximum. When the timer signals a clock tick, the electric signal barely has time to spread across the chip before the next tick arrives. If you could watch it as a wave spreading across the chip, the wave would only be about ⅔ of the way across a big multicore die before the next signal is sent. Some instructions take several CPU cycles simply because that is how long it takes the signal to physically cross the chip. Coupled with the effects of the out-of-order core, instruction timings and pipeline flows aren't nearly as useful to understand as they once were.
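The back-of-the-envelope numbers are easy to check. Here's a rough sketch in Python; the 5GHz figure is from above, while the signal-speed fraction is an illustrative assumption (real on-chip propagation is dominated by RC wire delays, so this is optimistic):

```python
# How far can a signal travel in one clock cycle at ~5 GHz?
# The 0.5 fraction of light speed is an assumed, optimistic figure.

SPEED_OF_LIGHT_M_S = 3.0e8   # speed of light in vacuum, m/s
CLOCK_HZ = 5.0e9             # ~5 GHz, the rough ceiling cited above
SIGNAL_FRACTION = 0.5        # assumed fraction of c for on-chip signals

cycle_seconds = 1.0 / CLOCK_HZ                                  # 0.2 ns
distance_mm = SPEED_OF_LIGHT_M_S * SIGNAL_FRACTION * cycle_seconds * 1000

print(f"One cycle lasts {cycle_seconds * 1e12:.0f} ps")
print(f"A signal covers roughly {distance_mm:.0f} mm per cycle")
```

With these assumptions a signal covers on the order of 30 mm per cycle, which is in the same ballpark as the width of a large multicore die, so the "wave partway across the chip" picture is plausible.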
Two and three decades ago tremendous effort was spent organizing code to minimize pipeline bubbles, and CPU instruction timings were important to optimizers. These days the amortized cost of most instructions is effectively zero across the pipeline; instead of waiting for instructions to finish, the big bottleneck today is keeping the CPU fed with data and instructions. The biggest step between CPU generations for about 15 years has not been clock speed but the size of on-die caches: L1, L2, and now massive L3. Some of today's premium chips have 64MB of L3 cache on die to help keep the CPU fed with data.
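You can see the "keep the caches fed" effect even from a high-level language. A minimal sketch (names and sizes are just for illustration): both loops do identical arithmetic over the same matrix, but one walks memory sequentially while the other strides across rows, defeating spatial locality. In CPython the gap is muted by interpreter overhead; in C or Rust the same pattern is far more dramatic.

```python
# Same arithmetic, different memory access order: the cache-friendly
# traversal tends to run faster even though the instruction count is equal.
import time

N = 1024
matrix = [[1] * N for _ in range(N)]

def sum_row_major(m):
    # consecutive elements of a row sit next to each other in memory
    total = 0
    for row in m:
        for value in row:
            total += value
    return total

def sum_column_major(m):
    # jumps to a different row on every access
    total = 0
    for col in range(N):
        for row in range(N):
            total += m[row][col]
    return total

for fn in (sum_row_major, sum_column_major):
    start = time.perf_counter()
    result = fn(matrix)
    elapsed = time.perf_counter() - start
    print(f"{fn.__name__}: {result} in {elapsed * 1000:.1f} ms")
```

Both functions return the same sum; only the order in which memory is touched changes, which is exactly the kind of difference that dominates performance on modern hardware.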