
ASM program flow

Started by October 16, 2020 04:28 PM
18 comments, last by Gnollrunner 4 years ago

Calin said:

It doesn't confuse me, I think. It's useful to have a broad idea about what's hiding behind the 'pipelining' word (introduction-level type of knowledge).

It's been about 7 years since I put it together, but this article should explain much of it.

From a high-level viewpoint you can treat the x86 processors of today exactly as they were treated in 1978. Conceptually you can treat the CPU as reading one instruction at a time, then stepping to the next, completing them one after another.

What happens inside the box has advanced tremendously over the past 40 years. A single CPU core presents two faces, called Hyper-Threading or Simultaneous Multithreading depending on whether you use Intel's or AMD's branding, and can decode up to 8 instructions per cycle, four per interface. Internal pipelines have grown tremendously long, so an instruction can sit inside the box for ages.

Some minor updates from that article: internally, the current round of processors can start up to 6 micro-ops per clock cycle, shared between the two HT/SMT front ends, and retire up to 4 micro-ops per cycle. Sometimes instructions finish early, so in theory, after a bottleneck, it can appear as though up to 14 assembly instructions all complete in a single cycle.

From a black-box perspective, that means up to 4 instructions consumed in a single cycle and up to 14 instructions completed in a cycle. The black box still treats instructions sequentially, running in the same observable order as instructions ran four decades ago.
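To make that black-box idea concrete, here's a minimal C sketch (my own illustration, not from the linked article; the function names are made up). Both functions compute the same total, but the second hands the out-of-order core four independent addition chains it can keep in flight at once, while the observable result is still exactly what purely sequential execution would produce:

```c
#include <stddef.h>

/* One long dependency chain: every addition waits on the previous one,
 * so the superscalar core has little independent work to overlap. */
long long sum_serial(const int *a, size_t n)
{
    long long sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}

/* Four independent accumulators: the additions within one iteration don't
 * depend on each other, so several can be in flight in the same cycle.
 * With integer math the final result is identical to sum_serial(). */
long long sum_unrolled(const int *a, size_t n)
{
    long long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; ++i)      /* leftover elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```

Whether the difference is measurable depends on the compiler and the CPU, but the point stands: the hardware exploits independent work wherever it can find it, and the program can't tell.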

Clock cycle timings carry a different significance today than they did 40 years ago, and that has shifted gradually over time. Right now about 5 GHz is the physical maximum. When the timer signals a clock tick, the electrical signal barely has time to spread across the chip before the next tick arrives. If you could see it as a wave spreading across the entire chip, the wave would only be about ⅔ of the way across a big multicore chip before the next signal is sent. Some instructions take several CPU cycles simply because that is how long it takes for the signal to physically travel across the chip. Coupled with the effects of the out-of-order core, instruction timings and pipeline flows aren't nearly as useful to understand as they once were.

Two or three decades ago tremendous effort was spent organizing code to minimize pipeline bubbles, and CPU instruction timings were important to optimizers. These days the amortized cost over the pipeline is generally zero, with the bottlenecks coming from keeping the caches loaded rather than from execution time. Instead of waiting for instructions to finish, the big bottleneck today is keeping the CPU fed with data and instructions. The biggest step between CPU generations for about 15 years has not been CPU speed but the size of the on-die caches, with L1, L2, and now massive L3 caches. Some of today's premium chips have 64 MB of L3 cache on die to help keep the CPU fed with data.
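To illustrate the "keeping the CPU fed" point, here's another small C sketch of my own (not from the post). Both functions perform exactly the same additions, but the first walks memory in the order it is laid out and makes full use of each cache line, while the second strides a whole row ahead on every access and spends its time waiting on memory rather than on the arithmetic:

```c
#include <stddef.h>

#define ROWS 4096
#define COLS 4096   /* 4096 ints = 16 KB per row */

/* Row-major walk: consecutive accesses hit consecutive addresses, so every
 * cache line pulled in from memory is used completely. */
long long sum_by_rows(const int m[ROWS][COLS])
{
    long long sum = 0;
    for (size_t r = 0; r < ROWS; ++r)
        for (size_t c = 0; c < COLS; ++c)
            sum += m[r][c];
    return sum;
}

/* Column-major walk over the same data: each access lands 16 KB away from
 * the previous one, so nearly every access misses the cache and the loop is
 * limited by memory latency, not by the additions themselves. */
long long sum_by_cols(const int m[ROWS][COLS])
{
    long long sum = 0;
    for (size_t c = 0; c < COLS; ++c)
        for (size_t r = 0; r < ROWS; ++r)
            sum += m[r][c];
    return sum;
}
```

Same instruction count either way; the difference is entirely about whether the caches can keep up.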

Thanks to all of you. I'm not looking to break things down to machine code. What you guys told me makes a good start for someone at my level, I would guess.

My project's facebook page is “DreamLand Page”


Just remember the general principle that the CPU pretends to execute code serially, with one program counter and a single set of registers, but it actually does whatever it takes (speculative execution, superscalar execution, pipelining, extra registers, emulation, etc.) to perform more work in less time. If you can tell the difference, it's a serious bug.

Omae Wa Mou Shindeiru

Hey LorenzoGatti, I've noticed you've been posting quite often to this forum. I have nothing against you, but dealing with too many 'seniors' is confusing. I've had the chance to meet and exchange ideas with a lot of awesome people on this forum. Every person I get to interact with becomes a friend. I'm not American, I'm not even from the Western world. I can't have as many friends here as a Westerner could. I don't mind having a ton of friends, but it's not realistic. I got to know a lot of people, but I'm at a point where I can't add more friends to those I currently have. If you were some casual person things would have been different.

My project's facebook page is “DreamLand Page”

🙂🙂🙂🙂🙂 ← The tone posse, ready for action.

ddlox said:
were executed in parallel at the same time because they did not have dependencies on each other: this process was called pipelining and

Actually, that is called superscalar execution. In x86 land it was introduced with the original Pentium, and IIRC the pipelines were referred to as the U and V pipes (I'd have to look it up, it's been a while), and there were 'interlocks' that had to do with invalid instruction pairings. Abrash's Zen of Code Optimization (or is it the Black Book?) is floating around the web for perusal and download… You might be able to follow along, @calin, so you should take a look; I think gamedev.net has it, or at least a link to it. Also, though I haven't read it myself, this might be a good free introduction: https://en.wikibooks.org/wiki/X86_Assembly

-potential energy is easily made kinetic-


ddlox said:

then when RISC CPUs were released starting with pentiums (if i remember well),

One minor comment. Pentiums were never really considered RISC. I worked at Intel at the time, and there was always talk about the competition between RISC and “our” CISC processors, and who would come out on top. As near as I can tell this distinction isn't there anymore anyway.

Gnollrunner said:
One minor comment. Pentiums were never really considered RISC.

That's because they weren't; they were 100% CISC because they implement the x86 instruction set.

Gnollrunner said:
I worked at Intel at the time, and there was always talk about the competition between RISC and “our” CISC processors, and who would come out on top. As near as I can tell this distinction isn't there anymore anyway.

Are you saying that because Intel doesn't have any non-x86 competition anymore? Or are you saying there's no distinction between CISC and RISC, or at least no distinction between their performance or maybe their underlying implementations?

-potential energy is easily made kinetic-

@Infinisearch As I'm sure you know, RISC stands for Reduced Instruction Set Computer, but it also came with an understanding of some level of pipelined execution and fixed instruction length. CISC stands for Complex Instruction Set Computer. Nowadays CISC processors have a lot of pipelining and RISC processors have larger instruction sets (sometimes larger than CISC processors). Some RISC processors even have variable-length instructions. In any case RISC vs CISC doesn't seem to be as big a deal as it once was.

