
Are these ancient optimizations still relevant?

Started by December 28, 2020 10:02 PM
3 comments, last by ddlox 3 years, 11 months ago

I'm in the process of writing a software raycasting renderer. I actually did this a very long time ago on a 386. Back in the day, in order to get a good frame rate, I had to incorporate several programming tricks, and I'm wondering if they are still relevant. Even if they are, do modern compilers simply do them for me? And what about inline assembly: does that also get reordered?

  1. Alternating 32- and 16-bit instructions. The 386 introduced 32-bit registers, but maintained backwards compatibility by having the lower half act as a 16-bit register. For multi-clock-cycle instructions, the CPU could overlap them, so you'd write code like ADD EAX, EBX; SUB CX, DX; …
  2. Alternating int and float math. Most machines at the time did floating-point math in a coprocessor, which ran in parallel with the main CPU. By alternating two int instructions with one float instruction, you'd get essentially parallel processing.
  3. Jumping on the less likely branch. The compiler I used would turn if(…)then(x)else(y) into cmp, jne, x…, with the jump going to the else. So you'd always arrange for the less likely code to be the else, letting the most likely code fall through.

With the improvements of branch prediction and pipelining in the last several generations of chips, are any of these still relevant?

Wallace121215@gmail.com said:
are any of these still relevant?

None of those are specifically still relevant, but there are other considerations.

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]


Those specific optimizations are no longer relevant, but the hardware effects and potential optimizations “in the same spirit” are still there:

  1. and 2. CPUs are now much more pipelined and out-of-order than they were back then. Agner Fog maintains excellent hardware documents that describe the pipelining architectures of different generations. See https://www.agner.org/optimize/microarchitecture.pdf . E.g. page 217 states that the Zen generation has 4 integer and 4 floating-point pipes. However, whether your code runs into a pipelining bottleneck is not at all as obvious, and not at all as likely to happen, as it used to be back in the Pentium U- and V-pipe days. Rather, data cache line access patterns, maximizing ALU throughput via SIMD, and dynamic branch prediction failures are the main effects you'll be likely to see in relatively tight code.
  3. The need for that kind of scheming was common when CPUs only had “static branch prediction”: they might e.g. assume that jumps backward are always taken and jumps forward are never taken, and statically predict the outcome from program structure. Modern CPUs employ dynamic branch predictors, i.e. they maintain an internal mapping table of “code address” → “% times taken” history, and predict according to the majority vote of that history. This means that if a branch goes the same way most of the time, it will be very cheap. However, if a branch goes 50%-50% either way, it will be very costly, and it is useful to avoid such branches in hot loops whenever you can. This kind of optimization of course implies that your code is already tight (and hot) enough that the effects are observable. Intel's VTune and AMD's uProf can both record branch misprediction percentages, which highlight the branches that were slow to predict.

Wallace121215@gmail.com said:
…does that also get reordered?

short answer is yes;

you didn't mention what language you'll be coding in… but seeing your short ASM snippet, i assume C/C++;

in which case grab the latest book on C/C++ and code away;

if you hit some slow code section, then profile that section and improve it (if you can, you can even improve it with SIMD instructions);

anyway, it's a long haul, so be patient with yourself;

all the best;

