Advertisement

Which is faster

Started by October 27, 2024 08:39 AM
8 comments, last by frob 2 months, 2 weeks ago

Which takes less CPU time? Computing distance using the Pitagora theorem or making a point vs square collision detection check. The later is four comparisons the former is two times multiplication one addition and one time extracting the square root.

My project`s facebook page is “DreamLand Page”

This is something that needs to be benchmarked/measured. CPUs today are too complex to make general assumptions.

However, one thing about distance-calculation is that taking the square-root in itself can be pretty slow, not as slow as division, but pretty similar. So one optimization for comparing distances is not taking the square-root, but only calculating the squared distance and comparing those:

float Vector3::SquaredLength(void) const noexcept
{
   return x*x + y*y + z*z;
}

constexpr float RANGE = 100.0f;
constexpr float SQUARED_RANGE = RANGE * RANGE;

const bool inRange = vDistance.SquaredLength() <= SQUARED_RANGE;
// is equivalent to:
const bool inRange = vDistance.Length() <= RANGE;

This can be used as long as you only need to know whether a distance is below or above a certain value, and don't need to do any % calculation, or need to know the absolute value. Then, the work the CPU has to do is just some multiply and adds.

Advertisement

You would need to measure your code to see exactly what it is doing. The square root itself is slow but it doesn't necessarily show up in profiling any more. Don't do the square root if you don't need it, it's extra processing, but the impact is less than many years ago.

With the out-of-order core the measurements will be more complicated to understand, since the core can potentially do 13+ operations per cycle, it will depend on what other code is around the math.

Square root is a much more complex operation and internally takes several cycles to complete. Because of the complexity and the rarity of the instruction only one execution port has the hardware. The multiply is simpler and the comparison is simpler still. I would have to check and verify, but I think it was 3 ports that could do both, could be 3 and 2 though, so each is more likely to move through the out-of-order core more quickly.

Operations are retired in order, so depending on the context of what code is around it and what the ALU is doing internally at the time, code execution often looks like many operations are “free” because they are done instantly when the slow operation is finished. The processor is basically able to do a bunch of easy stuff while waiting for the slow operation to finish.

The result is that even slow instructions can have minimal stalling of the processor, that's the superpower of modern processors. Because the core isn't really blocked, only one of the execution ports is busy and the rest are able to do other work, the time gets masked through the magic of parallel processing.

I see, it's a tricky problem. It beats my level of understanding. What I do understand is that I need to use a timer to know for sure which is faster. Thank you for your feedback.

My project`s facebook page is “DreamLand Page”

The tool you want is a profiler, it's not something you quickly write up on your own.

That's extra important if it “beats your level of understanding”. There is an awful lot to implement to get meaningful numbers. Something you do yourself is more likely to discover details of the operating system like the system scheduler running and pausing your process or the work the system otherwise does managing your work thread than it is to discover actual performance bottlenecks.

Pix for Windows is free, available from Microsoft and probably the most widely used and comprehensive. VTune has a long history and can pull all kinds of useful metrics from Intel chips, but focuses entirely on Intel's systems. Orbit Profiler is an up-and-coming that includes a lot of options and details. Unreal has built-in profiling tools that are rather comprehensive, if you're using that engine. There have been many other good ones through history but some like AMD's tools are no longer maintained.

To understand it you, need to understand how modern processors work. Trying to benchmark what a single instruction does by itself has been meaningless since about 1997.

Details depend on which chip you get, but you might be able to go through 9 instructions retired per cycle. Or it might perform just a portion of a single operation, not having any operation marked as retired or “done” at all that tick.

To give numbers, internally a processor can have several hundred micro-operations in progress. The biggest capacity of the Ryzen Zen 5 series chips and the biggest of the Intel 14xxx Ultra chips can hold over 400 operations in process at once in each of their cores. Internally a single clock tick might perform 10 integer operations, 16 floating point operations, and mark a half dozen CPU instruction as “done”.

So while one execution port inside a single CPU might spend 10 or more CPU ticks churning through the square root operation, it can work on all the other code to complete 20 or 30 other instructions. When it finally gets around to declaring the square root is done it can simultaneously mark a bunch of other work also done, effectively done for free because it was hidden in the time spent doing the square root.

It gets even more complex across processes. You might be doing a matrix transformation involving a bunch of sin() and cos() operations, that completely bog down the one execution port that's able to do the floating point transcendental math. But the physical CPU has two sets of decoders, so it can simultaneously be running another thread inside the core that might be processing the inventory in your game, using all the other execution ports in the same core to churn through that work simultaneously. Even though one thread is bottlenecked through transcendental math operations, another thread ON THE SAME CPU can be flying through integer processing as fast as the CPU can load the data. So your primitive timer measuring some operations will not get the big picture view of the larger amounts of work getting done.

That's only one part, but the simple answer is that anything performance related is going to be complicated on modern computers, and the only way to get any idea what is going on is to use a profiler to get the bigger picture.

For a bit of understanding, I wrote this article for the site about a decade ago. The numbers on the individual processors have grown and small details expanded, but the overall journey through the CPU core hasn't changed. Decoders are bigger, reorder buffers are bigger, there are more execution ports, etc., but it can probably help you understand why a single number would be almost meaningless.

That's extra important if it “beats your level of understanding”

Sorry, that was bad wording on my behalf, beyond my level of understanding is what I meant. I`ll keep in mind your observations.

My project`s facebook page is “DreamLand Page”

Advertisement

frob said:
To give numbers, internally a processor can have several hundred micro-operations in progress. The biggest capacity of the Ryzen Zen 5 series chips and the biggest of the Intel 14xxx Ultra chips can hold over 400 operations in process at once in each of their cores.

What do CPUs about registers for those 400 ops?
Is there something like a big register file to hold duplicates of the few registers exposed by the instruction set?

JoeJ said:
Is there something like a big register file to hold duplicates of the few registers exposed by the instruction set?

A CPU already has wastly more registers than the register-set exposes. The registers that the ASM specifies are mostly meaningless, except for those with a set meaning (RSP). When translating to the actual micro-code on the CPU, it will rename registers from the ASM-set (ie. RCX), to whatever is available (ie. CPU Register 56), This is called “register renaming”. It will of course have to make sure that all future references of the same ASM register get the same value, but it will not have to wait for a previous operation to complete, to be able to reuse that same ASM register! Take the following asm code as an example:

	xor         eax,eax
	cmp         byte ptr [rip+StaticC],0
	mov         ecx,1
	cmovne      ecx,eax
	test        cl,cl
	mov         ecx,0Ah
	mov         eax,5
	cmovne      eax,ecx
	mov         dword ptr [rip+StaticI],eax
	ret
	
	// represents the unoptimized version of
    StaticI = (global ? false: true) ? 10 : 5;

Here, we have to distinct operations on “ecx”, one leads to a “test” on the result of a conditional operation. Right after, we store a temporary into RCX. This second part of the operation can already run, before the first completes, as the CPU can detect that RCX is overwritten fully (as 32-bit operations clear the full 64 bit register), thus there is no dependency on the previous value. This will lead to both instances of “ecx” using different internal registers. This is actually pretty nice for compiler-writers like myself, since I do not need to bother with manually assigning different registers to adjacent operations to ensure parallelism… I can just use whatever register uses the smallest encoding and is available, and the CPU will take care of that itself.
In fact, since we do not use branches here, the CPU will be able to run that full code, without waiting for anything to complete, until the last instruction… and I'm not even sure it has to wait there (just assuming it has due to needing to assign to a memory location).

That's the gist of it. Actually, a lot of assembly-code is being held back by too little registers being available compared to what CPUs have. Outside of temporaries, it would be nice to have more locals fit into registers. That's why Intel added an APX-extension (not to be confused with AVX) two years ago, which exposes an additional 16 registers to the x86_64 instruction-set. Only downside is it will take a while until this becomes mainstream, so unless you have a JIT-compiler, you cannot rely on that.

JoeJ said:
What do CPUs about registers for those 400 ops? Is there something like a big register file to hold duplicates of the few registers exposed by the instruction set?

Yes, that's a part of it. It's covered briefly in the article I wrote a decade ago linked above.

The short form is that the 16 registers in the instruction set (ax, bx, cx, dx, sp, si, and so on) go through a renaming step for internal processing. Internally the renamed variables can be processed in any order, and they can be mapped to the value after completed operations, like “this register is the value of rax after the result of instruction 17”. The register renaming allows all kinds of improvements for dynamic scheduling, giving a massive boost to computer performance. Out-of-order execution is a big one, there are a bunch of execution ports inside the CPU and they all do different things; a execution port can see an operation that is five or ten instructions later but is ready to be worked on immediately, so it can work on it early and out of order. Another is speculative execution, if there's a condition if (x) { } else { } then execution ports can run both branches whenever it gets time, and once it finally knows what x is it can keep one set and discard the other. Another benefit of the huge buffer allows for prefetch and for caching of known results as well since the pipeline is so long; by the time it gets in line to actually compute the value, the memory address will be fully loaded.

By mapping a bunch of internal memory for internal CPU variables to what the instruction set sees as the 16 registers, every CPU can have it's own internal optimizations to make code run faster without modifying the program. It accounts for budget, those who can afford expensive computers can take advantage of all the extra internal CPU memory, those on a bare-bones processor get theirs without the performance boost but still run the same program. Similarly a low-power chip and a high power chip, both from the same program. It allows Intel and AMD and ARM and other vendors to make their own competing chips with different internal optimization choices. The program written 10 years ago can see speedups on a newer processor thanks to an optimization applied dynamically.

Here is the Wikipedia page with a pretty good and reasonably brief description of how it works, it gets complicated quickly. It has links to deeper articles if you want to dig deeper into CPU architecture.

This topic is closed to new replies.

Advertisement