VoxycDev said:
bzt said:
Recently I keep seeing this nonsense. Can you prove it? Is there anything to back up that claim?
No, I can't prove it. This was the Stack Overflow consensus a while back when I contemplated getting into assembly. I really wanted an excuse to re-learn it and use it to optimize something cool, but googling this completely discouraged me. I hope you're right and it's not true. As a kid I remember looking through C code, finding assembly blocks, and thinking “crap! black magic… wish I could do this.” What would be a compelling case for doing this today? Since it's CPU-dependent, you'd still have to have a fallback slow C/C++ version in case the CPU doesn't match. But I'd love to optimize something critical with it, such as line of sight, collision, occlusion detection, or voxel mesh generation. If I took the machine code for these functions, would there be a way to do this humanly? I guess I'd have to find the bottlenecks and remove them, right?
No, it's not black magic :-) As for the Stack Overflow consensus, I can only say what Euclid said thousands of years ago: “Never believe anything just because someone said so.” The key is to get empirical proof for yourself: do profiling.
About your question, the best way to do this humanly and optimize would be to start with C/C++. Use its optimizer, then disassemble the function. This is the easy part. Then read all the CPU specs, learn all the instructions, read all the manufacturer's notes on optimization, and finally find faster and more compact solutions, replacing the instructions in the disassembly one by one. You'll be surprised how many things in a -O3 disassembly can easily be replaced with a more efficient instruction! But for that, of course, you first have to learn which instructions are at your disposal.
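A small sketch of the starting point for that workflow (hypothetical function name, my own example): write the hot routine in plain C, let the optimizer have it, then study what it emitted.

```c
/* Write the hot routine in C first; the function name is made up for
   illustration.  Then emit and inspect the compiler's assembly with, e.g.:
 *
 *     gcc -O3 -S los.c                     (produces los.s)
 *     gcc -O3 -c los.c && objdump -d los.o (disassembles the object file)
 *
 * and only then start replacing instructions in the .s file one by one,
 * re-measuring after each change. */
int line_of_sight_dot(const int *a, const int *b, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i] * b[i];   /* at -O3 the compiler will typically vectorize this */
    return s;
}
```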
About being CPU-dependent, that's true. Having a C fallback is not a must, but it's certainly a desirable, good-to-have alternative. Keep in mind that CPU architectures are going extinct, so these days there are basically only two CPU families to code for: Intel and ARM. The overall market share of the rest (MIPS, SPARC, PowerPC etc.) is insignificant and shrinking rapidly. The good news is that both Intel and ARM have intrinsics interfaces, so you don't have to go down to the assembly level if you don't want to (although ARM's is far less mature than Intel's, it's getting there).
Juliean said:
Not only does the compiler do a lot of neat things that are faster than what you can do by hand
Then why is pixman written in assembly and with intrinsics, if not for speed? Please show me a compiler-optimized pure C/C++ version of the linked pixman-ssse3.c that is faster than the current, hand-optimized version. If you do, you'll have proof and I'll admit I was mistaken. Until then, no video can tell you whether an actual implementation is better or not!
Juliean said:
In any case, you can use https://godbolt.org/ to see if your assembler/intrinsics are really doing anything. After watching the video (and trying some things out on the site), you'll see that it's not as much as you might think.
I don't “might think”, I always measure. I know exactly how to do profiling, thank you. That's why I'm 100% certain that my manually optimized code is far better than any compiler-optimized version. I'll give you that I'm not the general case: I have many decades of experience with low-level and bare-metal programming, and an average C++ programmer can't do what I can (and unlike me, they would probably watch the video instead of profiling the actual code ;-) ).
Cheers,
bzt