If you're learning, start single-threaded. If you're building something big, though, start multi-threaded from the beginning, because retrofitting multithreading is really hard and likely to result in hard-to-find race conditions.
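To make that concrete, here is a minimal sketch (names like Framebuffer and renderBand are just assumptions for illustration, not anyone's actual code) of one common way to be multi-threaded from day one: give each thread its own horizontal band of the framebuffer so the hot loop needs no locks.

#include <algorithm>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

struct Framebuffer {
    int width = 0, height = 0;
    std::vector<uint32_t> pixels;   // 32-bit RGBA, width * height entries
};

// Hypothetical per-band worker: a real one would rasterize only the triangles
// overlapping rows [y0, y1). Here it just clears its band.
void renderBand(Framebuffer& fb, int y0, int y1) {
    for (int y = y0; y < y1; ++y)
        for (int x = 0; x < fb.width; ++x)
            fb.pixels[size_t(y) * fb.width + x] = 0xFF000000u;
}

// Each thread owns one horizontal band, so no two threads ever write the
// same pixel and no locks are needed inside the per-pixel loop.
void renderFrame(Framebuffer& fb) {
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    int rowsPerBand = (fb.height + int(n) - 1) / int(n);
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i) {
        int y0 = int(i) * rowsPerBand;
        int y1 = std::min(fb.height, y0 + rowsPerBand);
        if (y0 >= y1) break;
        workers.emplace_back(renderBand, std::ref(fb), y0, y1);
    }
    for (auto& t : workers) t.join();   // only synchronization point per frame
}

Because the bands never overlap, there is no shared write target to race on; the only synchronization is the join at the end of the frame.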
Software rendering resources and optimization
I developed 3D software engines in the '90s, then moved on to DX/OpenGL/Vulkan.
I want to learn 2D/3D software rendering.
My problem now is how to optimize this small lib's algorithms before going 3D.
But I'm asking for SIMD and multithreading resources to utilise newer tech and benefit from it, in addition to making the software renderer.
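As a taste of what SIMD can buy in a pixel loop, here is a small hedged example (the function name and buffer layout are assumptions made for illustration): a saturating brighten pass over a 32-bit RGBA framebuffer, processing 16 bytes, i.e. four pixels, per SSE2 instruction.

#include <cstddef>
#include <cstdint>
#include <emmintrin.h>   // SSE2 intrinsics

void brighten_sse2(uint8_t* rgba, size_t byteCount, uint8_t amount) {
    __m128i add = _mm_set1_epi8(char(amount));
    size_t i = 0;
    // Main SIMD loop: 16 bytes per iteration, saturating so channels clamp at 255.
    for (; i + 16 <= byteCount; i += 16) {
        __m128i px = _mm_loadu_si128(reinterpret_cast<const __m128i*>(rgba + i));
        px = _mm_adds_epu8(px, add);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(rgba + i), px);
    }
    // Scalar tail for buffers that aren't a multiple of 16 bytes.
    for (; i < byteCount; ++i) {
        unsigned v = rgba[i] + amount;
        rgba[i] = uint8_t(v > 255 ? 255 : v);
    }
}

The same pattern (wide load, wide arithmetic, wide store, scalar tail) carries over to most per-pixel passes; note this version also brightens the alpha channel, which a real pass would probably mask out.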
If you want to learn 3D software rendering, don't even think about optimisations at the start. Writing a software engine is very difficult, so I would focus on that first. If you ever get it rendering 3D models + materials + lighting + depth buffering + cameras + clipping etc., then you can start profiling the code to see where the bottlenecks are.
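A minimal sketch of that profiling step, assuming a hypothetical drawFrame() entry point: time whole passes with std::chrono first, and only reach for a sampling profiler (perf, VTune, etc.) once you know which pass is slow.

#include <chrono>
#include <cstdio>

void drawFrame();   // hypothetical: the renderer's per-frame entry point

void profileFrame() {
    using clock = std::chrono::steady_clock;
    auto t0 = clock::now();
    drawFrame();
    auto t1 = clock::now();
    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("frame took %.3f ms\n", ms);   // coarse, but enough to rank passes
}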
If you get to the optimisation stage, I would read Michael Abrash's Graphics Programming Black Book (Quake) before doing anything.
@JoeJ two things gave this away. The first was when I switched to framebuffer code that was far simpler and used far fewer cycles (pixels aligned to 32 bits instead of the 24-bit bitwise magic). I expected some speed gain from the simpler pixel code, but instead the speed fell by 15-20% or so, because a larger memory area was being read and written.
Of course, I reverted to the original algorithm after this.
The second was when I later bought a CPU with less than half the total cache but about the same clock speed and roughly the same architecture (2 GHz with 3 MB L2 + 6 MB L3, vs. 2.1 GHz with 2 MB L2 + 2 MB L3). The smaller cache alone resulted in a massive ~50% speed drop.
So basically, the simplest way to find this out is to buy CPUs with various cache sizes from the same architecture, or to adjust the code for bigger/smaller memory pressure vs. simpler/more complex code and see which direction the speed moves; that tells you whether the cache is a bottleneck or not. For me, going mostly with the tighter memory representations and hammering the ALU a bit more gave the best overall result in this scenario, in most cases.
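As a hedged illustration of that trade-off (layout and function names are made up for the example), compare reading a padded 32-bit pixel against a packed 24-bit one: the packed version spends a few extra ALU ops on shifts and ORs, but shrinks the buffer, and therefore the cache footprint, by 25%.

#include <cstdint>
#include <cstring>

// Padded layout: 4 bytes per pixel, trivial addressing, bigger cache footprint.
inline uint32_t read_rgba32(const uint8_t* buf, int i) {
    uint32_t v;
    std::memcpy(&v, buf + i * 4, 4);
    return v;
}

// Packed layout: 3 bytes per pixel, 25% less memory traffic, a few more ops.
inline uint32_t read_rgb24(const uint8_t* buf, int i) {
    const uint8_t* p = buf + i * 3;
    return uint32_t(p[0]) | (uint32_t(p[1]) << 8) | (uint32_t(p[2]) << 16)
         | 0xFF000000u;   // force opaque alpha so both layouts yield full RGBA
}

Timing a full pass over the same image with both layouts is essentially the experiment described above, without having to buy a second CPU.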