Juliean said:
which completely fucks speculative execution and prefetching.
Which raises the question of what's more fucked: current CPUs relying on speculative execution (in other words, accelerating only trivial work while wasting power on executing stuff that isn't used), or computer science aiming to reduce work to what's actually needed? That's really why i like GPUs better. Having more cores makes more sense than trying to increase single-core perf. with speculative execution and prefetching.
Juliean said:
we might be talking about a constant factor of up to “times 300
It's not that bad. You miss a cache line, but not the entire cache hierarchy. Just try to keep your random access addresses close together and minimize big jumps.
And with hyper-threading the CPU can switch to the other thread while waiting on a memory access. (At least i assume CPUs do this as well, not just GPUs.)
Finally, if you manage to have sequential access for a big workload, at some point caches become pointless and main memory bandwidth becomes the limit, which is really tight on CPU.
One and a half cores are enough to saturate bandwidth on my machine, so with such a workload we negate the multi-core advantage as well. That's far from efficient.
The conclusion is: minimize memory access, increase ALU work. And both of those goals lead away from brute-force processing.
Juliean said:
But you are essentially using a list for the node-elements, because you want to avoid allocations and copying objects inside vectors, right?
Yes, at least for my example about grid cells pointing to lists of objects. The primary goals are avoiding the O(n^2) cost of collision detection and avoiding dynamic allocations.
I'm not sure if dynamic allocations are still that costly today, but >10 years ago they were very bad, causing unpredictable runtime spikes. Plus, on GPU we do not even have this option at all. So i have learned to design around dynamic allocation, and i stick to that.
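To make that concrete, here's a minimal sketch of what i mean by grid cells pointing to lists of objects without dynamic allocation, not code from any particular engine: each cell stores the index of its first object, and each object stores the index of the next object in the same cell, so the 'lists' live entirely inside preallocated arrays and nothing gets allocated per frame. All names and constants (GRID_RES, CELL_SIZE, the 2D layout) are made up for the example.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

constexpr int     GRID_RES  = 64;    // cells per axis (made-up value)
constexpr float   CELL_SIZE = 1.0f;  // world units per cell (made-up value)
constexpr int32_t NO_OBJECT = -1;

struct Object {
    float   x = 0, y = 0;            // position
    int32_t nextInCell = NO_OBJECT;  // intrusive link: index of the next object in the same cell
};

struct Grid {
    int32_t firstObject[GRID_RES * GRID_RES]; // head of each cell's object list

    void clear() {
        for (int32_t& head : firstObject) head = NO_OBJECT;
    }

    static int cellIndex(float x, float y) {
        int cx = std::clamp(int(x / CELL_SIZE), 0, GRID_RES - 1);
        int cy = std::clamp(int(y / CELL_SIZE), 0, GRID_RES - 1);
        return cy * GRID_RES + cx;
    }

    // Rebuild each frame: push every object to the front of its cell's list.
    // No allocations happen here; the links live inside the objects themselves.
    void build(std::vector<Object>& objects) {
        clear();
        for (int32_t i = 0; i < (int32_t)objects.size(); ++i) {
            int cell = cellIndex(objects[i].x, objects[i].y);
            objects[i].nextInCell = firstObject[cell];
            firstObject[cell] = i;
        }
    }
};
```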
Remember that back then using the STL in games was just a no-go in general, for reasons like this.
If it works well enough today, then that can only have two reasons: the OS being faster at allocation, or devs being sloppy with optimization, giving us $2000 GPUs and making Jensen very happy.
And to justify the sloppiness, you bring up smart reasoning, like talking about spec. exec. and prefetching. But i don't buy it. That's just the usual game dev slang for ‘i don't have time to properly optimize every little bit.’ hihihi >:)
Juliean said:
Of course (I might sound like a broken records), this needs to be benchmarked
Yeah, but to do that, we need to implement both (actually many more than just two) methods so we can compare them.
But we don't have time for that. So we trust our gut feeling instead, which is a product of prior, limited experience.
Neither of us can be sure we picked the best option. If it's fast enough, we're happy and move on. Maybe, sometime later it turns out we need to optimize harder, but ideally not.
It's quite hard to tell how much confirmation bias is involved, and hard facts like knowledge about HW or benchmarks do not help much against that.
Juliean said:
I did include this notion from the start
Yeah, you did. I skimmed some earlier posts to check that. So i have to apologize.
Premature optimization is a problem too, so my accusation of ‘cultivating brute force’ wasn't justified, i see.
Juliean said:
I do wholeheartedly disagree with this as well. Simplicty should always be first, unless time-complexity is really needed, which should be a) absolutely be know, and b) measured.
I have to agree with this as well, ofc. But as said, to measure and to know, we need multiple implementations.
Before doing this work, it makes sense to ask others first. Often, the impression you get is convincing, so you can spare yourself the double work.
Juliean said:
Writing O(N^2) collision-detection should cost you the fraction of the time as creating a tree
Yep. My usual proposal is to start with a regular grid. In terms of implementation complexity it sits maybe 10% of the way between brute-force O(n^2) and using a tree, so it's still simple. And if you later need a quadtree because the world is large, the earlier work on the grid isn't lost; it makes the extension much easier.
I use such a grid approach quite often, not for collision detection, but to find things spatially for various reasons.
Before i do this, i work on the other things first to see that the entire algorithm works, and i just iterate over all objects for that purpose. After that, if the iteration turns out to be a bottleneck (very often the case), i make some trivial regular grid.
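The query side of such a trivial grid looks roughly like this, continuing the made-up Grid/Object layout from the sketch further up: instead of testing every object against every other one, each object only visits its own cell plus the 8 neighbouring cells. A real broad phase would still run narrow-phase tests on the reported pairs and deduplicate them; this is just the shape of it.

```cpp
// Report candidate pairs for one object by walking the intrusive lists of the
// 3x3 cell neighbourhood around it.
template <class Callback>
void queryNeighbours(const Grid& grid, const std::vector<Object>& objects,
                     int32_t objIndex, Callback&& onCandidate) {
    const Object& o = objects[objIndex];
    int cx = std::clamp(int(o.x / CELL_SIZE), 0, GRID_RES - 1);
    int cy = std::clamp(int(o.y / CELL_SIZE), 0, GRID_RES - 1);

    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            int nx = cx + dx, ny = cy + dy;
            if (nx < 0 || ny < 0 || nx >= GRID_RES || ny >= GRID_RES) continue;
            for (int32_t i = grid.firstObject[ny * GRID_RES + nx];
                 i != NO_OBJECT; i = objects[i].nextInCell) {
                if (i != objIndex) onCandidate(objIndex, i);
            }
        }
    }
}
```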
A need to use an actual tree arises only very rarely. If so, it's often a core concept of the entire work, so i make a specific system tailored to the problem, which is usually a long-term task.
I also have some generic BVH class, which i could use for anything. But actually i think i've used it only twice so far. It's not really great either, just good enough for low requirements.
I say that just to show i'm not a tree fetishist or something like that. : )
Juliean said:
Especially occlussion-culling always appears a bit “shizoid” to me - you render the geometry so you know what geometry not to render. K?
Hehe, yeah. Occlusion culling is a treasure chest if you look for bad algorithms or pointless HW features. : )
But it's a really hard problem. It was the holy grail in times of software rendering, and people were very creative about finding solutions. I worked on this too quite a bit.
But so far, no general solution has been found. And raw GPU power allowed us to postpone the problem. It even forced us to do so, because the basic solution (render front to back, and store information about which regions of the screen are already occluded) can't be implemented on early fixed-function GPUs.
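Just to illustrate that 'store information about occluded regions of screen' idea, here's a deliberately tiny sketch, not any real engine's code; the tile counts, the depth convention (0 = near, 1 = far) and all names are assumptions. The screen is reduced to a coarse tile grid, each tile remembers the depth up to which it is already fully covered, and objects processed front to back get skipped when every tile they touch is covered in front of them.

```cpp
#include <algorithm>

constexpr int TILES_X = 40, TILES_Y = 22;  // coarse occlusion buffer resolution (assumption)

struct ScreenRect {                        // projected screen-space bounds of an object
    int   x0, y0, x1, y1;                  // tile coordinates, inclusive
    float minZ, maxZ;                      // nearest / farthest depth of the object
};

struct OcclusionBuffer {
    float tileDepth[TILES_X * TILES_Y];

    void clear() {
        for (float& d : tileDepth) d = 1.0f;  // nothing covered yet (1 = far plane)
    }

    // Occluded only if the object's nearest depth is behind the covered depth
    // of every tile it touches. Conservative: any still-open tile keeps it visible.
    bool isOccluded(const ScreenRect& r) const {
        for (int y = r.y0; y <= r.y1; ++y)
            for (int x = r.x0; x <= r.x1; ++x)
                if (r.minZ < tileDepth[y * TILES_X + x]) return false;
        return true;
    }

    // After drawing an occluder, close its tiles down to its farthest depth.
    // Simplified: assumes the occluder really covers those tiles completely.
    void addOccluder(const ScreenRect& r) {
        for (int y = r.y0; y <= r.y1; ++y)
            for (int x = r.x0; x <= r.x1; ++x)
                tileDepth[y * TILES_X + x] = std::min(tileDepth[y * TILES_X + x], r.maxZ);
    }
};
```

Usage would be: sort roughly front to back, call isOccluded() per object, render it if visible, and call addOccluder() for big solid occluders.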
I often say: visibility and LOD are the key problems of computer graphics. Interestingly, HW acceleration blocked us from addressing them. Early rasterization HW prevented efficient occlusion culling methods, and now raytracing HW prevents any form of fine-grained LOD, including Nanite.
That's a bold claim, and we could discuss it, but it holds true. It's the primary reason i rant so much against the widespread assumption that GPUs gave us all graphics progress. People ignore all the limitations HW acceleration has caused, and still causes.
Juliean said:
So nanite from my understand is a solution that mostly solves those problems. You have all your data on the GPU, you don't need no more culling, etc…
I would say we had this long before, under the umbrella term of 'GPU driven rendering'.
The Nanite renderer does culling in a similar way to what UE4 already did before, so there is not much progress there. The software rasterizer also isn't that interesting, imo.
The remarkable feature is fine-grained (almost continuous) LOD, resolving the popping problem we had with earlier discrete LOD solutions. It's like having mip-maps for geometry, selecting only the resolution you actually need.
Because it's fine-grained, it's also much more efficient to process and needs less memory.
To select the current level of detail, some hierarchical data structure is needed (actually a BVH), which they also use for culling.
But there is no need for actual traversal. Instead you can just loop over the currently selected LOD cut, and for each cluster decide whether to increase, decrease, or keep the current detail.
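A hedged sketch of that loop, just to show the shape of it; this is not Nanite's actual code, and the data layout, the error heuristic and the thresholds are all my own assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct Cluster {
    int32_t parent     = -1;        // -1: already the coarsest level
    int32_t firstChild = -1;        // -1: already the finest level
    int32_t childCount = 0;
    float   cx = 0, cy = 0, cz = 0; // bounding sphere center
    float   radius = 0;             // bounding sphere radius
};

// Rough screen-space size of a cluster's bound; purely an illustrative heuristic.
static float projectedError(const Cluster& c, float camX, float camY, float camZ) {
    float dx = c.cx - camX, dy = c.cy - camY, dz = c.cz - camZ;
    float dist = std::sqrt(dx * dx + dy * dy + dz * dz);
    return c.radius / std::max(dist, 0.001f);
}

// One step of cut refinement: no tree traversal, just a flat loop over the
// currently selected clusters. Too coarse -> replace by children; finer than
// needed -> fall back to the parent; otherwise keep.
std::vector<int32_t> updateCut(const std::vector<Cluster>& clusters,
                               const std::vector<int32_t>& currentCut,
                               float camX, float camY, float camZ,
                               float errorThreshold) {
    std::vector<int32_t> next;
    for (int32_t id : currentCut) {
        const Cluster& c = clusters[id];
        float err = projectedError(c, camX, camY, camZ);
        if (err > errorThreshold && c.firstChild >= 0) {
            for (int32_t i = 0; i < c.childCount; ++i)   // increase detail
                next.push_back(c.firstChild + i);
        } else if (err < 0.5f * errorThreshold && c.parent >= 0) {
            next.push_back(c.parent);                    // decrease detail
        } else {
            next.push_back(id);                          // keep current detail
        }
    }
    // Caveat: several siblings may push the same parent; a real system would
    // deduplicate and keep sibling groups consistent. Omitted for brevity.
    return next;
}
```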
None of that is actually a new invention. Even the smart way to avoid a need for stitching was proposed in a much older paper, as i have learned. And we had continuous LOD in some games before, e.g. Messiah from the year 2000.
But still, nobody expected the amount of detail this enables. I didn't either, although i work on LOD myself. I always thought LOD was useful to increase efficiency, but i did not think it could be used to give such insane detail. :D
Juliean said:
We don't really know all the trials that have went into creating nanite, how much didn't pan out, how had to be discareded, redone etc…
We do. Karis gave a talk about that, similar to the very nice one about Dreams. The devs explain all the things they tried that did not work out.
That's really nice to read, and actually more inspiring than papers about actual results and progress.
I wish the path of failure would be a topic more often. It makes me feel less stupid. :D
Juliean said:
Using an optimized algorithm often can make your code extremely more complex, to the point where you can't even ensure that it is “good” anymore (in terms of reusability, coupling, what have you…).
Absolutely, yes. The price is high.
But for me that's everyday practice and normal. It turned out i'm more the research guy than the ‘make a game’ guy.
Now that you said this, i realize my programming experience is quite different from the norm, i guess? Probably so. Might explain some things, hmmm….