JoeJ said:
Agreed, but which are the problems that affect you?
Dynamic geometries and acceleration structures (even when we talk about something like Nanite) … my most naive implementation some time back used HLBVH rebuilt each frame in compute. Since then I've moved to pre-built trees for clusters and assembling the BLAS from those each frame (rough sketch of the idea below). It's not miraculous and would need a lot more time invested from my side to improve (which I'll try to do in Q1 2025).
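For the curious, here is a minimal CPU-side sketch of what I mean by "assembling a BLAS from pre-built cluster trees". All structs and the naive pairing pass are made up for illustration and are not our actual code - a real build would run in compute and use SAH/LBVH over the cluster root bounds:

```cpp
// Illustrative sketch: assemble a per-mesh BVH ("BLAS") from pre-built cluster trees.
#include <algorithm>
#include <cstdint>
#include <vector>

struct AABB { float mn[3], mx[3]; };

// Node in a flattened BVH; leaves reference a triangle range, inner nodes reference children.
struct BvhNode {
    AABB bounds;
    int32_t left = -1, right = -1;   // child indices, -1 if none
    int32_t firstTri = -1;           // leaf payload (triangle range start), -1 for inner nodes
    int32_t triCount = 0;
};

// Offline-built cluster: a small BVH over ~100 triangles; nodes[0] is the cluster root.
struct ClusterTree { std::vector<BvhNode> nodes; };

static AABB Merge(const AABB& a, const AABB& b) {
    AABB r;
    for (int i = 0; i < 3; ++i) {
        r.mn[i] = std::min(a.mn[i], b.mn[i]);
        r.mx[i] = std::max(a.mx[i], b.mx[i]);
    }
    return r;
}

// Per frame: copy the pre-built subtrees unchanged and only build a shallow top tree
// over the selected cluster roots (here a trivial pairing pass, for brevity).
std::vector<BvhNode> AssembleBlas(const std::vector<const ClusterTree*>& clusters) {
    std::vector<BvhNode> out;

    // 1. Append pre-built cluster nodes, fixing up child indices to the new array.
    std::vector<int32_t> roots;
    for (const ClusterTree* c : clusters) {
        const int32_t base = (int32_t)out.size();
        for (BvhNode n : c->nodes) {
            if (n.left  >= 0) n.left  += base;
            if (n.right >= 0) n.right += base;
            out.push_back(n);
        }
        roots.push_back(base);
    }

    // 2. Build a small top tree over the cluster roots by naive pairing.
    std::vector<int32_t> level = roots;
    while (level.size() > 1) {
        std::vector<int32_t> next;
        for (size_t i = 0; i + 1 < level.size(); i += 2) {
            BvhNode inner;
            inner.left   = level[i];
            inner.right  = level[i + 1];
            inner.bounds = Merge(out[level[i]].bounds, out[level[i + 1]].bounds);
            next.push_back((int32_t)out.size());
            out.push_back(inner);
        }
        if (level.size() & 1) next.push_back(level.back()); // odd cluster carries over
        level = next;
    }
    return out; // the last pushed node is the BLAS root
}
```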
With HWRT you're stuck at roughly the level of that naive implementation (yes, you can group clusters together to make it “a little better” and not get punished too much by alignment costs, etc.), but it still sucks. Vendor-locked solutions might come (NVidia already presented one at CES - but honestly that's useless and bad … vendor-locked, and I assume UE5-only … not to mention UE5 performance is terrible everywhere) … adopting something like that into the core APIs? The party (NVidia) most guilty of pushing fixed function into ray tracers is now trying to fix the problems by introducing more fixed function? That's like extinguishing a fire by pouring tanks of gasoline onto it.
Caching acceleration structures is a joke (you can't do that in any reasonable way).
Intersectors on NVidia GPUs are on tensor cores (ray-triangle) and are meh. Try custom intersection code … now you're literally f***ed.
Is HWRT usable? To an extent, yes. Is it that much faster than compute? No, not really.
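For reference, "custom intersection code" in the compute path is just ordinary shader code along the lines of the scalar ray-triangle test below (written as plain C++ here for readability, purely illustrative). Once you need anything custom - quantized data, procedural primitives - you're writing and traversing with code like this yourself, with zero help from the fixed-function unit:

```cpp
// Illustrative only: the kind of software intersector you end up writing when the
// fixed-function ray/triangle path can't be used. Plain scalar Möller–Trumbore;
// in practice this lives in a compute shader, not C++.
#include <cmath>

struct Vec3 { float x, y, z; };

static Vec3  Sub(Vec3 a, Vec3 b)   { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static Vec3  Cross(Vec3 a, Vec3 b) { return { a.y * b.z - a.z * b.y,
                                              a.z * b.x - a.x * b.z,
                                              a.x * b.y - a.y * b.x }; }
static float Dot(Vec3 a, Vec3 b)   { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Returns true and writes tOut if the ray (orig, dir) hits triangle (v0, v1, v2).
bool RayTriangle(Vec3 orig, Vec3 dir, Vec3 v0, Vec3 v1, Vec3 v2, float& tOut) {
    const float kEps = 1e-7f;
    const Vec3 e1 = Sub(v1, v0);
    const Vec3 e2 = Sub(v2, v0);
    const Vec3 p  = Cross(dir, e2);
    const float det = Dot(e1, p);
    if (std::fabs(det) < kEps) return false;      // ray parallel to the triangle plane
    const float invDet = 1.0f / det;
    const Vec3 s = Sub(orig, v0);
    const float u = Dot(s, p) * invDet;
    if (u < 0.0f || u > 1.0f) return false;
    const Vec3 q = Cross(s, e1);
    const float v = Dot(dir, q) * invDet;
    if (v < 0.0f || u + v > 1.0f) return false;
    const float t = Dot(e2, q) * invDet;
    if (t <= kEps) return false;                  // hit is behind the ray origin
    tOut = t;
    return true;
}
```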
JoeJ said:
Maybe it would be better to write one backend per vendor, than trying to have crossvendor APIs for the price of compromised performance and redundant complexity.
This would become a nightmare real fast, especially when you add new GPU vendors into the mix (the ones from China, like Moore Threads). They often keep documentation in Chinese only … CUDA is already a disaster that everyone is caught in (which also includes everyone remotely related to hallucinations … er … "ai").
Vendor-locking works only if there is no monopoly. If there is a monopoly, that means literally zero progress. Have you seen what NVidia presented at CES? The 5xxx generation brings almost no actual improvement (hallucinating 3 frames instead of 1 (both are bad) and advertising it as “better performance” is just laughable; they could do the same with DLSS on the 4xxx series - but of course that will be locked out, since they have to sell that minimal hardware improvement somehow). And of course the price shifts upwards (everyone expected that, as they now hold a monopoly (>80% share) in many areas). I wouldn't say HW progress is dead - NV just doesn't need to try to sell, so they don't. Add a few more tensor cores to make the text generators happy, a 40% price increase, and we're selling!
Their main competitors? AMD are lunatics as usual, who spent more time renaming their GPU series than actually fixing the bugs they have (I'm ranting about their broken amplification shader implementation in D3D12, where nothing gets passed into an indirect dispatch of them!). Intel still hasn't recovered from melting their own CPUs (although I have to admit Arc is finally becoming something now).
Speaking of Intel - the "nanite" BVH idea I'm playing with is similar to what they tried in Embree for similar dynamic-LOD meshes. They published a paper about it a few years back.
JoeJ said:
It's not enough to just complain about missing flexibility, we need to make specific points and proposals I think. Might happen behind closed doors, but there should be public discussion as well imo.
Our solution is about as crazy as you can imagine - dual support of compute and HWRT. Sometimes HWRT wins in performance, sometimes it loses. The generic GPU programming tools at this point are enough for us to do everything we need in compute. The addition of work graphs is very welcome.
Of course, maintaining the code is a bit harder due to the extra classes/code … but it's nothing terrible (the skeleton below shows roughly how the two paths are kept apart).
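Structurally there's nothing magical about the dual setup: one interface, two backends, chosen per device/content. A hypothetical skeleton follows - type and method names are invented for illustration, not our real interface:

```cpp
// Hypothetical skeleton of a dual compute/HWRT setup. Names are made up;
// a real interface carries far more state (queues, descriptors, ray flags, ...).
#include <memory>

struct Scene {};      // placeholder for scene / acceleration-structure inputs
struct RayBatch {};   // placeholder for a buffer of rays
struct HitBuffer {};  // placeholder for intersection results

class RayTracerBackend {
public:
    virtual ~RayTracerBackend() = default;
    virtual void BuildOrUpdate(const Scene& scene) = 0;            // BVH build / BLAS+TLAS update
    virtual void Trace(const RayBatch& rays, HitBuffer& hits) = 0; // dispatch the actual tracing
};

// Software path: BVH build and traversal entirely in compute shaders.
class ComputeBackend final : public RayTracerBackend {
public:
    void BuildOrUpdate(const Scene&) override { /* compute BVH build / cluster assembly */ }
    void Trace(const RayBatch&, HitBuffer&) override { /* compute traversal kernels */ }
};

// Hardware path: DXR / VK ray tracing acceleration structures and pipelines.
class HwrtBackend final : public RayTracerBackend {
public:
    void BuildOrUpdate(const Scene&) override { /* BLAS/TLAS builds via the API */ }
    void Trace(const RayBatch&, HitBuffer&) override { /* DispatchRays / ray queries */ }
};

// Picked once per device or content profile; the renderer only talks to the interface.
std::unique_ptr<RayTracerBackend> MakeBackend(bool preferHwrt) {
    if (preferHwrt) return std::make_unique<HwrtBackend>();
    return std::make_unique<ComputeBackend>();
}
```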
Proposing any solution? That's hard - because from experience, different scenarios require different solutions/approaches. Nobody has rushed to generalize this, because it is very hard.
JoeJ said:
5 years have passed, Indiana Jones is the first game requiring HWRT, but there were no improvements on the API side at all.
Players have not adopted the hardware, and the requirements of games have skyrocketed. UE5 games are incapable of running 4K@60fps on top-end hardware (and we're talking a 2.5k+ EUR GPU like the RTX 5090). Hallucinating frames is bad (latency issues, temporal smearing, etc.) … and you can see that in the sales of those games (there are exceptions, like Wukong).
You can't reasonably add HWRT at scale into this.
Also, I'm not sure whether HWRT and DLSS4 won't bump into each other - the tensor cores used for intersections are the same ones used for hallucinating data. This could be a problem.
JoeJ said:
Currently I think the fix will only come when DX12 and VK get phased out and replaced by new APIs again, which is something that happens only every 20-30 years it seems. I'll be dead till then, so using HWRT inefficiently seems the only way to use it at all.
Hang in there, buddy! You're still young.
EDIT: Sorry if this sounds like quite a rant - I wrote it while messing with some vendor-specific branches in code, which is my favorite thing…