
Path tracing benchmark

Started January 25, 2017 01:21 PM
11 comments, last by Infinisearch 7 years, 2 months ago
It took some time after we built our first coherent traversal kernel for AVX-512 (http://ompf2.com/viewtopic.php?f=3&t=2103)
to get a competitive incoherent kernel ready for prime time. Here it is! For the benchmark we used a full pipeline that is easily available on any
architecture:
1) camera ray generation, traversal, intersection, shading
2) for any hit: 64 secondary diffuse rays, traversal, intersection and shading
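As a rough sketch of the ray budget this pipeline implies, here is the arithmetic for the worst case where every camera ray hits and spawns secondaries. The 1024x1024 framebuffer is an assumption taken from a later post in this thread, not stated here:

```python
# Hedged sketch: rays per frame for the two-stage pipeline above.
# The 1024x1024 framebuffer size is an assumption from a later post.
WIDTH, HEIGHT = 1024, 1024
SECONDARY_PER_HIT = 64              # diffuse rays per primary hit

primary_rays = WIDTH * HEIGHT                       # stage 1: camera rays
secondary_rays = primary_rays * SECONDARY_PER_HIT   # stage 2, worst case:
total_rays = primary_rays + secondary_rays          # every camera ray hits

def fps_at(rays_per_second):
    """Frame rate a given traversal rate sustains on this workload."""
    return rays_per_second / total_rays

print(total_rays)            # 68157440 rays per frame
print(round(fps_at(2.5e9)))  # a 2.5 GRays/s kernel -> ~37 fps
```

This is traversal only; shading and texture access would come on top.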
Architectures:
1) Nvidia GTX 1080: kernels based on "Understanding the Efficiency of Ray Traversal on GPUs"
2) Embree 2.13 AVX-512 kernels on an Intel Xeon Phi 7250 (1.4 GHz, 68 cores):
Kernels:
- avx512knl::BVH8Triangle4Intersector16HybridMoellerNoFilter for camera rays and
- avx512knl::BVH8Triangle4Intersector1Moeller for secondary diffuse rays
BVH-Compiler Settings:
- avx512knl::BVH8BuilderFastSpatialSAH
3) Our new AVX-512 kernels for coherent and incoherent ray transport on an Intel Xeon Phi 7250 (1.4 GHz, 68 cores)
[chart: incoherentresults.png — incoherent path tracing benchmark results]
Our new kernels clearly outperform the other implementations on all of the platforms currently used for path tracing.
Looking at the KNL CPU there is another advantage: we can connect directly to high-performance networks like InfiniBand,
so it scales extremely well compared to GPUs in a cluster-like configuration.
Finally, we also implemented/migrated our fast BVH compilers (http://rapt.technology/data/pssbvh.pdf) to AVX-512.
The compiler timings for the scenes above are:
Fairy: 13.6ms
Rungholt: 105.1ms
San Miguel: 359.3ms
Sponza: 7.1ms
The Embree compilers in any configuration are far behind these timings on our KNL test system, so we decided not to
publish any numbers here. The SBVH compiler used in "Understanding the Efficiency of Ray Traversal on GPUs" is far from
being optimized at all.
mp

Quote from the other forum: "to make it short: using coherent ray traversal, knl can render most of the scenes i have around stable below 1 ms into a 1024x1024 framebuffer."

This means only primary rays, yes? We know both raytracing and rasterization are fast enough for that, so how is the runtime for 64 secondary diffuse rays?

And some screenshots (or even a video) would be very nice and interesting, pictures are better than numbers :)


Quote from the other forum: "to make it short: using coherent ray traversal, knl can render most of the scenes i have around stable below 1 ms into a 1024x1024 framebuffer."

This means only primary rays, yes? We know both raytracing and rasterization are fast enough for that, so how is the runtime for 64 secondary diffuse rays?

And some screenshots (or even a video) would be very nice and interesting, pictures are better than numbers :)

I agree that a few pictures would be nice, to that end I think this is his website: http://rapt.technology/

...so how is the runtime for 64 secondary diffuse rays?

And some screenshots (or even a video) would be very nice and interesting, pictures are better than numbers :)

Look at the chart above. It is the total runtime for camera rays + 64 random diffuse rays, which is more than normally used in realtime PT.

video, paper etc. are on the way...

Thanks, the numbers are very impressive. I'm surprised.

FYI, I work on realtime GI, and counting my rays (never did before) I get > 300 MRays/s too with a FuryX, but I trace against a tree of discs, not triangles, and my scene is simpler than Sponza.

Also my algorithm is quite different and we can't compare this, but I'm sure you can squeeze out more from GPUs than your graphs show (the vendor may matter).

Too bad Larrabee did not make it to a consumer product, but probably Intel will hold this back until the time is right.

Personally I'd already agree to remove the hardware graphics pipeline from GPUs for more compute power.

Keep us up to date :)

...so how is the runtime for 64 secondary diffuse rays?

And some screenshots (or even a video) would be very nice and interesting, pictures are better than numbers :)

Look at the chart above. It is the total runtime for camera rays + 64 random diffuse rays, which is more than normally used in realtime PT.

video, paper etc. are on the way...

Well, let's do some math. Coherent rays don't matter for real-life applications, so I'm skipping the 2.5 billion rays/s benchmark. Still, 1 primary ray + 64 secondary rays at 2.5 billion rays/s leads to 18 fps @ 1080p with primary and secondary sampling alone. No texture sampling, no shader graph execution.

IMHO, only the San Miguel scene matters with incoherent rays; everything else is too simple compared to an average game. That's 100M rays/s tops, and at 60 fps @ 1080p that's 0.8 rays/pixel. Even with temporal anti-aliasing, that's not enough for anything. Besides this, shader execution not only breaks any hope of batching, it is also going to be more costly than the raytracing itself. In raytracers, tracing rays is usually the cheaper operation compared to executing a shader graph. So let's say that halves your ray budget, going down to 0.4 rays/pixel. You also want to do game logic, rebuild the BVH for animation, physics, etc. Another halving. So we are down to 0.2 rays/pixel on a 5k GPU.
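The arithmetic in this post can be reproduced directly; a quick sketch, assuming 1920x1080 and the stated 1 + 64 rays per pixel:

```python
# Reproducing the poster's estimates (assumed: 1080p, 1 + 64 rays/pixel).
pixels = 1920 * 1080                 # 2,073,600 pixels
rays_per_pixel = 1 + 64

# Coherent benchmark rate of 2.5 GRays/s:
fps = 2.5e9 / (pixels * rays_per_pixel)
print(round(fps, 1))                 # ~18.5 fps, matching the "18 fps" claim

# Incoherent San Miguel rate of 100 MRays/s at a 60 fps target:
budget = (100e6 / 60) / pixels
print(round(budget, 2))              # ~0.8 rays per pixel
```

The subsequent halvings for shading and for engine work are multiplicative: 0.8 × 0.5 × 0.5 = 0.2 rays/pixel.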

shaken, not stirred


Well, let's do some math. Coherent rays don't matter for real-life applications, so I'm skipping the 2.5 billion rays/s benchmark. Still, 1 primary ray + 64 secondary rays at 2.5 billion rays/s leads to 18 fps @ 1080p with primary and secondary sampling alone. No texture sampling, no shader graph execution. IMHO, only the San Miguel scene matters with incoherent rays; everything else is too simple compared to an average game. That's 100M rays/s tops, and at 60 fps @ 1080p that's 0.8 rays/pixel. Even with temporal anti-aliasing, that's not enough for anything. Besides this, shader execution not only breaks any hope of batching, it is also going to be more costly than the raytracing itself. In raytracers, tracing rays is usually the cheaper operation compared to executing a shader graph. So let's say that halves your ray budget, going down to 0.4 rays/pixel. You also want to do game logic, rebuild the BVH for animation, physics, etc. Another halving. So we are down to 0.2 rays/pixel on a 5k GPU.

Disagree. Those are all valid numbers, but if we change some things it's possible with current hardware:

Raytracing at half resolution is enough. With 300 MRays/s he needs 10 ms for this.
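Unpacking that estimate, assuming "half resolution" means half of 1080p in each dimension (960x540) — an assumption, since the post doesn't specify:

```python
# Hedged sketch of the half-resolution budget above.
# "Half resolution" is assumed to mean 960x540 (half of 1080p per axis).
half_res_pixels = 960 * 540          # 518,400 pixels
rays_per_second = 300e6              # the quoted 300 MRays/s
frame_budget_s = 0.010               # the quoted 10 ms

rays_per_frame = rays_per_second * frame_budget_s
print(round(rays_per_frame / half_res_pixels, 1))  # ~5.8 rays per pixel
```

So 10 ms at 300 MRays/s buys roughly 5-6 incoherent rays per half-res pixel under these assumptions.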

If we do object-space lighting and pre-transform vertices (which we'll likely do anyway), the pixel and vertex shaders become close to pass-through, and we can do the same for ray hits:

If we have some LOD of object-space lighting for the whole scene and accept diffuse-only at the hit point of a reflection ray, that's a single texture fetch and no further material-dependent execution.

(That's exactly what I'm doing, and it's possible, though not yet certain, at 60 fps on consoles. My results are low frequency but have infinite bounces; they have some lag but are temporally stable.

I'll also add 4x4 environment maps to remove the diffuse-only limitation. With this there is a fallback for everything, so rays can be shortened, getting closer to coherent-ray performance.)

That's personally biased, debatable, and skips a lot of details, but I think we would get down to a maximum of 4 ms for raytracing (+ another 4 ms for object-space lighting). Still time to upscale and post-process.

Now if we do temporal anti-aliasing and use 16 rays instead of 64, that's 1 ms for tracing, so we don't necessarily need a Phi.

Rebuilding the BVH is not worth mentioning; it's a preprocess. For animation we can easily refit bounding volumes and rebuild just the top levels of the tree.
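A minimal sketch of the refit idea: when only vertex positions animate, the tree topology is kept and the bounding boxes are recomputed bottom-up. The node layout here is a made-up toy structure, not anyone's actual BVH:

```python
# Toy BVH refit: keep the topology, recompute AABBs bottom-up.
from dataclasses import dataclass, field

@dataclass
class Node:
    lo: list            # AABB min (x, y, z)
    hi: list            # AABB max (x, y, z)
    tri_bounds: list = field(default_factory=list)  # leaf: per-triangle AABBs
    children: list = field(default_factory=list)    # inner node: child Nodes

def refit(node):
    """Recompute this subtree's AABB from its children (or its triangles)."""
    boxes = node.tri_bounds if not node.children else [refit(c) for c in node.children]
    node.lo = [min(b[0][i] for b in boxes) for i in range(3)]
    node.hi = [max(b[1][i] for b in boxes) for i in range(3)]
    return (node.lo, node.hi)

# Leaf boxes would come from the frame's animated triangle positions.
leaf_a = Node([0]*3, [0]*3, tri_bounds=[([0, 0, 0], [1, 1, 1])])
leaf_b = Node([0]*3, [0]*3, tri_bounds=[([2, 0, 0], [3, 1, 2])])
root = Node([0]*3, [0]*3, children=[leaf_a, leaf_b])
refit(root)
print(root.lo, root.hi)   # [0, 0, 0] [3, 1, 2]
```

Refit is O(n) and trivially cheap compared to a full SAH rebuild, at the cost of looser boxes under large deformation, which is why only the top levels get rebuilt.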

Game logic and physics will stay on the CPU as usual and have no effect on rendering.

If no triangle transformation is required, optimized and pre-generated data structures can be used. Furthermore, you can tighten the opening angle from the full hemisphere to something smaller. Both will give you quite another boost on top. Using a ray-budget distribution of 4x16 will increase the quality at the same time.
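A sketch of the tightened-opening-angle idea: draw directions uniformly from a cone of half-angle theta_max instead of the full hemisphere. The rotation of the cone onto the actual surface normal is omitted; +Z stands in for the normal here:

```python
# Uniform sampling inside a cone of half-angle theta_max about +Z.
import math, random

def sample_cone(theta_max, rng=random):
    """Uniformly distributed direction within theta_max of the +Z axis."""
    u, v = rng.random(), rng.random()
    cos_t = 1.0 - u * (1.0 - math.cos(theta_max))   # uniform in solid angle
    sin_t = math.sqrt(max(0.0, 1.0 - cos_t * cos_t))
    phi = 2.0 * math.pi * v
    return (sin_t * math.cos(phi), sin_t * math.sin(phi), cos_t)

# Every sample stays within theta_max of the axis; a narrower cone keeps
# a 64-ray (or 4x16) batch more coherent than full-hemisphere sampling.
rng = random.Random(42)
dirs = [sample_cone(math.radians(30), rng) for _ in range(1000)]
assert all(d[2] >= math.cos(math.radians(30)) - 1e-9 for d in dirs)
```

Setting theta_max to pi/2 recovers uniform hemisphere sampling, so the opening angle becomes a single tunable knob trading coverage against ray coherence.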

The benchmark above is just a worst-case brute-force approach to measure incoherent ray transport performance on different HW architectures. Using 4 of the Intel KNLs should give us the possibility to display some scenes in photo-realistic quality, which was not possible before.

mp

Just a quick update: we did some tests on Intel's Knights Mill. To make it short: the machine is boring.

No performance progress at all. For graphics, Intel seems to be a dead end; GPUs will dominate the next years.

So all the CPU work was wasted time.

 

mp

10 hours ago, mpeterson said:

Just a quick update: we did some tests on Intel's Knights Mill. To make it short: the machine is boring.

No performance progress at all. For graphics, Intel seems to be a dead end; GPUs will dominate the next years.

So all the CPU work was wasted time.

 

mp

Just so you know, Intel is creating their own GPU division now (they hired Raja Koduri, former head of RTG (Radeon Technology Group) at AMD). Also, IIRC they cancelled the next Xeon Phi, but the one after that is still on the roadmap.

-potential energy is easily made kinetic-

This topic is closed to new replies.
