
Writing portable Vulkan

Started by swiftcoder, August 14, 2018 04:40 AM
5 comments, last by swiftcoder 6 years, 5 months ago

Folks continue to tell me that Vulkan is a viable general-purpose replacement for OpenGL, and with the impending demise of OpenGL on Mac/iOS, I figured it's time to take the plunge... On the surface this looks like going back to the pain of old-school AMD/NVidia/Intel OpenGL driver hell.

What I'm trying to get a grasp of is where the major portability pitfalls are up front, and what my hardware test matrix is going to be like...

~~

The validation layers seem useful. Do they work sufficiently well that a program which validates is guaranteed to at least run on another vendor's drivers (setting aside performance differences)?
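For reference, turning them on is just a matter of enabling the layer at instance creation. A minimal sketch, assuming the SDK's VK_LAYER_KHRONOS_validation layer is installed (function name is mine, debug builds only):

#include <vulkan/vulkan.h>
#include <cstdio>

// Create an instance with the Khronos validation layer and the debug-utils
// extension enabled, so validation messages have somewhere to go.
VkInstance create_instance_with_validation()
{
    const char* layers[]     = { "VK_LAYER_KHRONOS_validation" };
    const char* extensions[] = { VK_EXT_DEBUG_UTILS_EXTENSION_NAME };

    VkApplicationInfo app = { VK_STRUCTURE_TYPE_APPLICATION_INFO };
    app.apiVersion = VK_API_VERSION_1_1;

    VkInstanceCreateInfo ci = { VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO };
    ci.pApplicationInfo        = &app;
    ci.enabledLayerCount       = 1;
    ci.ppEnabledLayerNames     = layers;
    ci.enabledExtensionCount   = 1;
    ci.ppEnabledExtensionNames = extensions;

    VkInstance instance = VK_NULL_HANDLE;
    if (vkCreateInstance(&ci, nullptr, &instance) != VK_SUCCESS)
        std::fprintf(stderr, "validation layer/extension not available\n");
    return instance;
}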

I assume I'm going to need to abstract across the various queue configurations? i.e. single queue on Intel, graphics+transfer on NVidia, graphics+compute+transfer on AMD? That seems fairly straightforward to wrap with a framegraph and bake the framegraph down to the available queues.
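For what it's worth, discovering what a given device actually offers is just a query over its queue families - a minimal sketch (function name is mine):

#include <vulkan/vulkan.h>
#include <vector>
#include <cstdio>

// Print the queue configuration of a device: single graphics family (typical
// Intel), graphics + dedicated transfer (NV), graphics + compute + transfer
// (AMD), etc. Note that graphics and compute families implicitly support
// transfer even when they don't advertise the bit, so the "dedicated" transfer
// family is the one that reports transfer only.
void report_queue_families(VkPhysicalDevice gpu)
{
    uint32_t count = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(gpu, &count, nullptr);
    std::vector<VkQueueFamilyProperties> families(count);
    vkGetPhysicalDeviceQueueFamilyProperties(gpu, &count, families.data());

    for (uint32_t i = 0; i < count; ++i) {
        const VkQueueFlags f = families[i].queueFlags;
        std::printf("family %u: %u queue(s)%s%s%s\n",
                    i, families[i].queueCount,
                    (f & VK_QUEUE_GRAPHICS_BIT) ? " graphics" : "",
                    (f & VK_QUEUE_COMPUTE_BIT)  ? " compute"  : "",
                    (f & VK_QUEUE_TRANSFER_BIT) ? " transfer" : "");
    }
}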

Memory allocation seems like it's going to be a pain. Obviously there are some big memory-scaling knobs, like render target resolution, enabling/disabling post-process effects, asset resolution, etc. But at some point someone has to play Tetris with the available memory on a particular GPU, and I don't really want to shove that off into manual configuration. Any pointers for techniques to deal with this in a sane manner?

Any other major pitfalls I'm liable to run into when trying to write Vulkan code to run across multiple vendors/hardware targets?

~~

As for the test matrix, I assume I'm going to need to test one each of recent AMD/NVidia/Intel, plus MoltenVK for Apple. Are the differences between subsequent architectures large enough that I need to test multiple different generations of cards from the same vendor? How bad is the driver situation in the Android chipset space?

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

I don't have any advice, but I'm messing around with Vulkan as well and would be interested in what you are asking about.

2 hours ago, swiftcoder said:

Are the differences between subsequent architectures large enough that I need to test multiple different generations of cards from the same vendor?

I have tested on NV 670, 1070 and AMD 7850 and Fury X. No issues, but I only have simple debug visuals and mainly compute so far. Validation is really helpful - it's harder to make the validation layers happy than the hardware; the hardware is more forgiving.

To me it seems much better than OpenGL, where I often had issues. I never want to use GL again after the move. But as said, I have not yet started work on the renderer, so I can't say much.

2 hours ago, swiftcoder said:

I assume I'm going to need to abstract across the various queue configurations? i.e. single queue on Intel, graphics+transfer on NVidia, graphics+compute+transfer on AMD?

Yes. I haven't figured out the details on anything other than AMD, but without doubt that graph has to be adjusted for various hardware - likely even different configurations for large vs. small AMD GPUs.

I'm focused on DX12 rather than Vulkan, but it's a lot of the same challenges.

If you don't use the compute queue, you're in exactly the same boat as you were with D11/GL. It's an amazing new optimisation opportunity, but old drivers don't use it at all, so if you don't use it (yet) then you're not really missing out. 

The transfer queue is most useful for creating buffers/textures - it's the only way to actually saturate a PCIe bus. Make a code path that uploads textures/buffers this way, then make a fallback that uses the CPU to copy the data (for on-board / Intel GPUs). In D12, Intel does actually expose a DMA queue which is backed by real hardware, but it's slower than the CPU copy implementation; it's only useful when you really want free background copies and are OK with huge latency.
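A minimal sketch of that upload path on the Vulkan side, assuming the device, transfer queue/family and both buffers (host-visible staging, device-local destination) were created elsewhere - names are mine, error handling omitted:

#include <vulkan/vulkan.h>
#include <cstdint>

// Copy a staging buffer to a device-local buffer on the dedicated transfer
// queue. In real code you'd signal a semaphore and do a queue-family ownership
// transfer to the graphics queue instead of blocking on a fence here.
void upload_on_transfer_queue(VkDevice device, VkQueue transferQueue,
                              uint32_t transferFamily,
                              VkBuffer staging, VkBuffer deviceLocal,
                              VkDeviceSize size)
{
    VkCommandPoolCreateInfo poolInfo = { VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO };
    poolInfo.flags = VK_COMMAND_POOL_CREATE_TRANSIENT_BIT;
    poolInfo.queueFamilyIndex = transferFamily;
    VkCommandPool pool;
    vkCreateCommandPool(device, &poolInfo, nullptr, &pool);

    VkCommandBufferAllocateInfo cbInfo = { VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO };
    cbInfo.commandPool = pool;
    cbInfo.level = VK_COMMAND_BUFFER_LEVEL_PRIMARY;
    cbInfo.commandBufferCount = 1;
    VkCommandBuffer cmd;
    vkAllocateCommandBuffers(device, &cbInfo, &cmd);

    VkCommandBufferBeginInfo begin = { VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO };
    begin.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;
    vkBeginCommandBuffer(cmd, &begin);

    VkBufferCopy region = { 0, 0, size };   // srcOffset, dstOffset, size
    vkCmdCopyBuffer(cmd, staging, deviceLocal, 1, &region);
    vkEndCommandBuffer(cmd);

    VkFenceCreateInfo fenceInfo = { VK_STRUCTURE_TYPE_FENCE_CREATE_INFO };
    VkFence fence;
    vkCreateFence(device, &fenceInfo, nullptr, &fence);

    VkSubmitInfo submit = { VK_STRUCTURE_TYPE_SUBMIT_INFO };
    submit.commandBufferCount = 1;
    submit.pCommandBuffers = &cmd;
    vkQueueSubmit(transferQueue, 1, &submit, fence);
    vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);

    vkDestroyFence(device, fence, nullptr);
    vkDestroyCommandPool(device, pool, nullptr);
}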

Memory management is a pain, but ultimately a better situation than GL/D11. If you went over your budget in the old APIs, the drivers would just silently insert 40ms-long paging operations in the middle of your frames, dropping you off a performance cliff with no notification. Worse, there was no way to query your application's memory budget on old APIs, which ironically made memory scaling more important. 

You're in the same boat now, but without the automatic "let's break interactivity and pretend everything is fine" auto paging built in. :D

There are kind of two strategies you can choose, or do both:

(1) Query your app's memory budget and try as best as you can to stay under it (the sketch below shows the budget query). When loading a level, load all the descriptions of your buffers/textures first, then, as a large batch operation, determine the memory requirements. If that would put you over budget, drop the LOD on some resources (e.g. what if you didn't load the top mip level?) and recalculate. Repeat until you've got a configuration that's in budget, and then start streaming in that set of assets.

(2) Each frame a resource is used, mark the heap that it's allocated within as being required on this frame number. If you hit an out-of-memory condition, try to find a heap that has not been used for a while and transfer it to system memory. Having relocatable resources is a massive pain in the ass though, so I'd focus on doing #1 more than #2, as the second option destroys performance just to keep limping along anyway (just like a D11/GL memory oversubscription event).
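On the Vulkan side, the budget query in (1) maps to the VK_EXT_memory_budget extension (in D12 it's IDXGIAdapter3::QueryVideoMemoryInfo). A minimal sketch, assuming the extension is enabled on the device:

#include <vulkan/vulkan.h>
#include <cstdio>

// Query the per-heap budget/usage estimate. The budget is what this process
// can reasonably use right now, and it moves as other apps use the GPU.
void print_memory_budget(VkPhysicalDevice gpu)
{
    VkPhysicalDeviceMemoryBudgetPropertiesEXT budget =
        { VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MEMORY_BUDGET_PROPERTIES_EXT };
    VkPhysicalDeviceMemoryProperties2 props =
        { VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MEMORY_PROPERTIES_2 };
    props.pNext = &budget;

    vkGetPhysicalDeviceMemoryProperties2(gpu, &props);

    for (uint32_t i = 0; i < props.memoryProperties.memoryHeapCount; ++i) {
        std::printf("heap %u: budget %llu MB, usage %llu MB\n", i,
                    (unsigned long long)(budget.heapBudget[i] >> 20),
                    (unsigned long long)(budget.heapUsage[i] >> 20));
    }
}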

Unfortunately, application memory budgets are dynamic (e.g. the user might start your game after a fresh reboot, then alt+tab and open 100 Chrome tabs...), so ideally you should respond to changing situations... Alternatively, you can just assume that users won't go opening other GPU-hungry apps while their game is running...

*lumbers into life for the first time in a while*

On memory: while it is a bit of a pain, it also has an advantage in that you can do something you previously couldn't - alias memory. Granted, this needs a frame graph to sort out memory lifetimes, but it does mean that you can control the memory footprint a bit better (EA's frame graph presentation from GDC '17, iirc, has some good numbers).

If you can't be bothered to deal with memory pain up front, then AMD's memory allocation library is a good place to start ( https://github.com/GPUOpen-LibrariesAndSDKs/VulkanMemoryAllocator ). The main thing you have to remember is that the number of actual allocations is limited - I've not seen support for more than 4096 in the wild - so you'll be block allocating and then sub-allocating from those blocks.

The main wrinkle I know about in this area is that NV seem to prefer that their render targets are allocated separately (see 'VK_KHR_dedicated_allocation', which might have been rolled into the 1.1 spec; the AMD lib above will use it afaik), and they have a fair few rules on alignments of types for buffers and images. AMD, on the other hand, seem not to care, often reporting very low alignment requirements, so be careful with that when trying to sub-allocate/work out block sizes. (Someone I was helping out was basing block sizes for buffers on alignment information; his NV card gave him a large-ish number so things worked, whereas AMD reported back '1', which caused him to allocate a new block per allocation.) It might be worth looking up the various numbers on https://vulkan.gpuinfo.org/ as a reference.
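To give a flavour of the AMD library, basic usage looks roughly like this - a sketch against a recent version of the header, so field names may differ slightly between releases:

#define VMA_IMPLEMENTATION        // in exactly one .cpp
#include "vk_mem_alloc.h"         // single header from the GPUOpen repo

// One allocator per device. It block-allocates VkDeviceMemory, sub-allocates
// from those blocks, and handles alignment and dedicated allocations for you.
VmaAllocator create_allocator(VkInstance instance, VkPhysicalDevice gpu, VkDevice device)
{
    VmaAllocatorCreateInfo info = {};
    info.instance = instance;
    info.physicalDevice = gpu;
    info.device = device;

    VmaAllocator allocator = VK_NULL_HANDLE;
    vmaCreateAllocator(&info, &allocator);
    return allocator;
}

// Buffers/images are then created through the allocator rather than by
// calling vkAllocateMemory yourself.
VkBuffer create_device_local_buffer(VmaAllocator allocator, VkDeviceSize size,
                                    VmaAllocation* outAllocation)
{
    VkBufferCreateInfo bufferInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
    bufferInfo.size = size;
    bufferInfo.usage = VK_BUFFER_USAGE_STORAGE_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT;

    VmaAllocationCreateInfo allocInfo = {};
    allocInfo.usage = VMA_MEMORY_USAGE_GPU_ONLY;   // device-local heap

    VkBuffer buffer = VK_NULL_HANDLE;
    vmaCreateBuffer(allocator, &bufferInfo, &allocInfo, &buffer, outAllocation, nullptr);
    return buffer;
}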

The other advice I would give is: test on AMD, as this is the only way to be sure you've not screwed up barriers/transitions.

Put simply, AMD require you to get this right; get it wrong and, at best, you'll luck out and it'll still work - at worst you'll get corruption or a crash.

NV, on the other hand, ignore all barrier function calls. Totally. They have their own low-level state tracking which they apply, so your barrier calls can be doing all manner of crazy incorrect things and it'll just work on their hardware/drivers (driver bugs notwithstanding, which is a bit of a facepalm given the whole point of the new APIs...).

Basically NV is not a good test platform for correctness. 
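For reference, a typical 'correct' transition (e.g. after uploading a texture) looks something like the sketch below - the access masks and stage flags are exactly the part NV will silently forgive and AMD will not (names are mine):

#include <vulkan/vulkan.h>

// Make a transfer write visible to fragment-shader reads and move the image
// from TRANSFER_DST to SHADER_READ_ONLY layout.
void transition_to_shader_read(VkCommandBuffer cmd, VkImage image, uint32_t mipLevels)
{
    VkImageMemoryBarrier barrier = { VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER };
    barrier.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
    barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
    barrier.oldLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
    barrier.newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.image = image;
    barrier.subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, mipLevels, 0, 1 };

    vkCmdPipelineBarrier(cmd,
                         VK_PIPELINE_STAGE_TRANSFER_BIT,         // stage(s) that wrote
                         VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,  // stage(s) that will read
                         0, 0, nullptr, 0, nullptr, 1, &barrier);
}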

5 hours ago, _the_phantom_ said:

Basically NV is not a good test platform for correctness. 

I guess some things never change :)

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

