What are possible causes of a hung Direct3D 11 device?

Started by July 27, 2017 07:42 PM
10 comments, last by Holy Fuzz 7 years, 6 months ago

I am working on a game (shameless plug: Cosmoteer) that is written in a custom game engine on top of Direct3D 11. (It's written in C# using SharpDX, though I think that's immaterial to the problem at hand.)

The problem I'm having is that a small but understandably frustrated percentage of my players (about 1.5% of about 10K players/day) are getting frequent device hangs. Specifically, the call to IDXGISwapChain::Present() is failing with DXGI_ERROR_DEVICE_REMOVED, and calling GetDeviceRemovedReason() returns DXGI_ERROR_DEVICE_HUNG. I'm not ready to dismiss the errors as unsolvable driver issues because these players claim to not be having problems with any other games, and there are more complaints on my own forums about this issue than there are for games with orders of magnitude more players.
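For anyone logging the same failure, the reason code can be turned into a readable string before it goes into a crash report. This is only a minimal C++ sketch, not the game's actual code (which is C# with SharpDX): `describeRemovedReason` is a hypothetical helper, and the constants repeat the documented DXGI HRESULT values so that it builds without the Windows SDK headers.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Documented DXGI HRESULT values, repeated here (assumption: no Windows SDK
// headers are available, so the real DXGI_ERROR_* macros are not used).
constexpr std::uint32_t kInvalidCall         = 0x887A0001u; // DXGI_ERROR_INVALID_CALL
constexpr std::uint32_t kDeviceRemoved       = 0x887A0005u; // DXGI_ERROR_DEVICE_REMOVED
constexpr std::uint32_t kDeviceHung          = 0x887A0006u; // DXGI_ERROR_DEVICE_HUNG
constexpr std::uint32_t kDeviceReset         = 0x887A0007u; // DXGI_ERROR_DEVICE_RESET
constexpr std::uint32_t kDriverInternalError = 0x887A0020u; // DXGI_ERROR_DRIVER_INTERNAL_ERROR

// Hypothetical helper: maps the value returned by GetDeviceRemovedReason()
// to a short name suitable for a log line.
inline std::string describeRemovedReason(std::uint32_t hr) {
    switch (hr) {
        case kDeviceHung:          return "DEVICE_HUNG";
        case kDeviceRemoved:       return "DEVICE_REMOVED";
        case kDeviceReset:         return "DEVICE_RESET";
        case kDriverInternalError: return "DRIVER_INTERNAL_ERROR";
        case kInvalidCall:         return "INVALID_CALL";
        case 0u:                   return "S_OK";
        default:                   return "UNKNOWN";
    }
}
```

Logging the name next to the GPU vendor and driver version makes it easier to see whether the reports are all hangs or a mix of removal reasons.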

My first debugging step was, of course, to turn on the Direct3D debug layer and look for any errors/warnings in the output. Locally, the game runs 100% free of any errors or warnings. (And yes, I verified that I'm actually getting debug output by deliberately causing a warning.) I've also had several players run the game with the debug layer turned on, and they are also 100% free of errors/warnings, except for the actual hung device:


[MessageIdDeviceRemovalProcessAtFault] [Error] [Execution] : ID3D11Device::RemoveDevice: Device removal has been triggered for the following reason (DXGI_ERROR_DEVICE_HUNG: The Device took an unreasonable amount of time to execute its commands, or the hardware crashed/hung. As a result, the TDR (Timeout Detection and Recovery) mechanism has been triggered. The current Device Context was executing commands when the hang occurred. The application may want to respawn and fallback to less aggressive use of the display hardware).

So something my game is doing is causing the device to hang and the TDR to be triggered for a small percentage of players. The latest update of my game measures the time spent in IDXGISwapChain::Present(), and indeed in every case of a hung device, it spends more than 2 seconds in Present() before returning the error. AFAIK my game isn't doing anything particularly "aggressive" with the display hardware, and logs report that average FPS for the few seconds before the hang is usually 60+.
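The timing measurement described above can be sketched roughly as follows. This is an illustrative C++ version, not the game's C# code: `timeCall` and `TimedCall` are hypothetical names, and the threshold is a parameter because the 2-second figure is the default TDR timeout on Windows and may have been changed on a given machine.

```cpp
#include <cassert>
#include <chrono>
#include <utility>

// Result of timing one call: how long it took and whether it crossed the threshold.
struct TimedCall {
    std::chrono::milliseconds elapsed;
    bool exceededThreshold;
};

// Hypothetical helper: runs `call` (for example a lambda wrapping Present())
// and measures its wall-clock duration with a monotonic clock.
template <typename Callable>
TimedCall timeCall(Callable&& call, std::chrono::milliseconds threshold) {
    assert(threshold.count() >= 0);
    const auto start = std::chrono::steady_clock::now();
    std::forward<Callable>(call)();
    const auto end = std::chrono::steady_clock::now();
    const auto elapsed =
        std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    return TimedCall{elapsed, elapsed > threshold};
}
```

A call such as `timeCall([&] { /* Present here */ }, std::chrono::milliseconds(2000))` would flag frames where Present blocked for longer than the assumed timeout. A monotonic clock is used so that system clock adjustments cannot distort the measurement.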

So now I'm pretty stumped! I have zero clues about what specifically could be causing the hung device for these players, and I can only debug post-mortem since I can't reproduce the issue locally. Are there any additional ways to figure out what could be causing a hung device? Are there any common causes of this?

Here's my remarkably un-interesting Present() call:


SwapChain.Present(_vsyncIn ? 1 : 0, PresentFlags.None);

I'd be happy to share any other code that might be relevant, though I don't myself know what that might be. (And if anyone is feeling especially generous with their time and wants to look at my full code, I can give you read access to my Git repo on Bitbucket.)

Some additional clues and things I've already investigated:

1. The errors happen on all OSes my game supports (Windows 7, 8, 10, both 32-bit and 64-bit), GPU vendors (Intel, Nvidia, AMD), and driver versions. I've been unable to discern any patterns with the game hanging on specific hardware or drivers.

2. For the most part, the hang seems to happen at random. Some individual players report it crashes in somewhat consistent places (such as on startup or when doing a certain action in the game), but there is no consistency between players.

3. Many players have reported that turning on V-Sync significantly reduces (but does not eliminate) the errors.

4. I have ensured that my code never makes calls to the immediate context or DXGI on multiple threads at the same time by wrapping literally every call to the immediate context and DXGI in a mutex region (C# lock statement). (My code *does* sometimes make calls to the immediate context off the main thread to create resources, but these calls are always synchronized with the main thread.) I also tried synchronizing all calls to the D3D device as well, even though that's supposed to be thread-safe. (Which did not solve *this* problem, but did, curiously, fix another crash a few players were having.)

5. The handful of places where my game accesses memory through pointers (it's written in C#, so it's pretty rare to use raw pointers) are done through a special SafePtr that guards against out-of-bounds access and checks to make sure the memory hasn't been deallocated/unmapped. So I'm 99% sure I'm not writing to memory I shouldn't be writing to.

6. None of my shaders use any loops.
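As an illustration of the locking approach in point 4, one way to make "every call goes through the mutex" hold by construction is to hide the shared object behind a wrapper that only hands it out while the lock is held. This is a hedged C++ sketch rather than the game's C# code: `Synchronized` and `withLock` are hypothetical names, and the wrapped value stands in for the immediate context.

```cpp
#include <cassert>
#include <mutex>
#include <utility>

// Hypothetical wrapper: owns a resource (a stand-in for the immediate context)
// and only exposes it to callers while the mutex is held.
template <typename T>
class Synchronized {
public:
    explicit Synchronized(T value) : value_(std::move(value)) {}

    // Runs `fn` with exclusive access to the resource and returns fn's result.
    template <typename Fn>
    auto withLock(Fn&& fn) -> decltype(std::forward<Fn>(fn)(std::declval<T&>())) {
        std::lock_guard<std::mutex> guard(mutex_);
        return std::forward<Fn>(fn)(value_);
    }

private:
    std::mutex mutex_;  // guards value_
    T value_;
};
```

Because the resource is private, there is no way to touch it without going through `withLock`, so a forgotten lock becomes a compile error instead of a rare race. One difference from the C# lock statement: `std::mutex` is not reentrant, so calling `withLock` again from inside `fn` would deadlock.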

Thanks for any clues or insights you can provide. I know there's not a lot to go on here, which is part of my problem. I'm coming to you all because I'm out of ideas for what to investigate next, and I'm hoping someone else here has ideas for possible causes I can investigate.

Thanks again!
 

Usually this problem happens because you have:

  1. Corrupted memory or similar memory error. e.g. setting a dangling pointer as a texture SRV is bad.
  2. Shader being used with uninitialized data (const buffer, tex buffer, vertex buffer, etc)
  3. Infinite loop inside shader. Be it vertex, compute, or pixel shader. This is often caused by uninitialized variables, or variables with very large values or NaNs.
  4. Some very obscure API usage the debug layer didn't catch.

However you ruled out most of these (except point #1 & #4). #1 is debugged the same way you debug any kind of memory corruption (either use a third party tool or override malloc and hook your own sanitizer).

Other causes for these issues are:

  • Out of date drivers. Seriously. This happens very often. Ask for driver version. If it's very old, ask them to update their drivers. This happens more often than you think. GPU problems that magically go away after updating drivers.
  • Overclocked / overheating systems. Simple games will often allow everything in the GPU to run at 100%, something AAA games often don't achieve (because there's usually a huge bottleneck somewhere). It would explain why using VSync helps with the problem.
  • Switchable graphics. Some notebooks may have Intel + NVIDIA GPUs combination (or Intel + AMD, but that's less common) and for some reason the system may have decided to switch the GPU (e.g. battery, thermal throttling).
  • Monitor issues. The user literally detached / unplugged the monitor to which the active GPU was rendering to. More common on laptops and Win 10 tablets.
  • Third party applications. Apps like MSI Afterburner, Plays.tv, Mumble hook themselves to D3D11 to intercept the game's calls and either capture video or render overlays on top of it. They can also cause problems. Having a dump of all active processes when the game hung the GPU can be a good way to rule this out. If you find a common third-party app between a large percentage of these users, ask them to turn it off.
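For the last bullet, once process lists have been collected from several hang reports, finding the apps they have in common is a simple tally. The following is a minimal C++ sketch under assumptions: how the process names are gathered is left out, and `ProcessList`, `countReportsPerProcess` and `commonProcesses` are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical input: one list of running process names per hang report.
using ProcessList = std::vector<std::string>;

// Counts in how many reports each process name appears (at most once per report).
inline std::map<std::string, int> countReportsPerProcess(
        const std::vector<ProcessList>& reports) {
    std::map<std::string, int> counts;
    for (const ProcessList& report : reports) {
        const std::set<std::string> unique(report.begin(), report.end());
        for (const std::string& name : unique) {
            ++counts[name];
        }
    }
    return counts;
}

// Returns the names present in at least `minFraction` of the reports,
// in alphabetical order.
inline std::vector<std::string> commonProcesses(
        const std::vector<ProcessList>& reports, double minFraction) {
    assert(minFraction >= 0.0 && minFraction <= 1.0);
    std::vector<std::string> result;
    if (reports.empty()) {
        return result;
    }
    const double needed = minFraction * static_cast<double>(reports.size());
    for (const auto& entry : countReportsPerProcess(reports)) {
        if (entry.second >= needed) {
            result.push_back(entry.first);
        }
    }
    return result;
}
```

The game's own executable and standard system processes will always show up, so the interesting entries are the ones that are common among hang reports but rare among players who never hang.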

 

1 hour ago, Holy Fuzz said:

4. I have ensured that my code never makes calls to the immediate context or DXGI on multiple threads at the same time by wrapping literally every call to the immediate context and DXGI in a mutex region (C# lock statement). (My code *does* sometimes make calls to the immediate context off the main thread to create resources, but these calls are always synchronized with the main thread.) I also tried synchronizing all calls to the D3D device as well, even though that's supposed to be thread-safe. (Which did not solve *this* problem, but did, curiously, fix another crash a few players were having.)
 

Errr, unless you're really really good at multithreading, you shouldn't be making D3D API calls from other threads. It's asking for a lot of problems.

This also explains why VSync diminishes the problem, since you're likely in a race condition and now the access patterns have changed.

IIRC accessing the immediate context from two threads is not allowed, even if protected by a mutex.

Update: It's allowed, but you're still asking for trouble. Also, you'd better be synchronizing your context absolutely perfectly.

Matias, that's very helpful info, thanks a lot! It gives me some good next-steps to look into.

 

25 minutes ago, Matias Goldberg said:

Infinite loop inside shader. Be it vertex, compute, or pixel shader. This is often caused by uninitialized variables, or variables with very large values or NaNs.

Can this only happen if I have explicit loops in my shader code, or can it also happen by passing bad values to an intrinsic function?

 

25 minutes ago, Matias Goldberg said:

Simple games will often allow everything in the GPU to run at 100%, something AAA games often don't achieve (because there's usually a huge bottleneck somewhere). It would explain why using VSync helps with the problem.

I had a similar thought, so I added a by-default 100 FPS limit to my game. No discernible drop in device hangs though.

 

24 minutes ago, Matias Goldberg said:

Switchable graphics. Some notebooks may have Intel + NVIDIA GPUs combination (or Intel + AMD, but that's less common) and for some reason the system may have decided to switch the GPU (e.g. battery, thermal throttling).

Would this commonly cause a hang, or a different DEVICE_REMOVED error?

 

33 minutes ago, Matias Goldberg said:

Errr, unless you're really really good at multithreading, you shouldn't be making D3D API calls from other threads. It's asking for a lot of problems.

 

Is this just because multithreading is hard to get right, or additionally because there could be underlying problems in D3D/drivers that could be causing problems when called from multiple threads even with proper synchronization?

 

Thanks again for your help! Much appreciated.

1 hour ago, Holy Fuzz said:

Can this only happen if I have explicit loops in my shader code, or can it also happen by passing bad values to an intrinsic function?

I'm not sure what you mean by "bad values to an intrinsic function".

As for your loops, if they look like these:


for( int i=0; i<4; ++i )
{
}
                   
//Or this:
#define LOOP_COUNT 4
for( int i=0; i<LOOP_COUNT; ++i )
{
}

It's fine. But if it looks like this:


uniform int myConstValue;

for( int i=0; i<myConstValue; ++i )
{
}

Then the value you pass to myConstValue is potentially dangerous (you'd better never send an insanely huge value).
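One way to guard against that on the CPU side is to clamp the value just before it is written into the constant buffer. This is a minimal C++ sketch under assumptions: `sanitizeLoopCount` is a hypothetical helper, and the limit of 64 is an arbitrary placeholder rather than a recommended number.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical upper bound; choose whatever the shader can afford per invocation.
constexpr std::int32_t kMaxShaderLoopCount = 64;

// Clamps a requested loop count into [0, kMaxShaderLoopCount] before it is
// uploaded as the value the shader reads from myConstValue.
inline std::int32_t sanitizeLoopCount(std::int32_t requested) {
    const std::int32_t clamped =
        std::min(std::max(requested, std::int32_t{0}), kMaxShaderLoopCount);
    assert(clamped >= 0 && clamped <= kMaxShaderLoopCount);
    return clamped;
}
```

The same bound can also be applied inside the shader, for example by looping up to the smaller of myConstValue and a compile-time maximum, so that a corrupted or uninitialized constant buffer still cannot produce an extremely long loop.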

1 hour ago, Holy Fuzz said:

Would this commonly cause a hang, or a different DEVICE_REMOVED error?

Normally yes, but lots of things can happen to report the wrong enum (driver bugs, the GPU actually hung while switching)

1 hour ago, Holy Fuzz said:

Is this just because multithreading is hard to get right, or additionally because there could be underlying problems in D3D/drivers that could be causing problems when called from multiple threads even with proper synchronization?

Both. Multithreading is hard to get right. You said you properly put a mutex around the immediate context... but do you really put the mutex on every single usage of the immediate context? Is it also possible the mutex is malfunctioning (i.e. unlocking from a thread without locking it first)?

Additionally, a driver may be reading data from the immediate context and assuming it's fully single threaded so it begins to read the data from a worker thread while you're actually still writing to it from a secondary thread. Technically, this would be a driver bug. It may even be fixed by now, but your user could be running an old driver.

That makes a lot of sense. Thanks again for the reply!

Oh btw on loops:

If your loop is based on equality of floating point, then there could be issues. For example:


for( float x=0; x != 0.3; x += 0.1 )
{
}

May spin forever due to precision issues.
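The effect is easy to reproduce on the CPU. The sketch below is a C++ illustration with hypothetical function names, and it uses double precision, so it shows the general hazard rather than what a particular GPU does with single-precision floats. The equality version carries an iteration cap only so that the demo returns; a shader would have no such cap.

```cpp
#include <cassert>

// Counts iterations of a loop that stops only when x is exactly 0.3.
// `maxIterations` is a safety cap for this demo.
inline int iterationsWithEqualityTest(int maxIterations) {
    assert(maxIterations > 0);
    int iterations = 0;
    for (double x = 0.0; x != 0.3 && iterations < maxIterations; x += 0.1) {
        ++iterations;  // x goes 0, 0.1, 0.2, 0.30000000000000004, ... and skips 0.3
    }
    return iterations;
}

// Same intent, driven by an integer counter; x is derived from i,
// so termination does not depend on floating-point rounding.
inline int iterationsWithCounter(int stepCount) {
    int iterations = 0;
    for (int i = 0; i < stepCount; ++i) {
        const double x = 0.1 * i;
        (void)x;  // a real loop body would use x here
        ++iterations;
    }
    return iterations;
}
```

With a cap of 1000 the first function runs all 1000 iterations because the accumulated value never compares equal to 0.3, while the counter version with a step count of 3 runs exactly 3 times. Comparing with `<` instead of `!=` also terminates, but the integer counter additionally makes the iteration count exact.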

7 hours ago, Holy Fuzz said:

My code *does* sometimes make calls to the immediate context off the main thread to create resources

Resource creation happens on the device, not the context, and the device is already thread safe. If all you're doing is creating resources, you probably don't need to share the immediate context like this.

3 hours ago, Hodgman said:

Resource creation happens on the device, not the context, and the device is already thread safe. If all you're doing is creating resources, you probably don't need to share the immediate context like this.

Yeah, you're right. I think I was thinking of updating resources after creation, though looking over my code again, I don't think I actually ever do that anywhere but the main thread.

That being said, I fixed another GPU crash some players were experiencing simply by wrapping calls to the device in lock statements, so I'm not convinced that all drivers are as thread-safe as they're supposed to be.

On 7/28/2017 at 4:32 PM, Holy Fuzz said:

That being said, I fixed another GPU crash some players were experiencing simply by wrapping calls to the device in lock statements, so I'm not convinced that all drivers are as thread-safe as they're supposed to be.

Lots of other games rely on the D3D device adhering to its thread safety promises, and would crash if those promises weren't kept.

So either you've found a thread-safety bug in D3D that only occurs in your game, or you've got a threading bug elsewhere in your code, and adding extra sync points has happened to change the timings just enough that the race is now unlikely to occur (but is still lurking as a non-symptomatic bug).
