
ID3D11Query is counting CPU time after Swapchain present

Started by July 02, 2019 06:38 PM
5 comments, last by Wuszt 5 years, 7 months ago

Hello, my simplified code looks like this:


//Queries initialization
// disjoint0 is D3D11_QUERY_TIMESTAMP_DISJOINT
// queryStart & queryEnd are D3D11_QUERY_TIMESTAMP

while (true)
{
    m_d3DeviceContext->Begin(disjoint0);

    //UpdatingScene
    //DrawingScene

    m_d3DeviceContext->End(queryStart);

    Sleep(10); //First sleep

    m_swapChain->Present(0, 0);

    m_d3DeviceContext->End(queryEnd);

    m_d3DeviceContext->End(disjoint0);

    Sleep(10); //Second sleep

    while (m_d3DeviceContext->GetData(disjoint0, NULL, 0, 0) == S_FALSE);

    D3D11_QUERY_DATA_TIMESTAMP_DISJOINT tsDisjoint;
    m_d3DeviceContext->GetData(disjoint0, &tsDisjoint, sizeof(tsDisjoint), 0);
    if (tsDisjoint.Disjoint)
        continue;

    UINT64 frameStart, frameEnd;

    m_d3DeviceContext->GetData(queryStart, &frameStart, sizeof(UINT64), 0);
    m_d3DeviceContext->GetData(queryEnd, &frameEnd, sizeof(UINT64), 0);

    double time = 1000.0 * (frameEnd - frameStart) / (double)tsDisjoint.Frequency;

    DebugLog::Log(time);
}
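
For context, the query creation that the comments above refer to looks roughly like this (just a sketch; m_d3Device is my device pointer and isn't shown anywhere else in this snippet):

// Sketch of the query creation referenced in the comments above.
// m_d3Device is assumed to be the ID3D11Device that created m_d3DeviceContext;
// disjoint0, queryStart and queryEnd are ID3D11Query* members.
D3D11_QUERY_DESC disjointDesc = {};
disjointDesc.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;
m_d3Device->CreateQuery(&disjointDesc, &disjoint0);

D3D11_QUERY_DESC timestampDesc = {};
timestampDesc.Query = D3D11_QUERY_TIMESTAMP;
m_d3Device->CreateQuery(&timestampDesc, &queryStart);
m_d3Device->CreateQuery(&timestampDesc, &queryEnd);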

The first sleep doesn't affect the GPU time at all (which is desirable, obviously), but the second one does. I've tried a lot of variations and it looks like sleeps before the swapchain Present are ignored by the GPU, but for some reason any sleep between Present and reading the disjoint query's data increases the measured GPU time by its value. Moving the queries' End calls around makes no difference.

Why do I care about that sleep? In my real code I issue the queries for frame n and, in the same frame, read back the data for frame n-1. So my GPU time results for frame n-1 are inflated by the time it takes to build frame n.

Why is this happening and what can I do to prevent it?

I think this is probably an artifact of when D3D11 decides to kick off / submit the command buffer that you're building behind the scenes. In D3D11 you essentially have no control over when a command buffer gets submitted - it could happen on literally any command for all you know.

Taking your code from above, imagine the sequence of operations was:


m_d3DeviceContext->Begin(disjoint0);
m_d3DeviceContext->End(queryStart);
Sleep(10); //First sleep
m_swapChain->Present(0, 0);			// Command buffer kick off here. GPU almost immediately does the work up until this point.
m_d3DeviceContext->End(queryEnd);	// This command is written but not executed yet...
m_d3DeviceContext->End(disjoint0);	// This command is written but not executed yet...
Sleep(10); //Second sleep			// CPU sleeps for 10ms
while (m_d3DeviceContext->GetData(disjoint0, NULL, 0, 0) == S_FALSE);	// GetData triggers a command buffer kickoff.

From the GPU's perspective:

1) In the first command buffer it saw Begin(disjoint0), End(queryStart) and Present(0,0).
2) No more GPU work was issued for 10ms.
3) The second command buffer contained End(queryEnd), End(disjoint0) and potentially any work associated with GetData.

The time between the GPU consuming and executing the commands associated with queryStart and queryEnd was 10ms, which is why the second Sleep affects the measured GPU time and the first one doesn't.

You're at the mercy of when kickoffs occur and I know of no way to guarantee that two calls to 'End' a timestamp query will reside in the same command buffer segment and therefore accurately measure the amount of GPU time taken between the two queries.

On D3D12 there's no such problem. When command lists get executed is entirely within the title's control, and a command list is executed from beginning to end without the GPU taking a break.

Is there a particular reason you want to measure GPU time using timestamp queries? Are you trying to profile your game?

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group


Thanks for the explanation!

What I'm trying to do is a tiny engine for comparing some AA algorithms, so I would appreciate at least stable measurements. I've seen other engines report GPU and CPU time per frame separately and I wanted to do the same, so I based my implementation on this article. I'm surprised that no one has noticed similar issues. Your explanation makes me sad, because I thought it was just related to the swapchain's Present and wouldn't have an influence if I only measured a specific algorithm. But if I understand you correctly, it can..

I wouldn't completely write off your approach; you are making the situation artificially worse right now by stalling on the results from the query immediately after the work is sent to the GPU. The article you linked to makes it clear that you should be double-buffering (at least) these queries and not stalling the CPU at any point waiting for the results - if you do that then I think you will see things improve.
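
To illustrate, a double-buffered setup might look roughly like this (the array, index and variable names are placeholders I've made up, not anything from the article):

// Double-buffering sketch: write this frame's queries into one slot while
// reading back the results of the slot written last frame. No spinning.
const UINT kNumBuffers = 2;
ID3D11Query* disjoint[kNumBuffers]; // created with CreateQuery as usual
ID3D11Query* tsStart[kNumBuffers];
ID3D11Query* tsEnd[kNumBuffers];
UINT frameIndex = 0;

// Each frame:
const UINT writeIdx = frameIndex % kNumBuffers;
const UINT readIdx = (frameIndex + 1) % kNumBuffers; // slot written one frame ago

context->Begin(disjoint[writeIdx]);
context->End(tsStart[writeIdx]);
// ... update and draw the scene ...
context->End(tsEnd[writeIdx]);
context->End(disjoint[writeIdx]);
swapChain->Present(0, 0);

if (frameIndex >= kNumBuffers - 1) // previous frame's results should be ready by now
{
    D3D11_QUERY_DATA_TIMESTAMP_DISJOINT tsDisjoint;
    if (context->GetData(disjoint[readIdx], &tsDisjoint, sizeof(tsDisjoint), 0) == S_OK
        && !tsDisjoint.Disjoint)
    {
        UINT64 start = 0, end = 0;
        context->GetData(tsStart[readIdx], &start, sizeof(start), 0);
        context->GetData(tsEnd[readIdx], &end, sizeof(end), 0);
        double ms = 1000.0 * (end - start) / (double)tsDisjoint.Frequency;
        // log ms; if GetData returned S_FALSE, just skip this frame's readout
    }
}
++frameIndex;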

I would also expect GPU timestamp queries wrapped around individual draw calls (or small sequences of draw calls) to be more accurate. Trying to measure the time an entire frame takes (including Present) without accidentally measuring the GPU idle time that comes at the end of each frame might prove tricky. If used carefully, you might also find that inserting a call to ID3D11DeviceContext::Flush after the last query in your frame (after Present) will kick pending commands off immediately rather than those commands ending up in the next kickoff later in the next frame (potentially after the GPU has already been idling for a while).
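
Something like this, based on the code you posted (the Flush at the end is the only new part):

// End of frame: present, finish the queries, then flush so the pending
// commands (including the query Ends) are kicked off immediately rather
// than riding along with the next frame's first kickoff.
m_swapChain->Present(0, 0);
m_d3DeviceContext->End(queryEnd);
m_d3DeviceContext->End(disjoint0);
m_d3DeviceContext->Flush();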

If you can get to a point where the GPU is 100% busy and you're double or triple buffering your queries you should get back much more reliable information.

That said, there is no substitute (in my opinion) for using a GPU profiler like NSight, Radeon GPU Profiler, Intel GPA or PIX. These tools will do a much better job at profiling your GPU workloads than timestamp queries.

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

I've definitely seen this problem before, for the reasons that Adam already explained. If you'd like to verify, I would suggest learning how to take an ETW capture with GPU capture enabled (UIforETW is handy for that) and then viewing it in GPUView. You'll be able to see how work actually gets scheduled on the GPU in relation to your CPU timeline, which means you can see if there are any gaps where the GPU was starved.

In general I've found that this isn't too much of an issue in practice as long as you can make sure that you're issuing D3D commands as quickly as possible, ideally all grouped together in one part of the frame right before you call Present. 

@Adam Miles As I said in my first post, I pasted simplified code. In my real code I do multi-buffering, but it makes things even worse. It looks like the GPU time is increased by the CPU time spent up to Swapchain::Present in the next frame.

Thanks for the advice, guys. I tried flushing as Adam suggested and it works! I'm also going to try out an external profiler to compare results. Right now it looks stable enough.

PS: Nice to meet you, @MJP. I'm a big fan of your AA articles. I learned a lot from them!

