
Compute Shader Works Intel not Nvidia

Started by July 31, 2020 12:23 PM
19 comments, last by NikiTo 4 years, 6 months ago

Changing the dimensions has some effect -

If your Texture3D is too big for the Nvidia VRAM, and if DX11 lets you create resources that large anyway, then reads/writes beyond the limits will be ignored or return zeros. The Intel GPU uses the computer's RAM, which is often more than the VRAM of a discrete GPU, so this could be the origin of your problem. The easiest way to test for this is to create a very small Texture3D, let's say 64x64x64, and run it on both GPUs, using 4,4,4 (2,2,2) again.

(A 256x256x256 Texture3D of FLOAT takes only 64 MB, but maybe you have the VRAM full of other resources already.)

Try that 64x64x64 (32x32x32?) test next, in case you have declared more resources and you think the VRAM could be running out.
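
Something like this is enough for the test (a rough D3D11 sketch with assumed names - i am not a DX11 user, so treat it only as a starting point, the format is a stand-in for whatever you really use):

    #include <d3d11.h>
    #include <wrl/client.h>

    // Sketch only: create a deliberately small 64x64x64 volume so VRAM pressure
    // can be ruled out. pDevice and the format are assumptions.
    Microsoft::WRL::ComPtr<ID3D11Texture3D> CreateTestVolume(ID3D11Device* pDevice)
    {
        D3D11_TEXTURE3D_DESC desc = {};
        desc.Width     = 64;
        desc.Height    = 64;
        desc.Depth     = 64;
        desc.MipLevels = 1;
        desc.Format    = DXGI_FORMAT_R32_FLOAT;
        desc.Usage     = D3D11_USAGE_DEFAULT;
        desc.BindFlags = D3D11_BIND_UNORDERED_ACCESS | D3D11_BIND_SHADER_RESOURCE;

        Microsoft::WRL::ComPtr<ID3D11Texture3D> tex;
        if (FAILED(pDevice->CreateTexture3D(&desc, nullptr, &tex)))
            return nullptr;    // a failed creation here is itself a useful data point
        return tex;
    }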

I don't know if DX11 allows too many resources to be created, beyond the VRAM amount.

I don't know either whether you can reinterpret a resource in DX11 the way you can in DX12. If you are reinterpreting the contents of the resource, then differences in the way the GPUs lay out data internally could produce different results.

You don't need a Texture3D at all if you are going to use Load() instead of Sample() to read from it.
Maybe you can get some speedup from better spatial adjacency, but that depends on whether the driver does the extra work of laying out the data internally in a better way, which could require more VRAM than 256x256x256x4 bytes.

Sorry, I can not suggest more to you. Differences that could produce a different result between Intel and NV are wavefront size, VRAM amount, and, in case of overflow or reinterpreting, the way the GPU lays out data internally (plus slight floating-point math variations too, but those differences would look more random).

@NikiTo The program only binds a few cube maps for the skybox, 1 cube mesh and 1 Texture1D transfer function. I tried a 64x64x64 volume and the results are interesting.

Intel Result: (image attachment)

Nvidia Result: (image attachment)

I still find it unusual that the normal calculations are fine and the texture produced matches the source, as I store the source iso value in the alpha channel and the normal in RGB, etc. But when producing a tiny (in this case 8x8x8) texture which tracks the intensity of iso values in regions, it produces lines. If I don't bind the normal texture, to reduce memory use, the same result occurs, so it must be something else. I wonder why the 128x256x256 case worked on Intel though.


It is time to manually check everything in your whole pipeline.
Take paper and pencil and manually execute your shaders. Manually compute the sizes of the resources and everything else.

On DX12 it often happens to me that writing to one resource overflows into another nearby resource.
If my addressing is bad, writing outside the limits of a resource is not ignored; it writes into the next resource, as if everything were one single resource. And if the shader writes by mistake to addresses that are too high, the app hangs.

It has happened to me to check a shader by hand tens of times and find no errors, and then, after many days and nearly going crazy, find out that another shader in the pipeline overflowed and was writing over the result of the current shader.

Time for you to check everything, line by line, word by word.

When you find the error, share it here for us to know.

@NikiTo Right, so I made a little progress: declaring the RWTexture as float, and on the CPU end creating it as R8_UNORM, allows the skull data to be generated into the UAV successfully.

However, the staging texture seems to be the core of the issue, as it still produces the lines we have been seeing. The interesting thing is that the pitch for the mapped resource looks completely wrong, stating 128 bytes instead of the expected 16 bytes.

I will try to find out why, but it's difficult to know what's going on atm; maybe my resource handle offset has a bug…

Edit:
I'm wondering if the 128-byte pitch is due to floats being 4 bytes per element, and copying back to the CPU side to fill an R8_UNORM buffer means everything is off by a factor of 4?
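
For reference, a minimal sketch of the staging readback being discussed (assumed names, not the project's actual code) - the key point is using the pitches that Map() reports rather than ones computed by hand:

    #include <d3d11.h>
    #include <wrl/client.h>

    // Sketch only: copy the GPU volume to a staging texture and read it back
    // using the pitches the driver reports. pDevice, pContext and pGpuVolume
    // are stand-ins for the real objects.
    void ReadBackVolume(ID3D11Device* pDevice, ID3D11DeviceContext* pContext,
                        ID3D11Texture3D* pGpuVolume)
    {
        D3D11_TEXTURE3D_DESC desc = {};
        pGpuVolume->GetDesc(&desc);
        desc.Usage          = D3D11_USAGE_STAGING;
        desc.BindFlags      = 0;
        desc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;
        desc.MiscFlags      = 0;

        Microsoft::WRL::ComPtr<ID3D11Texture3D> staging;
        if (FAILED(pDevice->CreateTexture3D(&desc, nullptr, &staging)))
            return;

        // Same dimensions and format, so the driver handles the layout translation.
        pContext->CopyResource(staging.Get(), pGpuVolume);

        D3D11_MAPPED_SUBRESOURCE mapped = {};
        if (SUCCEEDED(pContext->Map(staging.Get(), 0, D3D11_MAP_READ, 0, &mapped)))
        {
            // mapped.RowPitch and mapped.DepthPitch are whatever the driver chose
            // (often padded for alignment) and must be honoured when copying out,
            // row by row, as in the snippet further down.
            pContext->Unmap(staging.Get(), 0);
        }
    }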

@NikiTo Okay, I think I finally got it: I had to copy each row/slice out manually, like this:

    // Copy the mapped data out row by row, honouring both pitches:
    // stream is the mapped subresource; pDesc->Pitch is the tightly packed CPU-side row pitch.
    Byte* sptr = (Byte*)stream.pData;
    Byte* dptr = (Byte*)data;
    Uint32 rowCount   = stream.DepthPitch / stream.RowPitch;   // rows per slice as laid out by the driver
    Uint32 sliceCount = byteCount / stream.DepthPitch;         // kept as a sanity check; not used below

    // Copy only the smaller of the two row pitches so neither side is overrun.
    size_t msize = std::min<size_t>(pDesc->Pitch, stream.RowPitch);
    for (size_t d = 0; d < pDesc->Depth; d++)
    {
        for (size_t h = 0; h < rowCount; h++)
        {
            memcpy_s(dptr, pDesc->Pitch, sptr, msize);
            sptr += stream.RowPitch;   // skip the driver's per-row padding
            dptr += pDesc->Pitch;      // advance by the packed CPU-side pitch
        }
    }

Does it produce the same result on both GPUs now?


@NikiTo Yes, this now works on both Intel and Nvidia.

So it was the addressing that you modified in order to make it work?

@NikiTo I'm guessing yes, your assumption was correct. For some reason this specific texture results in a row pitch of 128 bytes even though it was created expecting a row pitch of 16 bytes, so the GPU/driver must be padding or optimizing the memory layout. I was under the impression we get given a pointer and could just straight up memcpy the entire lot, but no, you have to memcpy each row and then skip however many bytes the GPU decided to put in its pitch.

Someone else could probably explain it better, but stepping the CPU and GPU pointers by their respective pitches fixed it.

Jman2 said:
I'm guessing yes, your assumption was correct.


I think you were doing reinterpreting, but somebody who is working with DX11 should know better than me.

In DX12, if you interpret a resource as something else, the layout produces different results between GPUs. You interpreted a pitch of 16 as if it were a pitch of 128, or vice versa.

You read it as if it had a pitch of X when it actually has a pitch of Y.

I don't know how it is in DX11, but you should have ways to read data from the GPU without getting different results. If you created it as a Texture3D, always access it (read, write, copy and so on) only as a Texture3D. If you read/treat/interpret it as a Texture2D, you will get different results on different GPUs.

Layout is not a problem at all if you never reinterpret a resource as something else, something different from what it was initially created to be. In fact, you should be able to ignore the GPU's internal layout entirely.

Maybe somebody who uses DX11 here can help you with the exact commands to use to copy between resources in a layout-agnostic way.
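
Something like this should be the layout-agnostic way in DX11 (a rough sketch with assumed names - i don't use DX11, so somebody else should confirm the details):

    #include <d3d11.h>

    // Sketch only: let the D3D11 runtime handle the internal layout by copying
    // between resources of matching format, instead of touching raw bytes.
    void CopyVolume(ID3D11DeviceContext* pContext,
                    ID3D11Texture3D* pDst, ID3D11Texture3D* pSrc)
    {
        // Whole-resource copy: both textures must have identical dimensions, mips and format.
        pContext->CopyResource(pDst, pSrc);

        // Or copy just a sub-box, e.g. an 8x8x8 region starting at the origin.
        D3D11_BOX box = { 0, 0, 0, 8, 8, 8 };    // left, top, front, right, bottom, back
        pContext->CopySubresourceRegion(pDst, 0, 0, 0, 0, pSrc, 0, &box);
    }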

In my case, I always create flat buffers. That gives me fewer headaches.
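
A flat buffer for the volume could look roughly like this in DX11 (again only a sketch; the names and the float stride are assumptions). The shader then indexes it as x + y*width + z*width*height, and the readback side only has to deal with a single linear pitch:

    #include <d3d11.h>
    #include <wrl/client.h>

    // Sketch only: the volume stored as a flat structured buffer of
    // width*height*depth floats instead of a Texture3D.
    Microsoft::WRL::ComPtr<ID3D11Buffer> CreateFlatVolume(ID3D11Device* pDevice,
                                                          UINT width, UINT height, UINT depth)
    {
        D3D11_BUFFER_DESC desc = {};
        desc.ByteWidth           = static_cast<UINT>(width * height * depth * sizeof(float));
        desc.Usage               = D3D11_USAGE_DEFAULT;
        desc.BindFlags           = D3D11_BIND_UNORDERED_ACCESS | D3D11_BIND_SHADER_RESOURCE;
        desc.MiscFlags           = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
        desc.StructureByteStride = sizeof(float);

        Microsoft::WRL::ComPtr<ID3D11Buffer> buffer;
        if (FAILED(pDevice->CreateBuffer(&desc, nullptr, &buffer)))
            return nullptr;
        return buffer;    // create a UAV for writing and/or an SRV for reading over this buffer
    }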

I would recommend you to do exhaustive testing now, before saving it to a folder named “aCopyThatWorks”. Make sure you are good to keep going.
One of the cruellest situations is to think you fixed it and then, a few weeks later, to find out it is still not working.

This topic is closed to new replies.
