DirectCompute: sync within warp

WS · 2017-11-06T17:16:27

In countless CUDA-related sources I've found that, when operating within a warp, one might skip __syncthreads() because all instructions are synchronous within a single warp. I followed that advice and applied it in DirectCompute (I use an NVIDIA GPU). I wrote this code, which does nothing but a good old prefix-sum of 64 elements (64 is the size of my block):

    groupshared float errs1_shared[64];
    groupshared float errs2_shared[64];
    groupshared float errs4_shared[64];
    groupshared float errs8_shared[64];
    groupshared float errs16_shared[64];
    groupshared float errs32_shared[64];
    groupshared float errs64_shared[64];

    void CalculateErrs(uint threadIdx)
    {
        if (threadIdx < 32) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
        if (threadIdx < 16) errs4_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
        if (threadIdx < 8) errs8_shared[threadIdx] = errs4_shared[2*threadIdx] + errs4_shared[2*threadIdx + 1];
        if (threadIdx < 4) errs16_shared[threadIdx] = errs8_shared[2*threadIdx] + errs8_shared[2*threadIdx + 1];
        if (threadIdx < 2) errs32_shared[threadIdx] = errs16_shared[2*threadIdx] + errs16_shared[2*threadIdx + 1];
        if (threadIdx < 1) errs64_shared[threadIdx] = errs32_shared[2*threadIdx] + errs32_shared[2*threadIdx + 1];
    }

This works flawlessly.

I noticed that I have bank conflicts here, so I changed the code to this:

    void CalculateErrs(uint threadIdx)
    {
        if (threadIdx < 32) errs2_shared[threadIdx] = errs1_shared[threadIdx] + errs1_shared[threadIdx + 32];
        if (threadIdx < 16) errs4_shared[threadIdx] = errs2_shared[threadIdx] + errs2_shared[threadIdx + 16];
        if (threadIdx < 8) errs8_shared[threadIdx] = errs4_shared[threadIdx] + errs4_shared[threadIdx + 8];
        if (threadIdx < 4) errs16_shared[threadIdx] = errs8_shared[threadIdx] + errs8_shared[threadIdx + 4];
        if (threadIdx < 2) errs32_shared[threadIdx] = errs16_shared[threadIdx] + errs16_shared[threadIdx + 2];
        if (threadIdx < 1) errs64_shared[threadIdx] = errs32_shared[threadIdx] + errs32_shared[threadIdx + 1];
    }

And to my surprise, this one causes race conditions. Is it because I should not rely on that functionality (auto-sync within a warp) when working with DirectCompute instead of CUDA? Because that hurts my performance by a measurable margin. With bank conflicts (first version) I am still around 15-20% faster than with the second version, which is conflict-free but needs GroupMemoryBarrierWithGroupSync between each assignment.
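For reference, the barrier-synchronized variant mentioned at the end of the post could look roughly like the sketch below. It is an illustration, not the poster's actual code: the function name CalculateErrsSynced is hypothetical, and it assumes errs1_shared has already been filled and synchronized by the caller.

```hlsl
// Sketch only: the conflict-free reduction with a group barrier after each level.
// CalculateErrsSynced is a hypothetical name; the arrays match the post.
groupshared float errs1_shared[64];
groupshared float errs2_shared[64];
groupshared float errs4_shared[64];
groupshared float errs8_shared[64];
groupshared float errs16_shared[64];
groupshared float errs32_shared[64];
groupshared float errs64_shared[64];

// Must be called by all 64 threads of the group from uniform control flow:
// the barriers sit outside the `if` statements because
// GroupMemoryBarrierWithGroupSync() may not be placed in divergent branches.
void CalculateErrsSynced(uint threadIdx)
{
    if (threadIdx < 32) errs2_shared[threadIdx] = errs1_shared[threadIdx] + errs1_shared[threadIdx + 32];
    GroupMemoryBarrierWithGroupSync(); // level 2 visible to every thread
    if (threadIdx < 16) errs4_shared[threadIdx] = errs2_shared[threadIdx] + errs2_shared[threadIdx + 16];
    GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 8) errs8_shared[threadIdx] = errs4_shared[threadIdx] + errs4_shared[threadIdx + 8];
    GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 4) errs16_shared[threadIdx] = errs8_shared[threadIdx] + errs8_shared[threadIdx + 4];
    GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 2) errs32_shared[threadIdx] = errs16_shared[threadIdx] + errs16_shared[threadIdx + 2];
    GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 1) errs64_shared[threadIdx] = errs32_shared[threadIdx] + errs32_shared[threadIdx + 1];
    // No barrier after the last level: add one in the caller if threads other
    // than thread 0 read errs64_shared[0] afterwards.
}
```

The five barriers are what the post reports as costing more than the bank conflicts they were meant to avoid.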

Started by maxest October 26, 2017 03:31 PM

10 comments, last by JoeJ 7 years, 3 months ago

JoeJ

November 06, 2017 05:16 PM

1 hour ago, maxest said:
Performance differs in both listings. Second one is around 15% faster.

What a shame.

You could use the preprocessor for separate code paths, like AMD_GCN, NV_KEPLER, NV_PASCAL, NV_SAVE, etc.

If a future chip is not known by your app, you can use NV_SAVE with all the barriers. But there's still a small risk that a driver update would break NV_PASCAL.
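A minimal sketch of that preprocessor idea, under some assumptions: the macro ERRS_LEVEL_SYNC is a hypothetical name, the defines are the ones suggested in this reply and would be supplied by the application when it compiles the shader (for example through a D3D_SHADER_MACRO array), and the barrier-free path is the first listing from the original post, which was reported to work on the poster's NVIDIA GPU.

```hlsl
// Sketch only. NV_KEPLER / NV_PASCAL / NV_SAVE are build-time defines chosen
// by the application; ERRS_LEVEL_SYNC is a hypothetical helper macro.
#if defined(NV_KEPLER) || defined(NV_PASCAL)
    // Known 32-wide hardware: skip the barriers, as in the first listing of
    // the original post. This relies on lockstep execution within a warp,
    // which the API does not promise, so a driver update could break it.
    #define ERRS_LEVEL_SYNC()
#else
    // NV_SAVE or an unknown chip: keep every barrier.
    #define ERRS_LEVEL_SYNC() GroupMemoryBarrierWithGroupSync()
#endif

groupshared float errs1_shared[64];
groupshared float errs2_shared[64];
groupshared float errs4_shared[64];
groupshared float errs8_shared[64];
groupshared float errs16_shared[64];
groupshared float errs32_shared[64];
groupshared float errs64_shared[64];

// Call from uniform control flow with all 64 threads of the group, so the
// barrier path stays legal.
void CalculateErrs(uint threadIdx)
{
    if (threadIdx < 32) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
    ERRS_LEVEL_SYNC();
    if (threadIdx < 16) errs4_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
    ERRS_LEVEL_SYNC();
    if (threadIdx < 8) errs8_shared[threadIdx] = errs4_shared[2*threadIdx] + errs4_shared[2*threadIdx + 1];
    ERRS_LEVEL_SYNC();
    if (threadIdx < 4) errs16_shared[threadIdx] = errs8_shared[2*threadIdx] + errs8_shared[2*threadIdx + 1];
    ERRS_LEVEL_SYNC();
    if (threadIdx < 2) errs32_shared[threadIdx] = errs16_shared[2*threadIdx] + errs16_shared[2*threadIdx + 1];
    ERRS_LEVEL_SYNC();
    if (threadIdx < 1) errs64_shared[threadIdx] = errs32_shared[2*threadIdx] + errs32_shared[2*threadIdx + 1];
}
```

Compiling the same source once per define yields one shader blob per code path; the application would then pick a blob from the detected adapter and fall back to the barrier build when the chip is unknown.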
