
DirectCompute - shared memory bank conflicts

Started April 01, 2018 11:48 AM
2 comments, last by _void_ 6 years, 10 months ago

Hi guys,

I am implementing parallel prefix sum in DirectCompute, using the GPU Gems 3 article on the CUDA implementation as a reference.

In the article, the authors add logic to handle shared memory bank conflicts.
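Roughly, the idea is to pad the shared array so that indices which would otherwise map to the same bank get spread out. Here is a minimal HLSL sketch of that padding (my own wording, not the article's exact CUDA macros; it assumes 32 banks and an illustrative group size of 256):

#define GROUP_SIZE 256
#define LOG_NUM_BANKS 5                        // 32 banks of 4 bytes; the article targets older hardware with 16
#define PAD(n) ((n) >> LOG_NUM_BANKS)          // one extra element for every 32 entries

groupshared float temp[GROUP_SIZE + (GROUP_SIZE >> LOG_NUM_BANKS)];

RWStructuredBuffer<float> data : register(u0);

[numthreads(GROUP_SIZE, 1, 1)]
void CSMain(uint3 gtid : SV_GroupThreadID, uint3 dtid : SV_DispatchThreadID)
{
    uint i = gtid.x;
    // every index into groupshared memory is shifted by the padding offset,
    // so threads that would otherwise hit the same bank are spread apart
    temp[i + PAD(i)] = data[dtid.x];
    GroupMemoryBarrierWithGroupSync();
    // the scan passes index temp the same way, e.g. temp[ai + PAD(ai)], temp[bi + PAD(bi)]
    data[dtid.x] = temp[i + PAD(i)];
}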

Mark Harris, one of the authors, later said on Stack Overflow that you no longer need to handle bank conflicts explicitly in CUDA.

What about DirectCompute? Do you need to manage this yourself? Is there a difference between the D3D10/D3D11/D3D12 versions?

 

Thanks!

This depends entirely on the hardware, not the API. AMD has information on this in its OpenCL optimization guide: https://developer.amd.com/amd-accelerated-parallel-processing-app-sdk/opencl-optimization-guide/

Personally, I have never tested those effects for shared (LDS) memory, but I know access patterns to global (main) memory matter a lot (I have often achieved 2x speedups there).

In addition to what those guides say, e.g. that large power-of-two strides between parallel threads are bad (sketched at the end of this post), I've noticed the same holds for serial access.

For example, if each thread does this:

for (int i = 0; i < n; i++) globalMemory[threadIndex * 256 + i] = x; // slow: power-of-two stride of 256 between threads

...it performs very badly on GCN. To fix it, change the stride to a non-power-of-two:

for (int i = 0; i < n; i++) globalMemory[threadIndex * 257 + i] = x; // fast: the odd stride avoids the aliasing

This seems to be undocumented, so I mention it. On Nvidia both versions performed equally fast for me.
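For the parallel-thread case the guides warn about, the bad/good pair would look something like this (again just a sketch; the exact stride that hurts depends on the channel/bank layout of the particular GPU):

globalMemory[threadIndex * 256] = x; // adjacent threads are 256 elements apart and tend to hit the same channel/bank
globalMemory[threadIndex] = x;       // adjacent threads touch adjacent elements - coalesced, usually the fast path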

 

 

Here is one more resource about this effect on LDS memory: http://diaryofagraphicsprogrammer.blogspot.co.at/2015/01/reloaded-compute-shader-optimizations.html

It seems to be more of an issue on older hardware; not sure about NV.
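If it does show up on your target hardware, the usual workaround is the same padding idea as in the GPU Gems chapter, e.g. for a tiled groupshared array (my example, assuming 32 four-byte banks):

groupshared float tile[16][16 + 1]; // the extra column shifts each row by one bank, so column-wise accesses no longer collide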

 

 


@JoeJ Great :-)  Thanks for the links!

This topic is closed to new replies.
