This entirely depends on the hardware, not the API. AMD has information in its OpenCL optimization guides: https://developer.amd.com/amd-accelerated-parallel-processing-app-sdk/opencl-optimization-guide/
Personally I never tested those effects for shared (LDS) memory, but I know that access patterns to global (main) memory matter a lot (I've often achieved 2x speedups just by changing them).
In addition to what those guides say, e.g. that large power-of-two strides between parallel threads are bad, I've noticed the same is true for serial access within a single thread (a sketch of the documented parallel-stride case follows below).
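To illustrate that documented case, here is a minimal sketch, assuming a simple 1D kernel (the kernel and buffer names are just placeholders):

// Each work item reads one element; the stride between neighbouring work items
// decides how the accesses spread over memory channels. A large power-of-two
// stride (256, 512, ...) tends to hit the same channels on GCN, while
// stride 1 gives fully coalesced access.
__kernel void strided_read(__global const float* src, __global float* dst, int stride)
{
    int gid = get_global_id(0);
    dst[gid] = src[gid * stride];
}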
As an example of the serial case, suppose each thread does this:
for (int i=0; i<n; i++) globalMemory[threadIndex*256+i] = x; // slow
... this has very bad performance on GCN. To fix it, change the stride to a non-power-of-two:
for (int i=0; i<n; i++) globalMemory[threadIndex*257+i] = x; // fast
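Put together as a complete kernel it looks roughly like this; the names and the 256-element block size are just placeholders from my tests, not anything official:

// Each work item fills its own contiguous block of n elements.
// With STRIDE = 256 (power of two) the per-thread blocks start at offsets that
// map to the same channels/banks on GCN; STRIDE = 257 staggers them.
// The buffer must hold at least globalWorkSize * STRIDE floats.
#define STRIDE 257 // use 256 to reproduce the slow case
__kernel void fill_blocks(__global float* globalMemory, float x, int n)
{
    int threadIndex = get_global_id(0);
    for (int i = 0; i < n; i++)
        globalMemory[threadIndex * STRIDE + i] = x;
}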
This seems to be undocumented, so I mention it here. On Nvidia both versions performed equally fast for me.
Here is one more resource about the effect with LDS memory: http://diaryofagraphicsprogrammer.blogspot.co.at/2015/01/reloaded-compute-shader-optimizations.html
It seems to be more of an issue on older hardware; I'm not sure about NV.
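If you do run into it with LDS, the usual workaround is the same padding trick. A minimal sketch of my own (not taken from the linked post), assuming 16x16 work groups and matrix dimensions that are multiples of 16:

// Matrix transpose through LDS. Padding each row of the LDS tile to 17 floats
// instead of 16 shifts consecutive rows onto different banks, so the
// column-wise reads below don't all hit the same bank.
__kernel void transpose16(__global const float* src, __global float* dst, int width, int height)
{
    __local float tile[16][17]; // 16x16 data + 1 padding column per row
    int lx = get_local_id(0), ly = get_local_id(1);
    int gx = get_group_id(0) * 16 + lx;
    int gy = get_group_id(1) * 16 + ly;
    tile[ly][lx] = src[gy * width + gx];  // coalesced row-major load
    barrier(CLK_LOCAL_MEM_FENCE);
    int ox = get_group_id(1) * 16 + lx;   // swap group indices for the output tile
    int oy = get_group_id(0) * 16 + ly;
    dst[oy * height + ox] = tile[lx][ly]; // column-wise LDS read, conflict-free thanks to padding
}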