1 hour ago, maxest said: Performance differs in both listings. Second one is around 15% faster.
What a shame.
You could use the preprocessor for separate code paths, like AMD_GCN, NV_KEPLER, NV_PASCAL, NV_SAVE etc.
If your app doesn't recognize a future chip, it can fall back to NV_SAVE with all the barriers. But there's still a small risk that a driver update would break NV_PASCAL.
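Roughly what I mean, as a minimal host-side sketch. The enum, the function names and the OPTIONAL_BARRIER macro are made up for illustration, and I'm assuming an HLSL-style shader where the define is simply prepended to the source before compiling. How you actually detect the chip is up to you and not shown here.

```cpp
#include <string>

// Hypothetical list of architectures the app knows about.
// Anything it cannot identify ends up as Unknown.
enum class GpuArch { AmdGcn, NvKepler, NvPascal, Unknown };

// Returns the define line that selects the shader code path.
// Unknown chips get NV_SAVE, the conservative path with all barriers.
inline std::string shaderPathDefine(GpuArch arch) {
    switch (arch) {
        case GpuArch::AmdGcn:   return "#define AMD_GCN 1\n";
        case GpuArch::NvKepler: return "#define NV_KEPLER 1\n";
        case GpuArch::NvPascal: return "#define NV_PASCAL 1\n";
        case GpuArch::Unknown:  break;
    }
    return "#define NV_SAVE 1\n";
}

// Assumed shader-side prologue: only the NV_SAVE path keeps every barrier.
// Which barriers a tuned path can really drop has to be measured per chip.
inline const char* shaderPrologue() {
    return
        "#if defined(NV_SAVE)\n"
        "  #define OPTIONAL_BARRIER() GroupMemoryBarrierWithGroupSync()\n"
        "#else\n"
        "  #define OPTIONAL_BARRIER()\n"
        "#endif\n";
}

// Builds the final source string: define first, then prologue, then the shader body.
inline std::string buildShaderSource(GpuArch arch, const std::string& body) {
    return shaderPathDefine(arch) + shaderPrologue() + body;
}
```

You'd call something like buildShaderSource(detectedArch, shaderText) and hand the result to your shader compiler. Keeping the fallback as the default return means a new enum value you forget to handle still gets the path with all barriers.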