My edition is the 6th. (It has a bug with wrongly ordered execution/memory barriers; memoryBarrierShared(); barrier(); is the correct order.)
For cloth I was thinking of a very simple setting, like a grid where every vertex has valence 4 so neighbours can be indexed easily, plus simple Verlet integration. Similar to blurring an image or doing a fluid / smoke simulation on a grid, the thing to learn is caching things in LDS. You should probably only implement things you already understand. (Otherwise I personally still implement them on the CPU first.)
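To illustrate the LDS caching idea with the image blur case, here is a minimal sketch of a 3x3 box blur where each 16x16 workgroup first loads its tile plus a one-texel border into shared memory. The image bindings, the r32f format, and the names srcImage, dstImage and tile are assumptions made for the example, not something from a particular codebase.

```glsl
#version 450
layout(local_size_x = 16, local_size_y = 16) in;

// Hypothetical single-channel float images (assumption for this sketch).
layout(binding = 0, r32f) uniform readonly image2D srcImage;
layout(binding = 1, r32f) uniform writeonly image2D dstImage;

const int TILE = 16;           // must match local_size_x / local_size_y
const int PADDED = TILE + 2;   // tile plus a one-texel border on each side

// LDS cache: every texel is fetched from global memory once per workgroup.
shared float tile[PADDED * PADDED];

void main() {
    ivec2 size = imageSize(srcImage);
    // Top-left texel of the padded tile in image space.
    ivec2 origin = ivec2(gl_WorkGroupID.xy) * TILE - ivec2(1);

    // 256 threads cooperatively load 18 * 18 = 324 texels.
    for (uint i = gl_LocalInvocationIndex; i < uint(PADDED * PADDED); i += uint(TILE * TILE)) {
        ivec2 p = origin + ivec2(int(i) % PADDED, int(i) / PADDED);
        p = clamp(p, ivec2(0), size - ivec2(1));   // clamp at the image edges
        tile[i] = imageLoad(srcImage, p).r;
    }

    // Memory barrier first, then execution barrier.
    memoryBarrierShared();
    barrier();

    ivec2 gid = ivec2(gl_GlobalInvocationID.xy);
    if (gid.x >= size.x || gid.y >= size.y) return;   // safe: no barrier follows

    // All nine neighbour reads now hit LDS instead of global memory.
    ivec2 l = ivec2(gl_LocalInvocationID.xy) + ivec2(1);
    float sum = 0.0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            sum += tile[(l.y + dy) * PADDED + (l.x + dx)];

    imageStore(dstImage, gid, vec4(sum / 9.0));
}
```

A cloth grid with valence 4 vertices would follow the same pattern, just with positions in the tile instead of pixel values.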
16 minutes ago, Infinisearch said:
1 hour ago, JoeJ said:
Or more practical things like GPU culling (but i don't think there is much to learn here.)
I was thinking of implementing frustum culling as a simple example. I was wondering if an append buffer is the best solution for outputting the list of visible entities?
Probably yes. (I'd fill a buffer feeding an indirect draw.)
Khronos APIs don't have append buffers; there you increment a counter with atomicAdd to get a destination index, and then write your data to a regular buffer at that index. (Neither the counter nor the buffer is special, which is one reason why I think MS tends to overspecify.)
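As a rough sketch of that plain atomicAdd approach, a sphere-vs-frustum culling shader could look like the following. The buffer layouts, binding numbers and names (spheres, visibleCount, visibleIds, planes, entityCount) are all assumptions for illustration, and the counter is assumed to be zeroed by the host before each dispatch.

```glsl
#version 450
layout(local_size_x = 64) in;

// Hypothetical input: one bounding sphere per entity (xyz = center, w = radius).
layout(std430, binding = 0) readonly buffer Spheres { vec4 spheres[]; };
// An ordinary uint in an ordinary buffer serves as the append counter.
layout(std430, binding = 1) buffer Counter { uint visibleCount; };
// Output list of visible entity indices (assumed large enough for all entities).
layout(std430, binding = 2) writeonly buffer Visible { uint visibleIds[]; };

layout(std140, binding = 3) uniform Frustum {
    vec4 planes[6];     // xyz = inward-facing normal, w = plane offset
    uint entityCount;
};

bool sphereVisible(vec4 s) {
    for (int i = 0; i < 6; ++i)
        if (dot(planes[i].xyz, s.xyz) + planes[i].w < -s.w)
            return false;   // completely behind one plane
    return true;
}

void main() {
    uint id = gl_GlobalInvocationID.x;
    if (id >= entityCount) return;

    if (sphereVisible(spheres[id])) {
        // atomicAdd returns the value before the add, which is our slot.
        uint dst = atomicAdd(visibleCount, 1u);
        visibleIds[dst] = id;
    }
}
```

The final visibleCount can then be copied into the instance or draw count of an indirect draw argument buffer. Note that the output order is not deterministic.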
But this is a good example of a common optimization detail, avoiding expensive atomics to global memory:
Instead of appending to the global buffer by incrementing a global counter within EACH thread, we first write the list to LDS, incrementing a local (group shared) counter that lives in LDS as well. After the workgroup is done with the list (or the list becomes too full), only ONE thread does the atomic add to global memory, adding the list size, and then we copy the list from LDS to the global memory buffer. That's mostly faster (but not always).
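A minimal sketch of this two-level append, again for sphere-vs-frustum culling: the buffer layouts, bindings and names (spheres, visibleCount, visibleIds, localIds, localCount, globalBase) are assumptions for the example. Because each thread appends at most one entry here, the local list can never overflow, so the "too full" case does not come up.

```glsl
#version 450
layout(local_size_x = 64) in;

// Hypothetical buffers; visibleCount is assumed zeroed before the dispatch.
layout(std430, binding = 0) readonly buffer Spheres { vec4 spheres[]; };
layout(std430, binding = 1) buffer Counter { uint visibleCount; };
layout(std430, binding = 2) writeonly buffer Visible { uint visibleIds[]; };
layout(std140, binding = 3) uniform Frustum {
    vec4 planes[6];     // xyz = inward-facing normal, w = plane offset
    uint entityCount;
};

shared uint localIds[64];   // per-workgroup list in LDS (one slot per thread)
shared uint localCount;     // group shared counter
shared uint globalBase;     // start of this workgroup's range in visibleIds

bool sphereVisible(vec4 s) {
    for (int i = 0; i < 6; ++i)
        if (dot(planes[i].xyz, s.xyz) + planes[i].w < -s.w)
            return false;
    return true;
}

void main() {
    uint id  = gl_GlobalInvocationID.x;
    uint lid = gl_LocalInvocationIndex;

    if (lid == 0u) localCount = 0u;
    memoryBarrierShared();
    barrier();

    // No early return: every thread has to reach the barriers below.
    if (id < entityCount && sphereVisible(spheres[id])) {
        uint slot = atomicAdd(localCount, 1u);   // cheap atomic on LDS
        localIds[slot] = id;
    }
    memoryBarrierShared();
    barrier();

    // ONE global atomic per workgroup reserves space for the whole list.
    if (lid == 0u) globalBase = atomicAdd(visibleCount, localCount);
    memoryBarrierShared();
    barrier();

    // Copy the list from LDS to the global buffer.
    if (lid < localCount) visibleIds[globalBase + lid] = localIds[lid];
}
```

The cost is the extra LDS (here 64 + 2 uints per workgroup) and two additional barriers, which is part of why it is not always a win.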
I don't know if MS can decide to implement this by itself under the hood with append buffers, but I doubt it, because we would lose control over exact LDS usage, which affects occupancy.