
Depth mask using InterlockedMin - what is wrong with this function? [solved]

Started March 10, 2018 06:01 PM
5 comments, last by sebjf 6 years, 10 months ago

Hi,

I am trying to brute-force a closest-point-to-closed-triangle-mesh algorithm on the GPU by creating one thread for each point-primitive pair and keeping only the nearest result for each point. However, this code fails: multiple writes are made by threads that computed different distances.

To keep only the closest value, I attempt to mask using InterlockedMin, followed by a memory barrier and a conditional that writes only if the current thread's distance still equals the value in the mask.

I have included the function below.

As can be seen, for debugging I have modified it to write to a different location every time the conditional succeeds. Multiple writes are expected, for example where the closest point is a vertex shared by multiple triangles, but when I read back closestPoints and calculate the distances, they are different, which should not be possible.

The differences are large (~0.3+), so I do not think it is a rounding error. The CPU equivalent works fine for a single particle. After the kernel execution, distanceMask does hold the smallest value, suggesting the problem is with the barrier or the conditional.

Can anyone say what is wrong with the function?


RWStructuredBuffer<uint> distanceMask : register(u4);
RWStructuredBuffer<uint> distanceWriteCounts : register(u0);
RWStructuredBuffer<float3> closestPoints : register(u5);


[numthreads(64,1,1)]
void BruteForceClosestPointOnMesh(uint3 id : SV_DispatchThreadID)
{
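    // One thread per particle-triangle pair: x indexes the particle, y the triangle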
    int particleid = id.x;
    int triangleid = id.y;

    Triangle t = triangles[triangleid];

    float3 v0 = GetVertex1(t.i0);
    float3 v1 = GetVertex1(t.i1);
    float3 v2 = GetVertex1(t.i2);
    float3 q1 = Q1[particleid];

    ClosestPointPointTriangleResult result = ClosestPointPointTriangle(q1, v0, v1, v2);
    float3 p = v0 * result.uvw.x + v1 * result.uvw.y + v2 * result.uvw.z;

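    // Reinterpret the float distance's bit pattern as a uint for use with InterlockedMin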
    uint distance = asuint(length(p - q1));

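    // Atomically keep the smallest distance seen so far for this particle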
    InterlockedMin(distanceMask[particleid], distance);

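    // Group-scope barrier: waits for the threads in this thread group and makes their writes visible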
    AllMemoryBarrierWithGroupSync();

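    // Only threads whose distance survived as the minimum record their closest point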
    if(distance == distanceMask[particleid])
    {
        uint bin = 0;
    InterlockedAdd(distanceWriteCounts[particleid], 1, bin);
        closestPoints[particleid * binsize + bin] = p;
    }
}

 

This may not be the reason, but it will cause problems:

uint distance = asuint(length(p - q1));

The asuint() function stores the float's bit pattern in your uint. However, float bit patterns are complicated, so you cannot simply compare the bit patterns as uints to determine which is the higher number.

It would be safer to simply cast to uint, or maybe multiply by 1000 and then cast to uint for more precision.
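
For example, something like this (a sketch of that suggestion; the scale factor of 1000 is arbitrary and trades range for precision):

// Quantise the distance to thousandths so that InterlockedMin compares
// plain unsigned integers rather than reinterpreted float bits.
uint quantised = (uint)(length(p - q1) * 1000.0f);
InterlockedMin(distanceMask[particleid], quantised);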

 

49 minutes ago, CortexDragon said:

However float bit patterns are complicated, so you cannot simply compare their bit patterns as uints to determine which is a higher number.

Assuming that the numbers are all positive, you actually can: the bit patterns of positive floats are ordered correctly when interpreted as integers. This is often used to allow floating-point data to be sorted with O(N) counting-sort algorithms such as the radix sort.

Interpreting a positive float's bit pattern as an integer, incrementing it by one, and then interpreting the result as a float always gives you the next highest float value that can be represented.
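
For example (this ordering holds for non-negative, non-NaN values):

// For non-negative floats, the IEEE 754 bit pattern grows monotonically with
// the value, so the unsigned comparison agrees with the float comparison.
float a = 0.3f;
float b = 0.7f;
bool sameOrdering = (asuint(a) < asuint(b)) == (a < b); // always true here

// Incrementing the bit pattern steps to the next representable float.
float nextAfterA = asfloat(asuint(a) + 1); // smallest float greater than 0.3f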

Thanks for the replies!

I have just found the problem though (it's me being dumb and not reading what is in front of my eyes!). AllMemoryBarrierWithGroupSync() prevents race conditions by ensuring all threads within the group have reached the same place before passing the barrier. However, I am dispatching many groups, at least one per triangle, so this pattern will not work.

(The buffer declaration would also be wrong even if it did work; it would need the globallycoherent prefix.)
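
That is, something along these lines:

// globallycoherent makes UAV writes visible across all thread groups rather
// than only within one group (it does not, by itself, synchronise the groups).
globallycoherent RWStructuredBuffer<uint> distanceMask : register(u4);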

Why do you need a group barrier at all? Can't you just use the optional 'originalValue' argument of InterlockedMin rather than re-reading the value you just InterlockedMin'ed into?
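
Something like this (a sketch of that suggestion, using the names from the kernel above):

// The optional third argument receives the value the mask held immediately
// before this atomic, so no barrier or re-read of the mask is needed.
uint originalValue;
InterlockedMin(distanceMask[particleid], distance, originalValue);

if (distance <= originalValue) // <= keeps ties, e.g. a vertex shared by triangles
{
    uint bin = 0;
    InterlockedAdd(distanceWriteCounts[particleid], 1, bin);
    closestPoints[particleid * binsize + bin] = p;
}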

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

Hi ajmiles,

Unfortunately that introduces another race condition:

Thread 1 calls InterlockedMin with B and gets A back in originalValue, where B < A.
Thread 1 checks B < A; the check passes.
Thread 2 calls InterlockedMin with C (where C < B) and gets B back in originalValue.
Thread 2 writes Y to closestPoints.
Thread 1 writes X to closestPoints.

closestPoints now contains X, even though it should hold Y, because between Thread 1 making the check and making the write, Thread 2 has written its value.

I went with a simple solution: run the triangle-distance test twice, with one pass generating the mask (as you suggest) and the other checking it and writing the position (sketched below). Efficiency is not important, because this is a reference implementation against which I can debug faster algorithms, which hopefully won't need a global sync to memory (which is itself slow).
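
A minimal sketch of those two passes, reusing the buffers and helpers from the original kernel (the entry point names and the split into two Dispatch calls are mine):

// Pass 1: build the mask only. No barrier is needed; once the whole dispatch
// has completed, InterlockedMin guarantees distanceMask holds the minimum.
// distanceMask is assumed to be reset to 0xFFFFFFFF beforehand.
[numthreads(64,1,1)]
void BuildDistanceMask(uint3 id : SV_DispatchThreadID)
{
    Triangle t = triangles[id.y];
    float3 v0 = GetVertex1(t.i0);
    float3 v1 = GetVertex1(t.i1);
    float3 v2 = GetVertex1(t.i2);
    float3 q1 = Q1[id.x];

    ClosestPointPointTriangleResult result = ClosestPointPointTriangle(q1, v0, v1, v2);
    float3 p = v0 * result.uvw.x + v1 * result.uvw.y + v2 * result.uvw.z;

    InterlockedMin(distanceMask[id.x], asuint(length(p - q1)));
}

// Pass 2, issued as a separate Dispatch after pass 1 has finished: recompute
// the distance and write the position only where it matches the finished mask.
// This assumes the recomputed distance is bit-identical between the passes.
[numthreads(64,1,1)]
void WriteClosestPoint(uint3 id : SV_DispatchThreadID)
{
    Triangle t = triangles[id.y];
    float3 v0 = GetVertex1(t.i0);
    float3 v1 = GetVertex1(t.i1);
    float3 v2 = GetVertex1(t.i2);
    float3 q1 = Q1[id.x];

    ClosestPointPointTriangleResult result = ClosestPointPointTriangle(q1, v0, v1, v2);
    float3 p = v0 * result.uvw.x + v1 * result.uvw.y + v2 * result.uvw.z;

    if (asuint(length(p - q1)) == distanceMask[id.x])
    {
        uint bin = 0;
        InterlockedAdd(distanceWriteCounts[id.x], 1, bin);
        closestPoints[id.x * binsize + bin] = p;
    }
}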

It would be neat to know how to do a global sync within a shader, for interest's sake though...

