Welcome back. This time we'll take a look at one of the most common operations in a 3D graphics engine, frustum culling. How fast can such code become by using SIMD?
The problem
Classify whether a batch of AABBs are completely inside, completely outside or intersecting a frustum (6 planes).
Restrictions
- AABBs are defined as (Center, Extent) pairs.
- All vectors are Vector3f's.
Structs and initialization
#define OUTSIDE 0
#define INSIDE 1
#define INTERSECT 2 // or 3 depending on the algorithm used (see the discussion on the 2nd SSE version)
struct Vector3f
{
float x, y, z;
};
struct AABB
{
Vector3f m_Center;
Vector3f m_Extent;
};
struct Plane
{
float nx, ny, nz, d;
};
All the functions presented in the snippets below have the same signature:
void CullAABBList(AABB* aabbList, unsigned int numAABBs, Plane* frustumPlanes, unsigned int* aabbState);
As you can see, I've taken the state of the AABB (with regard to the specified frustum) outside of the struct itself. This has a couple advantages.
- You can cull the same AABB list with multiple different frustums in parallel without worrying about conflicts. Just pass a different aabbState array to each function call and you are done.
- All relevant AABB data are close to each other, which helps reducing cache misses (since the array is expected to be read sequentially).
Finally, all arrays are assumed to be 16-byte aligned. This is required for the SSE version.
Let's take a look at some code.
C++ (reference implementation)
// Performance (cycles/AABB): Average = 102.5 (stdev = 12.0)
void CullAABBList_C(AABB* aabbList, unsigned int numAABBs, Plane* frustumPlanes, unsigned int* aabbState)
{
for(unsigned int iAABB = 0;iAABB < numAABBs;++iAABB)
{
const Vector3f& aabbCenter = aabbList[iAABB].m_Center;
const Vector3f& aabbSize = aabbList[iAABB].m_Extent;
unsigned int result = INSIDE; // Assume that the aabb will be inside the frustum
for(unsigned int iPlane = 0;iPlane < 6;++iPlane)
{
const Plane& frustumPlane = frustumPlanes[iPlane];
float d = aabbCenter.x * frustumPlane.nx +
aabbCenter.y * frustumPlane.ny +
aabbCenter.z * frustumPlane.nz;
float r = aabbSize.x * fast_fabsf(frustumPlane.nx) +
aabbSize.y * fast_fabsf(frustumPlane.ny) +
aabbSize.z * fast_fabsf(frustumPlane.nz);
float d_p_r = d + r;
float d_m_r = d - r;
if(d_p_r < -frustumPlane.d)
{
result = OUTSIDE;
break;
}
else if(d_m_r < -frustumPlane.d)
result = INTERSECT;
}
aabbState[iAABB] = result;
}
}
The code is based on method 4c from an excellent post by Fabian Giesen on AABB culling:
View frustum culling. If you haven't read it yet, check it out now. He presents several different ways of performing the same test, each of which has its own advantages and disadvantages.
There's nothing special about the above code, except from the fact that it breaks out of the inner loop as soon as the box is classified as completely outside of one of the planes. No need to check the others.
But we can do better. First thing, precalculate the absolute of the plane normals once, outside of the AABB loop. No need to recalculate the same thing over and over. We can also replace the floating point comparisons with bitwise ANDs. Lets take a look at the code.
C++ Optimized
// Performance (cycles/AABB): Average = 84.9 (stdev = 12.3)
void CullAABBList_C_Opt(AABB* __restrict aabbList, unsigned int numAABBs, Plane* __restrict frustumPlanes, unsigned int* __restrict aabbState)
{
Plane absFrustumPlanes[6];
for(unsigned int iPlane = 0;iPlane < 6;++iPlane)
{
absFrustumPlanes[iPlane].nx = fast_fabsf(frustumPlanes[iPlane].nx);
absFrustumPlanes[iPlane].ny = fast_fabsf(frustumPlanes[iPlane].ny);
absFrustumPlanes[iPlane].nz = fast_fabsf(frustumPlanes[iPlane].nz);
}
for(unsigned int iAABB = 0;iAABB < numAABBs;++iAABB)
{
const Vector3f& aabbCenter = aabbList[iAABB].m_Center;
const Vector3f& aabbSize = aabbList[iAABB].m_Extent;
unsigned int result = INSIDE; // Assume that the aabb will be inside the frustum
for(unsigned int iPlane = 0;iPlane < 6;++iPlane)
{
const Plane& frustumPlane = frustumPlanes[iPlane];
const Plane& absFrustumPlane = absFrustumPlanes[iPlane];
float d = aabbCenter.x * frustumPlane.nx +
aabbCenter.y * frustumPlane.ny +
aabbCenter.z * frustumPlane.nz;
float r = aabbSize.x * absFrustumPlane.nx +
aabbSize.y * absFrustumPlane.ny +
aabbSize.z * absFrustumPlane.nz;
float d_p_r = d + r + frustumPlane.d;
if(IsNegativeFloat(d_p_r))
{
result = OUTSIDE;
break;
}
float d_m_r = d - r + frustumPlane.d;
if(IsNegativeFloat(d_m_r))
result = INTERSECT;
}
aabbState[iAABB] = result;
}
}
fast_fabsf() and
IsNegativeFloat() are simple functions which treat the float as int and remove/check the MSB.
I don't have any other C-level optimizations in mind. So let's see what we can do using SSE.
SSE (1 AABB at a time)
// 2013-09-10: Moved outside of the function body. Check comments by @Matias Goldberg for details.
__declspec(align(16)) static const unsigned int absPlaneMask[4] = {0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0xFFFFFFFF};
// Performance (cycles/AABB): Average = 63.9 (stdev = 10.8)
void CullAABBList_SSE_1(AABB* aabbList, unsigned int numAABBs, Plane* frustumPlanes, unsigned int* aabbState)
{
__declspec(align(16)) Plane absFrustumPlanes[6];
__m128 xmm_absPlaneMask = _mm_load_ps((float*)&absPlaneMask[0]);
for(unsigned int iPlane = 0;iPlane < 6;++iPlane)
{
__m128 xmm_frustumPlane = _mm_load_ps(&frustumPlanes[iPlane].nx);
__m128 xmm_absFrustumPlane = _mm_and_ps(xmm_frustumPlane, xmm_absPlaneMask);
_mm_store_ps(&absFrustumPlanes[iPlane].nx, xmm_absFrustumPlane);
}
for(unsigned int iAABB = 0;iAABB < numAABBs;++iAABB)
{
__m128 xmm_aabbCenter_x = _mm_load_ss(&aabbList[iAABB].m_Center.x);
__m128 xmm_aabbCenter_y = _mm_load_ss(&aabbList[iAABB].m_Center.y);
__m128 xmm_aabbCenter_z = _mm_load_ss(&aabbList[iAABB].m_Center.z);
__m128 xmm_aabbExtent_x = _mm_load_ss(&aabbList[iAABB].m_Extent.x);
__m128 xmm_aabbExtent_y = _mm_load_ss(&aabbList[iAABB].m_Extent.y);
__m128 xmm_aabbExtent_z = _mm_load_ss(&aabbList[iAABB].m_Extent.z);
unsigned int result = INSIDE; // Assume that the aabb will be inside the frustum
for(unsigned int iPlane = 0;iPlane < 6;++iPlane)
{
__m128 xmm_frustumPlane_Component = _mm_load_ss(&frustumPlanes[iPlane].nx);
__m128 xmm_d = _mm_mul_ss(xmm_aabbCenter_x, xmm_frustumPlane_Component);
xmm_frustumPlane_Component = _mm_load_ss(&frustumPlanes[iPlane].ny);
xmm_d = _mm_add_ss(xmm_d, _mm_mul_ss(xmm_aabbCenter_y, xmm_frustumPlane_Component));
xmm_frustumPlane_Component = _mm_load_ss(&frustumPlanes[iPlane].nz);
xmm_d = _mm_add_ss(xmm_d, _mm_mul_ss(xmm_aabbCenter_z, xmm_frustumPlane_Component));
__m128 xmm_absFrustumPlane_Component = _mm_load_ss(&absFrustumPlanes[iPlane].nx);
__m128 xmm_r = _mm_mul_ss(xmm_aabbExtent_x, xmm_absFrustumPlane_Component);
xmm_absFrustumPlane_Component = _mm_load_ss(&absFrustumPlanes[iPlane].ny);
xmm_r = _mm_add_ss(xmm_r, _mm_mul_ss(xmm_aabbExtent_y, xmm_absFrustumPlane_Component));
xmm_absFrustumPlane_Component = _mm_load_ss(&absFrustumPlanes[iPlane].nz);
xmm_r = _mm_add_ss(xmm_r, _mm_mul_ss(xmm_aabbExtent_z, xmm_absFrustumPlane_Component));
__m128 xmm_frustumPlane_d = _mm_load_ss(&frustumPlanes[iPlane].d);
__m128 xmm_d_p_r = _mm_add_ss(_mm_add_ss(xmm_d, xmm_r), xmm_frustumPlane_d);
__m128 xmm_d_m_r = _mm_add_ss(_mm_sub_ss(xmm_d, xmm_r), xmm_frustumPlane_d);
// Shuffle d_p_r and d_m_r in order to perform only one _mm_movmask_ps
__m128 xmm_d_p_r__d_m_r = _mm_shuffle_ps(xmm_d_p_r, xmm_d_m_r, _MM_SHUFFLE(0, 0, 0, 0));
int negativeMask = _mm_movemask_ps(xmm_d_p_r__d_m_r);
// Bit 0 holds the sign of d + r and bit 2 holds the sign of d - r
if(negativeMask & 0x01)
{
result = OUTSIDE;
break;
}
else if(negativeMask & 0x04)
result = INTERSECT;
}
aabbState[iAABB] = result;
}
}
Again, since we are processing one AABB at a time, we can break out of the inner loop as soon as the AABB is found to be completely outside one of the 6 planes.
The SSE code should be straightforward. All arithmetic operations are scalar and we use
_mm_movemask_ps in order to extract the signs of (d + r) and (d - r). Depending on the signs, we classify the AABB as completely outside or intersecting the frustum. Even though we are testing one AABB at a time, the code is faster than the optimized C++ implementation.
The "problem" with the above snippet is that all SSE operations are scalar. We can do better by swizzling the data from 4 AABBs in such a way that it will be possible to calculate 4 (d + r) and 4 (d - r) simultaneously. Let's try that.
SSE (4 AABBs at a time)
// 2013-09-10: Moved outside of the function body. Check comments by @Matias Goldberg for details.
__declspec(align(16)) static const unsigned int absPlaneMask[4] = {0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0xFFFFFFFF};
// Performance (cycles/AABB): Average = 24.1 (stdev = 4.2)
void CullAABBList_SSE_4(AABB* aabbList, unsigned int numAABBs, Plane* frustumPlanes, unsigned int* aabbState)
{
__declspec(align(16)) Plane absFrustumPlanes[6];
__m128 xmm_absPlaneMask = _mm_load_ps((float*)&absPlaneMask[0]);
for(unsigned int iPlane = 0;iPlane < 6;++iPlane)
{
__m128 xmm_frustumPlane = _mm_load_ps(&frustumPlanes[iPlane].nx);
__m128 xmm_absFrustumPlane = _mm_and_ps(xmm_frustumPlane, xmm_absPlaneMask);
_mm_store_ps(&absFrustumPlanes[iPlane].nx, xmm_absFrustumPlane);
}
// Process 4 AABBs in each iteration...
unsigned int numIterations = numAABBs >> 2;
for(unsigned int iIter = 0;iIter < numIterations;++iIter)
{
// NOTE: Since the aabbList is 16-byte aligned, we can use aligned moves.
// Load the 4 Center/Extents pairs for the 4 AABBs.
__m128 xmm_cx0_cy0_cz0_ex0 = _mm_load_ps(&aabbList[(iIter << 2) + 0].m_Center.x);
__m128 xmm_ey0_ez0_cx1_cy1 = _mm_load_ps(&aabbList[(iIter << 2) + 0].m_Extent.y);
__m128 xmm_cz1_ex1_ey1_ez1 = _mm_load_ps(&aabbList[(iIter << 2) + 1].m_Center.z);
__m128 xmm_cx2_cy2_cz2_ex2 = _mm_load_ps(&aabbList[(iIter << 2) + 2].m_Center.x);
__m128 xmm_ey2_ez2_cx3_cy3 = _mm_load_ps(&aabbList[(iIter << 2) + 2].m_Extent.y);
__m128 xmm_cz3_ex3_ey3_ez3 = _mm_load_ps(&aabbList[(iIter << 2) + 3].m_Center.z);
// Shuffle the data in order to get all Xs, Ys, etc. in the same register.
__m128 xmm_cx0_cy0_cx1_cy1 = _mm_shuffle_ps(xmm_cx0_cy0_cz0_ex0, xmm_ey0_ez0_cx1_cy1, _MM_SHUFFLE(3, 2, 1, 0));
__m128 xmm_cx2_cy2_cx3_cy3 = _mm_shuffle_ps(xmm_cx2_cy2_cz2_ex2, xmm_ey2_ez2_cx3_cy3, _MM_SHUFFLE(3, 2, 1, 0));
__m128 xmm_aabbCenter0123_x = _mm_shuffle_ps(xmm_cx0_cy0_cx1_cy1, xmm_cx2_cy2_cx3_cy3, _MM_SHUFFLE(2, 0, 2, 0));
__m128 xmm_aabbCenter0123_y = _mm_shuffle_ps(xmm_cx0_cy0_cx1_cy1, xmm_cx2_cy2_cx3_cy3, _MM_SHUFFLE(3, 1, 3, 1));
__m128 xmm_cz0_ex0_cz1_ex1 = _mm_shuffle_ps(xmm_cx0_cy0_cz0_ex0, xmm_cz1_ex1_ey1_ez1, _MM_SHUFFLE(1, 0, 3, 2));
__m128 xmm_cz2_ex2_cz3_ex3 = _mm_shuffle_ps(xmm_cx2_cy2_cz2_ex2, xmm_cz3_ex3_ey3_ez3, _MM_SHUFFLE(1, 0, 3, 2));
__m128 xmm_aabbCenter0123_z = _mm_shuffle_ps(xmm_cz0_ex0_cz1_ex1, xmm_cz2_ex2_cz3_ex3, _MM_SHUFFLE(2, 0, 2, 0));
__m128 xmm_aabbExtent0123_x = _mm_shuffle_ps(xmm_cz0_ex0_cz1_ex1, xmm_cz2_ex2_cz3_ex3, _MM_SHUFFLE(3, 1, 3, 1));
__m128 xmm_ey0_ez0_ey1_ez1 = _mm_shuffle_ps(xmm_ey0_ez0_cx1_cy1, xmm_cz1_ex1_ey1_ez1, _MM_SHUFFLE(3, 2, 1, 0));
__m128 xmm_ey2_ez2_ey3_ez3 = _mm_shuffle_ps(xmm_ey2_ez2_cx3_cy3, xmm_cz3_ex3_ey3_ez3, _MM_SHUFFLE(3, 2, 1, 0));
__m128 xmm_aabbExtent0123_y = _mm_shuffle_ps(xmm_ey0_ez0_ey1_ez1, xmm_ey2_ez2_ey3_ez3, _MM_SHUFFLE(2, 0, 2, 0));
__m128 xmm_aabbExtent0123_z = _mm_shuffle_ps(xmm_ey0_ez0_ey1_ez1, xmm_ey2_ez2_ey3_ez3, _MM_SHUFFLE(3, 1, 3, 1));
unsigned int in_out_flag = 0x0F; // = 01111b Assume that all 4 boxes are inside the frustum.
unsigned int intersect_flag = 0x00; // = 00000b if intersect_flag == 1 then this box intersects the frustum.
for(unsigned int iPlane = 0;iPlane < 6;++iPlane)
{
// Calculate d...
__m128 xmm_frustumPlane_Component = _mm_load_ps1(&frustumPlanes[iPlane].nx);
__m128 xmm_d = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbCenter0123_x);
xmm_frustumPlane_Component = _mm_load_ps1(&frustumPlanes[iPlane].ny);
xmm_frustumPlane_Component = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbCenter0123_y);
xmm_d = _mm_add_ps(xmm_d, xmm_frustumPlane_Component);
xmm_frustumPlane_Component = _mm_load_ps1(&frustumPlanes[iPlane].nz);
xmm_frustumPlane_Component = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbCenter0123_z);
xmm_d = _mm_add_ps(xmm_d, xmm_frustumPlane_Component);
// Calculate r...
xmm_frustumPlane_Component = _mm_load_ps1(&absFrustumPlanes[iPlane].nx);
__m128 xmm_r = _mm_mul_ps(xmm_aabbExtent0123_x, xmm_frustumPlane_Component);
xmm_frustumPlane_Component = _mm_load_ps1(&absFrustumPlanes[iPlane].ny);
xmm_frustumPlane_Component = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbExtent0123_y);
xmm_r = _mm_add_ps(xmm_r, xmm_frustumPlane_Component);
xmm_frustumPlane_Component = _mm_load_ps1(&absFrustumPlanes[iPlane].nz);
xmm_frustumPlane_Component = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbExtent0123_z);
xmm_r = _mm_add_ps(xmm_r, xmm_frustumPlane_Component);
// Calculate d + r + frustumPlane.d
__m128 xmm_d_p_r = _mm_add_ps(xmm_d, xmm_r);
xmm_frustumPlane_Component = _mm_load_ps1(&frustumPlanes[iPlane].d);
xmm_d_p_r = _mm_add_ps(xmm_d_p_r, xmm_frustumPlane_Component);
// Check which boxes are outside this plane (if any)...
// NOTE: At this point whichever component of the xmm_d_p_r reg is negative, the corresponding
// box is outside the frustum.
unsigned int in_out_flag_curPlane = _mm_movemask_ps(xmm_d_p_r);
in_out_flag &= ~in_out_flag_curPlane; // NOTed the mask because it's 1 for each box which is outside the frustum, and in_out_flag holds the opposite.
// If all boxes have been marked as outside the frustum, stop checking the rest of the planes.
if(!in_out_flag)
break;
// Calculate d - r + frustumPlane.d
__m128 xmm_d_m_r = _mm_sub_ps(xmm_d, xmm_r);
xmm_d_m_r = _mm_add_ps(xmm_d_m_r, xmm_frustumPlane_Component);
// Check which boxes intersect the frustum...
unsigned int intersect_flag_curPlane = _mm_movemask_ps(xmm_d_m_r);
intersect_flag |= intersect_flag_curPlane;
}
// Calculate the state of the AABB from the 2 flags.
// If the i-th bit from in_out_flag is 0, then the result will be 0 independent of the value of intersect_flag
// If the i-th bit from in_out_flag is 1, then the result will be either 1 or 2 depending on the intersect_flag.
aabbState[(iIter << 2) + 0] = ((in_out_flag & 0x00000001) >> 0) << ((intersect_flag & 0x00000001) >> 0);
aabbState[(iIter << 2) + 1] = ((in_out_flag & 0x00000002) >> 1) << ((intersect_flag & 0x00000002) >> 1);
aabbState[(iIter << 2) + 2] = ((in_out_flag & 0x00000004) >> 2) << ((intersect_flag & 0x00000004) >> 2);
aabbState[(iIter << 2) + 3] = ((in_out_flag & 0x00000008) >> 3) << ((intersect_flag & 0x00000008) >> 3);
}
// Process the rest of the AABBs one by one...
for(unsigned int iAABB = numIterations << 2; iAABB < numAABBs;++iAABB)
{
// NOTE: This loop is identical to the CullAABBList_SSE_1() loop. Not shown in order to keep this snippet small.
}
}
Compared to the original (unoptimized) C++ version, it's 4x faster. But that's unfair. Compared to the scalar SSE version it's about 2.5x faster. Not bad, don't you think? Can it get any better? Probably yes. The problem with the 2 SSE versions is the lack of enough XMM registers to keep all the AABB data in them and avoid using the stack for storing intermediate results. Unfortunately, we need 6 registers for the 4 Center/Extent pairs, and the rest (2) aren't enough for the inner loop. Compiling under 64-bits should give better results because of that.
Another way to optimize this algorithm further is to have the data already laid out the way the code expects them (in SoA form). This will get rid of the shuffles (the memory loads will still be 6).
Finally, the code used to calculate the state of each AABB from the two bitfields can change. One other way of doing it is the following:
intersect_flag &= in_out_flag;
intersect_flag <<= 1;
aabbState = ((in_out_flag & (1 << i)) | (intersect_flag & (1 << (i + 1)))) >> i
This way, we can get rid of the variable shifts (all shifts and ANDs are compile-time constants). The difference from the previous version is that the intersection state is 3 instead of 2. 2 isn't a valid state, that's why we AND the
intersection_flag with the
in_out_flag. 2 means that the box is outside
and intersecting the frustum, which is an invalid case.
Unfortunately, in practice this doesn't make a big difference in performance. So, it's a matter of taste.
Results
Below is a table with more performance data from all the above snippets. Two cases have been tested. For both of them, the frustum is the [0, 1]^3 box. The first one is a random case, where all the boxes are randomly placed in the [-1.0, 2.0] box, and have random sizes in [0.1, 0.2] range. The other case is the worst case, where all boxes are inside the frustum.
MethodRandom (32)Worst (32)Random (1024) *Worst (1024)
C++ (ref)105.2 (14.0)181.5 (21.0)102.5 (12.0)159.3 (20.3)
C++ (opt)96.1 (10.3)138.0 (17.2)84.9 (12.3)119.8 (16.3)
SSE_173.7 (9.8)93.2 (13.8)63.9 (10.8)72.8 (9.8)
SSE_426.5 (4.0)27.6 (4.3)24.1 (4.2)22.9 (4.0)
Table: Performance comparison between the 4 snippets. 2 batch sizes, 32 and 1024 AABBs. The column marked with * contains the data shown in the post.
That's all folks. Thanks for reading. Any corrections/suggestions are welcome.
Changes
2013-09-06: Changed C to C++ because the code uses references. Thanks to @zdlr for pointing that out.
2013-09-10: Removed underscore from structure names (comment by @NightCreature83). Also moved definition of absFrustumPlaneMask outside of the two SSE functions (see comments by @Matias Goldberg for details).
Thanks for the interesting read!