I'd like to say a few things, since I've recently written a voxelizer in compute and done a fair amount of research:
1. GPU rasterization to voxelize high-poly meshes is a bad idea. GPUs are already bad at rasterizing tiny triangles, and this gets further aggravated by the fact that the approach requires interlocked operations, while the high density of vertices means there is a lot of contention from multiple threads trying to write into the same voxel block.
Some papers mention that their implementations show a drastic time increase with poly count due to contention. Triangle density per voxel also plays a big role: a mesh where each voxel touches one or two triangles behaves very differently from a mesh where 600 triangles go through a single voxel.
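To make the contention concrete, here is a minimal sketch of the kind of interlocked write a rasterization-based voxelizer performs in its fragment shader (the names and the simple per-voxel counter payload are illustrative assumptions, not from any specific paper):

// Fragment shader of a rasterization-based voxelizer (sketch).
// Every tiny triangle covering the same voxel serializes on this
// atomic, which is where the contention comes from.
layout( r32ui, binding = 0 ) uniform uimage3D voxelGrid;

in vec3 vsVoxelPos; // Voxel-space position, assumed interpolated from the vertex shader.

void main()
{
    ivec3 voxel = ivec3( vsVoxelPos );
    imageAtomicAdd( voxelGrid, voxel, 1u ); // e.g. count coverage per voxel.
}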
Another problem, which all but a few papers fail to mention (probably out of ignorance), is that unless the voxelization process is very simple, you need to blend your results; and there is no "interlocked average" instruction.
Therefore implementations perform mutex-like locking of a voxel. This is a problem because such approaches can result in an infinite loop: pre-Volta warps execute in lockstep, so if half a warp acquires the lock while the other half (or another warp) keeps spinning for it, the spinning threads can prevent the lock holders from ever reaching the release, and they fight forever over the lock.
Implementations that fail to account for this will result in a TDR (Timeout Detection and Recovery, i.e. the OS resets the GPU driver), which is not immediately obvious unless you're working with high-poly meshes, which is where contention happens and the infinite-loop cases appear.
Implementations that successfully account for this add a "bail out" counter: if acquiring the mutex takes more than N spins, give up. This means the voxelization process may not be accurate, and worse, it may not even be deterministic. But at least a TDR won't happen.
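As a sketch, such a guarded lock could look like the following GLSL (voxelLocks, MAX_SPINS and blendIntoVoxel are illustrative names, not from any particular implementation):

// One uint lock per voxel; 0 = free, 1 = taken.
layout( std430, binding = 0 ) buffer VoxelLocks { uint voxelLocks[]; };

const uint MAX_SPINS = 1024u; // Bail out so a lost fight can't trigger a TDR.

bool tryBlendVoxel( uint voxelIdx, vec4 newValue )
{
    for( uint i = 0u; i < MAX_SPINS; ++i )
    {
        // Try to acquire: swap 0 -> 1. atomicCompSwap returns the previous value.
        if( atomicCompSwap( voxelLocks[voxelIdx], 0u, 1u ) == 0u )
        {
            blendIntoVoxel( voxelIdx, newValue ); // Hypothetical read-modify-write, e.g. a weighted average.
            memoryBarrierBuffer();
            atomicExchange( voxelLocks[voxelIdx], 0u ); // Release.
            return true;
        }
    }
    return false; // Gave up after MAX_SPINS attempts; the result is now inaccurate.
}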
You could, however, append those failure cases to a list and process them serially at the end.
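A sketch of that append, assuming a simple counter-plus-array buffer (names are illustrative):

layout( std430, binding = 1 ) buffer FailedList
{
    uint failedCount;
    uint failedVoxels[];
};

// Record a voxel whose lock acquisition bailed out, so a later
// serial (or low-contention) pass can redo the blend correctly.
void appendFailure( uint voxelIdx )
{
    uint slot = atomicAdd( failedCount, 1u );
    failedVoxels[slot] = voxelIdx;
}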
The only way to properly implement this is to rely on Independent Thread Scheduling, introduced with Volta, which (at the time of writing) is only supported by NVIDIA GPUs.
This problem may not apply to you, though: if you don't need a complex per-voxel average/mutex and a simple interlocked operation (like an atomic addition) is enough, ignore this drawback.
You can also avoid the "atomic blend" problem entirely if your 3D texture is in float format and you track the accumulated weights in a second 3D texture, but this consumes a ton of memory. The "atomic blend" problem only exists because of memory restrictions in the first place, i.e. because we want to blend into an RGBA8 texture or similar precision.
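If you go that route, note that core GLSL has no float atomic add on images or buffers (it exists only through vendor extensions such as GL_NV_shader_atomic_float), but it can be emulated lock-free with a compare-and-swap loop over the raw bits. A minimal sketch on a float SSBO viewed as uint bits (doing the same on a 3D texture requires binding it with an integer image format; the buffer name is illustrative):

// Emulated float atomic add: CAS on the bit pattern of a float.
layout( std430, binding = 2 ) buffer Accum { uint accumBits[]; };

void atomicAddFloat( uint idx, float value )
{
    uint expected;
    uint oldBits = accumBits[idx];
    do
    {
        expected = oldBits;
        float newVal = uintBitsToFloat( expected ) + value;
        oldBits = atomicCompSwap( accumBits[idx], expected, floatBitsToUint( newVal ) );
    } while( oldBits != expected );
}

Unlike the mutex, this retry loop is lock-free (the lane whose CAS succeeds has already done its work), so it cannot deadlock pre-Volta; a final pass then divides accumulated colour by accumulated weight.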
2. That leaves the opposite approach: Have each thread perform a box vs triangle test against all primitives.
A brute-force approach like that is super slow even for a GPU, much worse than doing GPU rasterization.
However it can be greatly improved with hierarchical culling: partition the mesh into smaller submeshes, calculate an AABB for each, and then skip all of a submesh's triangles at once when its AABB fails an AABB vs AABB test against the region of voxels being processed.
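For reference, the rejection test itself is trivial; a minimal GLSL sketch (the Aabb struct is an assumption for illustration):

struct Aabb
{
    vec3 minBounds;
    vec3 maxBounds;
};

// True if the boxes overlap. A submesh whose AABB fails this test
// against the voxel region lets you skip all of its triangles at once.
bool aabbVsAabb( Aabb a, Aabb b )
{
    return all( lessThanEqual( a.minBounds, b.maxBounds ) ) &&
           all( lessThanEqual( b.minBounds, a.maxBounds ) );
}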
The compute approach can be further improved by having each thread in a warp load a different triangle and using anyInvocationARB to test whether any of the 64 triangles intersects the AABB that encloses all the voxels processed by the warp.
If you're lost about this, I explain this optimization in a Stack Overflow reply.
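As a sketch, the warp-cooperative test could look like this, assuming GL_ARB_shader_group_vote, a local workgroup size of 64 mapping to one wavefront, and hypothetical helpers loadTriangle/triangleVsAabb plus the Aabb struct from above:

#extension GL_ARB_shader_group_vote : require

// Each of the 64 threads loads a different triangle and tests it against
// the AABB enclosing every voxel the warp processes. The expensive
// per-voxel work only runs when at least one lane saw an intersection.
bool anyTriangleTouchesWarp( uint triBase, uint numTris, Aabb warpAabb )
{
    bool hit = false;
    uint triIdx = triBase + gl_LocalInvocationIndex; // One triangle per lane.
    if( triIdx < numTris )
        hit = triangleVsAabb( loadTriangle( triIdx ), warpAabb );
    return anyInvocationARB( hit ); // True for all lanes if any lane hit.
}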
While the theoretical performance improvement is up to 64x, in practice this optimization has yielded gains anywhere between 3x and 32x depending on the scene involved (often between 3x and 4x).
This is what I ended up implementing for Ogre 2.2; you're welcome to try our Test_Voxelizer.exe sample (build Ogre 2.2 using the Quick Start script). Find a way to load your mesh data as an Ogre mesh, modify the sample to load that mesh of yours, and time how long it takes. That way you can easily test whether this approach is worth pursuing or not.
If it's not, then go back to the thinktank for something else.
Note that you should test different values of indexCountSplit in 'mVoxelizer->addItem( *itor++, false, indexCountSplit );', as that value controls how big each partition is, and this can have a huge impact on voxelization performance. There is no "right" global value: the best value depends on how your mesh's vertex data is laid out in memory and how much space each partition ends up covering.
Good luck
Cheers