
Vulkan TLAS build causing device lost

Started by blameth, November 09, 2024 08:46 PM
4 comments, last by blameth 4 days, 10 hours ago

Hi everyone,

I'm working on building a Vulkan TLAS (top-level acceleration structure). Since I added the copy commands that fill the instance buffer, my application fails with VkResult -4 (VK_ERROR_DEVICE_LOST) after vkCmdBuildAccelerationStructuresKHR is recorded and submitted, and the validation layer reports this error:

validation layer: Validation Error: [ VUID-vkDestroyFence-fence-01120 ] Object 0: handle = 0xb8de340000002988, type = VK_OBJECT_TYPE_FENCE; | MessageID = 0x5d296248 | vkDestroyFence(): fence (VkFence 0xb8de340000002988[]) is in use. The Vulkan spec states: All queue submission commands that refer to fence must have completed execution (https://vulkan.lunarg.com/doc/view/1.3.275.0/windows/1.3-extensions/vkspec.html#VUID-vkDestroyFence-fence-01120)

The fence error is a consequence of the program hanging there because something in the TLAS setup is wrong, but I'm struggling to work out what exactly. I followed the Vulkan basic example on GitHub closely and can't find any difference between theirs and mine that would explain a crash like this.

Here's the part of the code where I copy the instances into the instance buffer; it looks correct to me. Full code: https://pastebin.com/TCpEKp3D

// Device-local buffer the TLAS build reads instance data from
auto instancesBuffer = new Buffer(V::CreateBuffer(
    sizeof(VkAccelerationStructureInstanceKHR),
    VK_BUFFER_USAGE_ACCELERATION_STRUCTURE_STORAGE_BIT_KHR |
        VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT |
        VK_BUFFER_USAGE_ACCELERATION_STRUCTURE_BUILD_INPUT_READ_ONLY_BIT_KHR |
        VK_BUFFER_USAGE_TRANSFER_DST_BIT,
    VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT,
    VMA_MEMORY_USAGE_AUTO_PREFER_DEVICE));

std::vector<VkAccelerationStructureInstanceKHR> instances;
for (size_t i = 0; i < 1; ++i) {
    AS& blas = allBlas[i];

    VkAccelerationStructureInstanceKHR instance = {};
    ...
    instance.accelerationStructureReference = blas.deviceAddress;
    instances.push_back(instance);
}

// Host-visible staging buffer holding the instance records
auto stagingBuffer = new Buffer(V::CreateBuffer(
    context.allocator,
    sizeof(VkAccelerationStructureInstanceKHR) * instances.size(),
    VK_BUFFER_USAGE_TRANSFER_SRC_BIT,
    VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT,
    VMA_MEMORY_USAGE_AUTO_PREFER_HOST));

void* mappedData;
vmaMapMemory(context.allocator.allocator, stagingBuffer->allocation, &mappedData);
memcpy(mappedData, instances.data(), sizeof(VkAccelerationStructureInstanceKHR) * instances.size());
vmaUnmapMemory(context.allocator.allocator, stagingBuffer->allocation);

// Copy data from the CPU staging buffer to the GPU instance buffer
VkBufferCopy copyRegion = {};
copyRegion.size = sizeof(VkAccelerationStructureInstanceKHR) * instances.size();
vkCmdCopyBuffer(cmdBuff, stagingBuffer->buffer, instancesBuffer->buffer, 1, &copyRegion);

// Make the transfer write visible to the acceleration structure build stage
VkBufferMemoryBarrier bufferBarrier{ VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER };
bufferBarrier.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
bufferBarrier.dstAccessMask = VK_ACCESS_ACCELERATION_STRUCTURE_WRITE_BIT_KHR | VK_ACCESS_SHADER_READ_BIT;
bufferBarrier.buffer = instancesBuffer->buffer;
bufferBarrier.size = VK_WHOLE_SIZE;
bufferBarrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
bufferBarrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;

vkCmdPipelineBarrier(cmdBuff,
    VK_PIPELINE_STAGE_TRANSFER_BIT | VK_PIPELINE_STAGE_ACCELERATION_STRUCTURE_BUILD_BIT_KHR,
    VK_PIPELINE_STAGE_ACCELERATION_STRUCTURE_BUILD_BIT_KHR,
    0,
    0, nullptr,          // memory barriers
    1, &bufferBarrier,   // buffer memory barriers
    0, nullptr);         // image memory barriers

EndAndSubmitCommandBuffer(context, cmdBuff);
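One detail that may be worth checking in this snippet: `instancesBuffer` is created with `sizeof(VkAccelerationStructureInstanceKHR)`, i.e. room for exactly one instance, while the copy size is `sizeof(VkAccelerationStructureInstanceKHR) * instances.size()`. That would be consistent with the later report that one instance builds but more do not, though the snippet alone can't confirm it. The sketch below is illustrative only: `InstanceLayout` is a hypothetical mirror of the field layout of `VkAccelerationStructureInstanceKHR` (so it compiles without the Vulkan headers), and `instanceBufferBytes` / `copyFits` are hypothetical helpers showing the size arithmetic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of VkAccelerationStructureInstanceKHR's field layout
// (3x4 float transform, two packed 24+8-bit words, 64-bit BLAS address),
// defined here only so the sketch compiles without the Vulkan headers.
struct InstanceLayout {
    float transform[3][4];
    uint32_t instanceCustomIndex : 24;
    uint32_t mask : 8;
    uint32_t instanceShaderBindingTableRecordOffset : 24;
    uint32_t flags : 8;
    uint64_t accelerationStructureReference;
};
static_assert(sizeof(InstanceLayout) == 64, "instance records are 64 bytes each");

// Bytes needed to hold `count` instance records back to back.
constexpr std::size_t instanceBufferBytes(std::size_t count) {
    return sizeof(InstanceLayout) * count;
}

// A vkCmdCopyBuffer region is only valid if it lies inside the destination
// buffer: dstOffset + size <= destination buffer size.
constexpr bool copyFits(std::size_t dstBufferSize, std::size_t dstOffset,
                        std::size_t copySize) {
    return dstOffset + copySize <= dstBufferSize;
}
```

With the destination sized by `instanceBufferBytes(1)`, copying two or more instances fails `copyFits`. Creating `instancesBuffer` after `instances` is filled, with `sizeof(VkAccelerationStructureInstanceKHR) * instances.size()` bytes, would keep the buffer size and the copy size in agreement.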

The error occurs here, once I end and submit the build command buffer:

VkCommandBuffer buildCmd = AllocateCommandBuffer(context, m_renderCommandPools[V::currentFrame].handle);
BeginCommandBuffer(buildCmd);
// Last argument is ppBuildRangeInfos: one pointer per build info,
// each pointing at that build's array of range infos
vkCmdBuildAccelerationStructuresKHR(
    buildCmd,
    1,
    &accelerationBuildGeometryInfo,
    accelerationBuildStructureRangeInfos.data());

EndAndSubmitCommandBuffer(context, buildCmd);

Here is the Aftermath report, which I do not understand:

[Image: Aftermath report]

blameth said:
vkDestroyFence(): fence (VkFence 0xb8de340000002988[]) is in use.

The error message lets you track down the fence object, so you can figure out where it is destroyed while it's still in use.

I don't see fences in your posted code, so probably the mistake is elsewhere.

Maybe you forgot to wait on the fence, and building the AS now takes so long that the issue causes a crash which went undetected before.
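Tracking the fence down from the handle printed in the message (`0xb8de340000002988`) can be done with a small debug-only lookup table that records where each handle was created. This is a hypothetical sketch: `HandleRegistry`, `recordCreation`, `recordDestruction` and `whereCreated` are illustrative names, not Vulkan API, and in real code the key would be the `VkFence` handle converted to a `uint64_t`. Alternatively, naming objects through `VK_EXT_debug_utils` (`vkSetDebugUtilsObjectNameEXT`) makes the validation layer print the name next to the handle.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical debug-only registry: maps a raw 64-bit handle value (as printed
// by the validation layer) to a description of where it was created.
class HandleRegistry {
public:
    void recordCreation(uint64_t handle, std::string site) {
        sites_[handle] = std::move(site);
    }

    // Handle values may be reused after destruction, so forget them here.
    void recordDestruction(uint64_t handle) { sites_.erase(handle); }

    // Returns the recorded creation site, or an empty string if unknown.
    std::string whereCreated(uint64_t handle) const {
        auto it = sites_.find(handle);
        return it == sites_.end() ? std::string{} : it->second;
    }

private:
    std::unordered_map<uint64_t, std::string> sites_;
};
```

Calling `recordCreation` right after each `CreateFence` and `recordDestruction` right before each `vkDestroyFence` would let the handle from a validation message be mapped back to the call site.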


@JoeJ Thanks for the reply! The fence does get waited on, but the maximum wait time is reached and execution moves on. I'm unsure why building the TLAS takes so long; something must be wrong with the TLAS setup for the device to be lost. The fence is created and waited on in my end-and-submit function, which is part of the full code:

inline void EndAndSubmitCommandBuffer(const VulkanContext& context, VkCommandBuffer cmd)
{
    V_VK_CHECK(vkEndCommandBuffer(cmd), "Failed to end command buffer");

    // A new fence per submission, waited on below before returning
    Fence complete = CreateFence(context.device);

    // Only needed if CreateFence creates the fence in the signaled state
    vkResetFences(context.device, 1, &complete.handle);

    VkSubmitInfo submit{ VK_STRUCTURE_TYPE_SUBMIT_INFO };
    submit.commandBufferCount = 1;
    submit.pCommandBuffers = &cmd;

    V_VK_CHECK(vkQueueSubmit(context.graphicsQueue, 1, &submit, complete.handle), "Failed to submit command buffer");

    // Block until the fence signals; the timeout is UINT64_MAX nanoseconds
    VkResult res = vkWaitForFences(context.device, 1, &complete.handle, VK_TRUE, UINT64_MAX);
    std::cout << "Result of Fences: " << res << std::endl;
}
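The `std::cout` line prints the raw `VkResult` integer. Since the wait uses `UINT64_MAX` nanoseconds (roughly 584 years), `VK_TIMEOUT` (2) is not expected in practice; if the printed value is -4, the wait ended because the device was lost (`VK_ERROR_DEVICE_LOST`, the value quoted in the question) rather than because it timed out. The helper below is a hypothetical sketch that decodes the codes `vkWaitForFences` can return; its enum mirrors the numeric values from `vulkan_core.h` so it compiles standalone. In a real project, `string_VkResult` from `vulkan/vk_enum_string_helper.h` covers every code.

```cpp
#include <cassert>
#include <string>

// Hypothetical mirror of the VkResult values vkWaitForFences can return,
// copied from vulkan_core.h so this sketch compiles without the Vulkan headers.
enum ResultCode : int {
    kSuccess = 0,                  // VK_SUCCESS
    kTimeout = 2,                  // VK_TIMEOUT
    kErrorOutOfHostMemory = -1,    // VK_ERROR_OUT_OF_HOST_MEMORY
    kErrorOutOfDeviceMemory = -2,  // VK_ERROR_OUT_OF_DEVICE_MEMORY
    kErrorDeviceLost = -4,         // VK_ERROR_DEVICE_LOST
};

// Decodes the raw number printed after "Result of Fences: ".
std::string describeFenceWait(int result) {
    switch (result) {
        case kSuccess: return "VK_SUCCESS: fence signaled, the work completed";
        case kTimeout: return "VK_TIMEOUT: fence not signaled before the timeout";
        case kErrorOutOfHostMemory: return "VK_ERROR_OUT_OF_HOST_MEMORY";
        case kErrorOutOfDeviceMemory: return "VK_ERROR_OUT_OF_DEVICE_MEMORY";
        case kErrorDeviceLost: return "VK_ERROR_DEVICE_LOST: device lost (e.g. GPU fault or hang)";
        default: return "other VkResult: " + std::to_string(result);
    }
}
```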

blameth said:
I'm unsure why building the TLAS takes so long.

Does it work with a simpler mesh? Maybe the mesh has degenerate triangles or something similar that confuses the builder.
Or, if it's very big, maybe you need to split it into multiple pieces.
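JoeJ's degenerate-triangle theory can be checked on the CPU before the BLAS geometry is uploaded. A minimal sketch, assuming an indexed triangle list with `uint32_t` indices and `float` xyz positions (`Vec3`, `TriangleIssues` and `scanTriangles` are hypothetical names): it counts triangles whose cross product of two edges is (near) zero, i.e. zero area, and also triangles with indices past the end of the vertex array, which would read outside the vertex buffer. For the TLAS itself the primitives are instances, so this applies to the BLAS inputs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vec3 { float x, y, z; };

struct TriangleIssues {
    std::size_t outOfRange = 0;  // an index >= vertex count
    std::size_t degenerate = 0;  // (near) zero-area triangle
};

// Scans an indexed triangle list. The cross product of two edges has length
// 2 * area, so a (near) zero squared length means a degenerate triangle.
TriangleIssues scanTriangles(const std::vector<Vec3>& vertices,
                             const std::vector<uint32_t>& indices,
                             float crossLenSqEpsilon = 1e-12f) {
    TriangleIssues issues;
    for (std::size_t t = 0; t + 2 < indices.size(); t += 3) {
        const uint32_t i0 = indices[t], i1 = indices[t + 1], i2 = indices[t + 2];
        if (i0 >= vertices.size() || i1 >= vertices.size() || i2 >= vertices.size()) {
            ++issues.outOfRange;
            continue;
        }
        const Vec3& a = vertices[i0];
        const Vec3& b = vertices[i1];
        const Vec3& c = vertices[i2];
        const Vec3 e1{b.x - a.x, b.y - a.y, b.z - a.z};
        const Vec3 e2{c.x - a.x, c.y - a.y, c.z - a.z};
        const Vec3 n{e1.y * e2.z - e1.z * e2.y,
                     e1.z * e2.x - e1.x * e2.z,
                     e1.x * e2.y - e1.y * e2.x};
        const float lenSq = n.x * n.x + n.y * n.y + n.z * n.z;
        if (lenSq <= crossLenSqEpsilon) ++issues.degenerate;
    }
    return issues;
}
```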

@JoeJ With a primitive count of 1 and a single instance it builds just fine, though I have no idea why. Increasing the primitive and instance count (which are the same thing for a TLAS) stops it from working and causes the error.
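Since one instance works and more do not, it may help to confirm that every place the instance count appears agrees. For a TLAS build, the `primitiveCount` in the `VkAccelerationStructureBuildRangeInfoKHR` must not exceed the count passed in `pMaxPrimitiveCounts` to `vkGetAccelerationStructureBuildSizesKHR` (which sizes the acceleration structure and scratch buffers), and the instance buffer must hold `primitiveOffset` bytes plus `primitiveCount` 64-byte instance records. The sketch below is hypothetical (`TlasCounts` and `checkTlasCounts` are illustrative names) and takes those numbers as plain integers so it runs without Vulkan.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical snapshot of the numbers involved in one TLAS build.
struct TlasCounts {
    uint32_t instancesWritten;               // instances.size() on the CPU side
    uint64_t instanceBufferBytes;            // size the instance buffer was created with
    uint32_t maxPrimitiveCountUsedForSizes;  // pMaxPrimitiveCounts[0] in the size query
    uint32_t buildPrimitiveCount;            // range info primitiveCount at build time
    uint32_t primitiveOffset;                // range info primitiveOffset, in bytes
};

// sizeof(VkAccelerationStructureInstanceKHR) when instances are tightly packed
constexpr uint64_t kInstanceStride = 64;

// Returns human-readable problems; an empty vector means the counts agree.
std::vector<std::string> checkTlasCounts(const TlasCounts& c) {
    std::vector<std::string> problems;
    if (c.buildPrimitiveCount > c.maxPrimitiveCountUsedForSizes)
        problems.push_back("primitiveCount exceeds the count used to size the AS/scratch buffers");
    if (c.buildPrimitiveCount > c.instancesWritten)
        problems.push_back("build reads more instances than were written");
    const uint64_t bytesRead =
        c.primitiveOffset + uint64_t{c.buildPrimitiveCount} * kInstanceStride;
    if (bytesRead > c.instanceBufferBytes)
        problems.push_back("build reads past the end of the instance buffer");
    return problems;
}
```

For example, an instance buffer sized for one record (64 bytes) with three instances built reports exactly one problem, the read past the end of the buffer.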
