Issue with packed uints in shader on Intel HD-Chips

Started by November 04, 2022 02:07 PM
21 comments, last by DukeThrust 2 years, 2 months ago

Hello,

I'm experiencing issues with my tilemap shader on a wide range of "Intel HD" GPU chipsets. The issue appeared when I started packing the coordinates of the tilemap lookup into a single float in a texture, and seems to stem from that change. The packing algorithm:

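const auto makePackedTile = [&](uint16_t offX, uint16_t offY, uint8_t autotileId)
{
	// Bit layout: [0..3] autotile id, [4..17] x offset, [18..31] y offset
	AE_ASSERT(autotileId <= 8);
	uint32_t packed = autotileId;
	
	AE_ASSERT(offX <= 512);
	AE_ASSERT(offY <= 512);

	// Widen before shifting so the shifts operate on unsigned 32-bit values
	packed |= uint32_t(offX) << 4;
	packed |= uint32_t(offY) << 18;

	// Round-trip checks: unpacking must give back the inputs
	AE_ASSERT((packed & 0x0F) == autotileId);
	AE_ASSERT(((packed >> 4) & 0x3fff) == offX);
	AE_ASSERT((packed >> 18) == offY);

	// Reinterpret the bits as a float for storage in the float texture
	return std::bit_cast<float>(packed);
};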

As validated by the asserts at the end, this should theoretically yield the correct results (and it does on every other GPU). The value is written to a two-component 32-bit floating-point texture (RGF32), with an additional z-value in the second component. The shader code for reading the data:

int2 calculateTileOffsets(uint packedTile, float autotileData[9])
{
	uint autotileId = packedTile & 0x0F;
	
	int2 vOffset;
	vOffset.x = (packedTile >> 4) & 0x3fff;
	vOffset.y = autotileData[autotileId];
	vOffset.y += packedTile >> 18;
	
	return vOffset;
}

// usage:
float2 tile = Tilemap.SampleLevel( PointClamp, float3(In[0].vPos.zw / vTileHeight.yz,(Out.layer + 0.5f)/numRealLayers),  0);
uint packedTile = asuint(tile.x);
if (packedTile != -1)
{
	Out.vPos.z = 1.0f - (tile.y + curr * 0.0000001f);
	float2 vTex0 = calculateTileOffsets(packedTile, autotileData) * vTileHeight.wx;

I've been able to track the problem down to that function and the data extracted from “packedTile”. If I replace the evaluation of vOffset.x/y with static values, those tiles are rendered. Also, when I write arbitrary float values into the texture and use them in place of “tile.x”, the result is the same, meaning the sampling of the texture itself should work. I've also checked that the bound state and the contents of all textures and cbuffers are the same on the PC where the shader works and on the one where it doesn't. In case you wonder, here's the complete shader (which uses a geometry shader):

#pragma pack_matrix( row_major )
#pragma ruledisable 0x0802405f
cbuffer Instance: register(b3)
{
	float4 vOffset;
	float4 vTileOffset;
	float tileSize;
	float4 vInvTileSize;
	float4 vTileHeight;
	float3 vLayers;
	float activeLayer;
	float4 vInvTilesetSize;
	float4 vColorTone;
};

cbuffer Custom0: register(b4)
{
	float autotileData[9];
};

Texture3D Tilemap: register(t0);
Texture2D Image: register(t1);

sampler PointClamp : register(s0);
int2 calculateTileOffsets(uint packedTile, float autotileData[9])
{
	uint autotileId = packedTile & 0x0F;
	int2 vOffset;
	vOffset.x = (packedTile >> 4) & 0x3fff;
	vOffset.y = autotileData[autotileId];
	vOffset.y += packedTile >> 18;
	return vOffset;
}

struct VERTEX_IN
{
	float posIndex: SV_POSITION0;
};

struct VERTEX_OUT
{
	float4 vPos: SV_POSITION0;
};

VERTEX_OUT mainVS(VERTEX_IN In, uint VertexId : SV_VertexID, uint InstanceId : SV_InstanceID)
{
	VERTEX_OUT Out;
	uint offset = vOffset.w;
	float indexX = VertexId % offset;
	float indexY = floor(VertexId / offset);
	Out.vPos.x = indexX * tileSize;
	Out.vPos.y = indexY * tileSize;
	Out.vPos.xy += vOffset.xy;
	Out.vPos.x = Out.vPos.x * vTileOffset.z- 1.0f;
	Out.vPos.y = -Out.vPos.y * vTileOffset.w + 1.0f;
	indexX += vTileOffset.x;
	indexY += vTileOffset.y;
	Out.vPos.zw = float2(indexX, indexY);
	return Out;
}

struct GEOMETRY_OUT
{
	float4 vPos: SV_POSITION0;
	float2 vTex0: TEXCOORD0;
	float layer: TEXCOORD1;
};

[maxvertexcount(16)]
void mainGS(inout TriangleStream<GEOMETRY_OUT> outputStream, point VERTEX_OUT In[1])
{
	GEOMETRY_OUT Out;
	if(In[0].vPos.z  >= 0 && In[0].vPos.w >= 0 && In[0].vPos.z <= vTileHeight.y && In[0].vPos.w <= vTileHeight.z)
	{
		Out.vPos.w = 1.0f;
		float numLayers = vLayers.x;
		float numRealLayers = vLayers.z;
		float layerOffset = vLayers.y;
		for(float i = 0; i < numLayers; i++)
		{
			#if OPAGUE
			float curr = (numLayers - 1) - i;
			#else
			float curr = i;
			#endif
			Out.layer = curr + layerOffset;
			float2 tile = Tilemap.SampleLevel( PointClamp, float3(In[0].vPos.zw / vTileHeight.yz,(Out.layer + 0.5f)/numRealLayers),  0);
			uint packedTile = asuint(tile.x);
			if (packedTile != -1)
			{
				if(tile.y != 0.0f)
				{
					tile.y += vOffset.z;
				}
				else
				{
					tile.y = 0.5f - 0.0002f;
				}

				Out.vPos.z = 1.0f - (tile.y + curr * 0.0000001f);
				float2 vTex0 = calculateTileOffsets(packedTile, autotileData) * vTileHeight.wx;
				vTex0 += vInvTilesetSize.xy;
				Out.vPos.x = In[0].vPos.x;
				Out.vPos.y = In[0].vPos.y;
				Out.vTex0.x = vTex0.x;
				Out.vTex0.y = vTex0.y;
				outputStream.Append(Out);
				Out.vPos.x = In[0].vPos.x + vInvTileSize.x;
				Out.vPos.y = In[0].vPos.y;
				Out.vTex0.x = vTex0.x + vInvTileSize.z;
				Out.vTex0.y = vTex0.y;
				outputStream.Append(Out);
				Out.vPos.x = In[0].vPos.x;
				Out.vPos.y = In[0].vPos.y - vInvTileSize.y;
				Out.vTex0.x = vTex0.x;
				Out.vTex0.y = vTex0.y + vInvTileSize.w;
				outputStream.Append(Out);
				Out.vPos.x = In[0].vPos.x + vInvTileSize.x;
				Out.vPos.y = In[0].vPos.y - vInvTileSize.y;
				Out.vTex0.x = vTex0.x + vInvTileSize.z;
				Out.vTex0.y = vTex0.y + vInvTileSize.w;
				outputStream.Append(Out);
				outputStream.RestartStrip();
			}
		}
	}
}

struct PIXEL_OUT
{
	float4 vColor: SV_Target0;
};

PIXEL_OUT mainPS(GEOMETRY_OUT In)
{
	PIXEL_OUT Out;
	Out.vColor = Image.Sample( PointClamp,  In.vTex0);
	Out.vColor.rgb += vColorTone;
	#if OPAGUE
	clip((Out.vColor.a <= 0.01f) ? -1 : 1);
	#endif
	if(activeLayer != -1.0f && In.layer != activeLayer)
	{
		Out.vColor.a = 0.25f;
	}

	return Out;
}

(It's auto-generated, so sorry for the formatting, but I doubt anybody will be able to fully understand it without further explanation.)
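
To make the bit layout easier to check outside the engine, here is a minimal CPU-side sketch of the packing and the matching unpacking, assuming the same layout as above (bits 0-3 autotile id, bits 4-17 x offset, bits 18-31 y offset). The names packTileBits, unpackTileBits, UnpackedTile, bitsToFloat, floatToBits and isSubnormalPattern are hypothetical helpers, not part of the engine, and std::memcpy stands in for std::bit_cast so the sketch also builds before C++20.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper type holding the three unpacked fields.
struct UnpackedTile
{
    uint32_t autotileId;
    uint32_t offX;
    uint32_t offY;
};

// Pack with the layout from the post: [0..3] id, [4..17] x, [18..31] y.
inline uint32_t packTileBits(uint32_t offX, uint32_t offY, uint32_t autotileId)
{
    assert(autotileId <= 8);
    assert(offX <= 512 && offY <= 512);
    return autotileId | (offX << 4) | (offY << 18);
}

// Mirror of the shader-side extraction (without the autotileData lookup).
inline UnpackedTile unpackTileBits(uint32_t packed)
{
    return { packed & 0x0Fu, (packed >> 4) & 0x3FFFu, packed >> 18 };
}

// Reinterpret bits as float and back, as the texture upload and asuint() do.
inline float bitsToFloat(uint32_t bits)
{
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

inline uint32_t floatToBits(float f)
{
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// True when the pattern is a subnormal float: zero exponent, non-zero mantissa.
inline bool isSubnormalPattern(uint32_t bits)
{
    return (bits & 0x7F800000u) == 0u && (bits & 0x007FFFFFu) != 0u;
}
```

One thing such a sketch makes visible, offered only as an observation and not as a confirmed diagnosis: with this layout, any non-zero value whose y offset is below 32 is a subnormal (denormal) float bit pattern, because the exponent bits (23 to 30) are all zero. Whether a particular GPU's sampling path preserves such patterns bit for bit is an assumption I have not verified.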

----------------------------------------------------------------

So, does anybody see any issue with the code in question? Is there something on Intel chips you have to be aware of with uint datatypes and/or bit shifting? The shaders had been working until I introduced that packing, so I guess I must either be doing something shady or there is some bug in the Intel drivers. From what I can tell, the output that I get on the Intel cards for the function in question is int2(1, 0), and it is the same across the entire tilemap. Any ideas what's going on there?

#edit: post editor hates me, adding links took way too many tries


This may be a stupid answer; if so, I'm sorry. I just wanted to be sure that the PC with the problems supports Shader Model 4.0, since e.g. asuint requires it.
I found these 2 posts in the Intel Forums which could be related.

Incorrect output post: https://community.intel.com/t5/Graphics/Compute-shader-producing-incorrect-output/td-p/1103849

GLSL Problem: https://community.intel.com/t5/Graphics/GLSL-Shaders-issues/td-p/279960

“It's a cruel and random world, but the chaos is all so beautiful.”
― Hiromu Arakawa

Ultraporing said:
This may be a stupid answer; if so, I'm sorry. I just wanted to be sure that the PC with the problems supports Shader Model 4.0, since e.g. asuint requires it.

Yes, my own test PC is running DirectX with SM 5.0. SM 4.0 should be required for geometry shaders anyway, so I would expect the shader to fail to run at all on a system that doesn't support it.

The two posts seem interesting. They don't seem to describe the same issue, but it might be a bug in the driver after all, which I'll try to report. I at least checked with the DirectX reference renderer/device and fixed some warnings in the shader, but that doesn't seem to change anything. It also doesn't seem to be related to optimization settings, nor does the debug runtime give any errors or warnings on the PC where the issue is reproducible.

At least I've found a sort of solution for the users who reported the issue, by forcing the application to always select the dedicated graphics card when available (via the symbol export, as not all of the users seem to support DXGI 1.6).
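
For readers unfamiliar with the symbol-export approach, a minimal sketch follows. It assumes the two commonly documented exports, NvOptimusEnablement (NVIDIA Optimus) and AmdPowerXpressRequestHighPerformance (AMD switchable graphics); which exports my application actually uses is not spelled out above, so treat this as an illustration only. The AE_EXPORT_SYMBOL macro is a hypothetical name, and the non-Windows branch exists only so the sketch compiles elsewhere.

```cpp
#include <cstdint>

#if defined(_WIN32)
#define AE_EXPORT_SYMBOL __declspec(dllexport) // hypothetical macro name
#else
#define AE_EXPORT_SYMBOL // no-op outside Windows so the sketch still compiles
#endif

extern "C"
{
    // Looked up by NVIDIA Optimus drivers in the executable's export table;
    // a value of 1 requests the high-performance (discrete) GPU.
    AE_EXPORT_SYMBOL uint32_t NvOptimusEnablement = 0x00000001;

    // Looked up by AMD switchable-graphics drivers for the same purpose.
    AE_EXPORT_SYMBOL int AmdPowerXpressRequestHighPerformance = 1;
}
```

These must be exported from the executable itself (not from a DLL it loads) to have an effect, and they only express a preference; the driver makes the final choice.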

Still, if anybody spots something else that looks wrong, let me know.

Ok, I found a solution after posting the problem in the Intel forums with a sample project (https://community.intel.com/t5/Graphics/HLSL-shader-issue-with-asuint-bit-packing/m-p/1427861/thread-id/110946#M110949).

Since somebody had an issue on his AMD card as well, I went back to assuming something is wrong with the shader after all. As posted in the Intel thread, I changed the Sample of the tilemap-index texture to a Load instead, which is simpler in both the math and the data required, and doesn't need a sampler, which is how I came to that “solution”. I'm still not 100% sure what the actual issue/difference is. I guess it must have something to do with how I sample from the 3D texture? The sampler would have been:

Though now that it's fixed, I only partially care about the reasons (insofar as they help me prevent similar issues in the future). So, if anybody has an idea of what was wrong here, feel free to let me know. You can even try out the project that I submitted on the Intel site, if you really care to look into it more deeply.

Hmm, i guess using any filter, even if it's a point filter, does not guarantee to get bit exact floating point numbers from the original texture data.

Personally i would not even want to use a representation of integers in images. I would use ‘general memory buffers’ instead, due to paranoia.
But interesting to see it actually works. : )

Juliean said:

The sampler would have been:

I tried your sample and it worked on my NVIDIA card. I did a bit of googling and found an example for the PointClamp state, and noticed when comparing yours with theirs that you use WRAP for AddressW instead of CLAMP like in their PointClamp. No clue if it has something to do with it.
Sadly my experience with shader programming is almost non existent so I will stop posting here to not spam the topic too much.

Anyways here is a piece of the DirectXTK CommonStates and a link to the page, maybe it can be some inspiration.
Link: https://github.com/microsoft/DirectXTK/wiki/CommonStates

const float border[4] = { 0.f, 0.f, 0.f, 0.f };
float maxAnisotropy = (device->GetFeatureLevel() > D3D_FEATURE_LEVEL_9_1) ? 16 : 2;

// PointWrap
CD3D11_SAMPLER_DESC desc(D3D11_FILTER_MIN_MAG_MIP_POINT,
    D3D11_TEXTURE_ADDRESS_WRAP, D3D11_TEXTURE_ADDRESS_WRAP, D3D11_TEXTURE_ADDRESS_WRAP,
    0.f, maxAnisotropy, D3D11_COMPARISON_NEVER, border, 0.f, FLT_MAX);

// PointClamp
CD3D11_SAMPLER_DESC desc(D3D11_FILTER_MIN_MAG_MIP_POINT,
    D3D11_TEXTURE_ADDRESS_CLAMP, D3D11_TEXTURE_ADDRESS_CLAMP, D3D11_TEXTURE_ADDRESS_CLAMP,
    0.f, maxAnisotropy, D3D11_COMPARISON_NEVER, border, 0.f, FLT_MAX);

JoeJ said:
Hmm, i guess using any filter, even if it's a point filter, does not guarantee to get bit exact floating point numbers from the original texture data. Personally i would not even want to use a representation of integers in images. I would use ‘general memory buffers’ instead, due to paranoia. But interesting to see it actually works. : )

Perhaps, yeah, even though from my understanding a point filter should have that guarantee. It's also not just a little off, by a few bits, but an entirely different number altogether.

But it's true, I could use a StructuredBuffer; I have an implementation of that in my engine now. Back when I started that shader, I only had textures and cbuffers. I would have to submit a bit more data for manually calculating the index, but it could even be faster than the texture overall.
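
As a rough idea of what "manually calculating the index" would involve, here is a minimal sketch of flattening a (x, y, layer) tile coordinate into a linear buffer index. The TilemapExtent type, the flatTileIndex name and the row-major, layer-last ordering are assumptions for illustration; the extent values are the extra data that would have to be passed to the shader.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical description of the tilemap dimensions in tiles.
struct TilemapExtent
{
    uint32_t width;
    uint32_t height;
    uint32_t layers;
};

// Flatten (x, y, layer) into a linear index: x varies fastest, then y,
// then layer, matching a tightly packed 3D array.
inline std::size_t flatTileIndex(const TilemapExtent& e,
                                 uint32_t x, uint32_t y, uint32_t layer)
{
    assert(x < e.width && y < e.height && layer < e.layers);
    return (static_cast<std::size_t>(layer) * e.height + y) * e.width + x;
}
```

The same arithmetic would run in the shader with integer coordinates, so no sampler or normalized coordinates are involved at all.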

Ultraporing said:
I tried your sample and it worked on my NVIDIA card. I did a bit of googling and found an example for the PointClamp state, and noticed when comparing yours with theirs that you use WRAP for AddressW instead of CLAMP like in their PointClamp. No clue if it has something to do with it.

I noticed that too, but I also wouldn't expect that to cause the issue. The calculated z-coordinate would have been (0.0f + 0.5f)/1.0f = 0.5f, which should not be in a range where it needs to be either wrapped or clamped. Unless the wrap calculation causes some issues with the precision of the float that is being read? Idk really :D
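
To illustrate the reasoning about the z-coordinate, here is an idealized sketch of how wrap and clamp addressing followed by point sampling pick a texel. The function names are hypothetical, and real hardware works in fixed point with its own rounding, so this models only the textbook behavior, not any specific GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Idealized WRAP addressing: keep only the fractional part of the coordinate.
inline float addressWrap(float u)
{
    return u - std::floor(u);
}

// Idealized CLAMP addressing: restrict the coordinate to [0, 1].
inline float addressClamp(float u)
{
    return std::min(std::max(u, 0.0f), 1.0f);
}

// Idealized point sampling: pick the texel whose span contains the coordinate.
inline uint32_t pointTexel(float u, uint32_t size)
{
    const float scaled = std::floor(u * static_cast<float>(size));
    const uint32_t idx = static_cast<uint32_t>(std::max(scaled, 0.0f));
    return std::min<uint32_t>(idx, size - 1);
}
```

In this model a coordinate of 0.5 on a one-layer texture lands on texel 0 under both modes; the two only diverge at or beyond the edges, for example at exactly 1.0.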

Ultraporing said:
Sadly my experience with shader programming is almost non existent so I will stop posting here to not spam the topic too much.

Oh not at all, I appreciate all input.

Juliean said:
Perhaps, yeah, even though from my understanding a point filter should have that guarantee. It's also not just a little off, by a few bits, but an entirely different number altogether.

If the number noticeably differs even when interpreted as float, that rules out my explanation about rounding issues, possibly lower precision in TMUs, etc.
And i agree point sampling ideally should return the original data, so scratch my explanation.

Juliean said:
but it could even be faster than the texture overall.

This question gave me a lot of uncertainty when i started with compute, so i did some comparisons (for my floating point data) back then (NV Fermi, Kepler, AMD GCN architectures).
I was assuming textures might have some advantages. But the differences were tiny enough to ignore them. So i've settled with using general buffers for everything which requires no filter and is no image.
But i'm still uncertain about this. You could get a slowdown too.

JoeJ said:
This question gave me a lot of uncertainty when i started with compute, so i did some comparisons (for my floating point data) back then (NV Fermi, Kepler, AMD GCN architectures). I was assuming textures might have some advantages. But the differences were tiny enough to ignore them. So i've settled with using general buffers for everything which requires no filter and is no image. But i'm still uncertain about this. You could get a slowdown too.

Yeah, GPU performance is a hard topic. There are still a lot of things I'm uncertain about, too. For example, I have little idea if SampleLevel vs Load is actually better in my example, performance-wise. I also tried changing the cbuffer for the animation data to a StructuredBuffer (since the number of animated tiles can vary), but got a significant decrease in performance. Though in all honesty, the difference either way is negligible; the shader is able to render large tilemaps without issues either way :D It's just fun optimizing things, sometimes.

Juliean said:
I also did try to change the cbuffer for the animation-data to use a StructuredBuffer (since the amount of animated tiles can vary)

I can explain this one, at least.

Not familiar with DirectX terminology, but i guess cbuffer goes to the on chip ‘constant ram’ (same as Khronos ‘shader uniforms’).
The constant ram is filled when a new draw call / compute kernel is dispatched. It is then read only during execution, unlike writable ‘shared’ LDS memory available to compute shaders, which is on chip too.
On GCN, iirc LDS is 64kb per CU, and constant ram is 128kb.
StructuredBuffer (Khronos: Shader Storage Buffer) is on slow VRAM but of unlimited size and writable. Because it's not on chip it is much slower and goes through the cache hierarchy.

This makes clear why you want to use constant ram if you can.
A situation where this often is no longer possible is having too many lights or bone matrices, or bindless rendering techniques.

Juliean said:
I have little idea if SampleLevel vs Load is actually better in my example, performance-wise.

I can only guess: Load should be faster or equally fast, because it causes less work for the TMU (or bypasses it completely). So i would not worry, and if you don't see a difference it should be fine on any GPU.

Further i guess Load turns texture memory access into the same as a general memory access we see with StructuredBuffer.
But yeah, not sure. VRAM memory access, related pipelined execution, caching, etc., is where my knowledge is bad.
Otherwise GPU performance is easier to predict and understand than CPU perf. to me, because there is no branch prediction, speculative execution and such black boxes.

This topic is closed to new replies.