Issue with packed uints in shader on Intel HD-Chips

Julian Watzinger · 2022-11-07T08:15:04

Hello, I'm experiencing issues with my tilemap-shader, on a wide range of "Intel HD"-GPU chipsets. The issue was introduced with, and seems to stem from me packing the coordinates of the tilemap-lookup in one float in a texture. The algorithm: const auto makePackedTile = [&](uint16_t offX, uint16_t offY, uint8_t autotileId) { AE_ASSERT(autotileId <= 8); uint32_t packed = autotileId; AE_ASSERT(offX <= 512); AE_ASSERT(offY <= 512); packed |= offX << 4; packed |= offY << 18; AE_ASSERT((packed & 0x0F) == autotileId); AE_ASSERT(((packed >> 4) & 0x3fff) == offX); AE_ASSERT((packed >> 18) == offY); return std::bit_cast<float>(packed); }; As validated by the asserts at the end, this should theoretically yield the correct results (and it does so on any other GPU). This is being written to a 32-bit floating point two-component texture (RGF32), with an additional z-value in the other component. The shader-code for reading the data: int2 calculateTileOffsets(uint packedTile, float autotileData[9]) { uint autotileId = packedTile & 0x0F; int2 vOffset; vOffset.x = (packedTile >> 4) & 0x3fff; vOffset.y = autotileData[autotileId]; vOffset.y += packedTile >> 18; return vOffset; } // usage: float2 tile = Tilemap.SampleLevel( PointClamp, float3(In[0].vPos.zw / vTileHeight.yz,(Out.layer + 0.5f)/numRealLayers), 0); uint packedTile = asuint(tile.x); if (packedTile != -1) { Out.vPos.z = 1.0f - (tile.y + curr * 0.0000001f); float2 vTex0 = calculateTileOffsets(packedTile, autotileData) * vTileHeight.wx; I've been able to track the problem down to that function and the data extracted from “packedTile”. If I replace the evaluation of vOffset.x/y with static values, those tiles are then rendered. Also, when I write arbitrary float-values into the texture, and use them in place of “tile.x”, the result is the same, meaning the sample of the actual texture should work. Also, I've checked that the state being bound and the content of all the textures and cbuffers are the same on the PC where the shader works, and the one where it doesn't. Just in case you wonder, here's the complete shader (which uses a geometry-shader): #pragma pack_matrix( row_major ) #pragma ruledisable 0x0802405f cbuffer Instance: register(b3) { float4 vOffset; float4 vTileOffset; float tileSize; float4 vInvTileSize; float4 vTileHeight; float3 vLayers; float activeLayer; float4 vInvTilesetSize; float4 vColorTone; }; cbuffer Custom0: register(b4) { float autotileData[9]; }; Texture3D Tilemap: register(t0); Texture2D Image: register(t1); sampler PointClamp : register(s0); int2 calculateTileOffsets(uint packedTile, float autotileData[9]) { uint autotileId = packedTile & 0x0F; int2 vOffset; vOffset.x = (packedTile >> 4) & 0x3fff; vOffset.y = autotileData[autotileId]; vOffset.y += packedTile >> 18; return vOffset; } struct VERTEX_IN { float posIndex: SV_POSITION0; }; struct VERTEX_OUT { float4 vPos: SV_POSITION0; }; VERTEX_OUT mainVS(VERTEX_IN In, uint VertexId : SV_VertexID, uint InstanceId : SV_InstanceID) { VERTEX_OUT Out; uint offset = vOffset.w; float indexX = VertexId % offset; float indexY = floor(VertexId / offset); Out.vPos.x = indexX * tileSize; Out.vPos.y = indexY * tileSize; Out.vPos.xy += vOffset.xy; Out.vPos.x = Out.vPos.x * vTileOffset.z- 1.0f; Out.vPos.y = -Out.vPos.y * vTileOffset.w + 1.0f; indexX += vTileOffset.x; indexY += vTileOffset.y; Out.vPos.zw = float2(indexX, indexY); return Out; } struct GEOMETRY_OUT { float4 vPos: SV_POSITION0; float2 vTex0: TEXCOORD0; float layer: TEXCOORD1; }; [maxvertexcount(16)] void mainGS(inout TriangleStream<GEOMETRY_OUT> outputStream, point VERTEX_OUT In[1]) { GEOMETRY_OUT Out;if(In[0].vPos.z >= 0 && In[0].vPos.w >= 0 && In[0].vPos.z <= vTileHeight.y && In[0].vPos.w <= vTileHeight.z) { Out.vPos.w = 1.0f; float numLayers = vLayers.x; float numRealLayers = vLayers.z; float layerOffset = vLayers.y; for(float i = 0; i < numLayers; i++) { #if OPAGUE float curr = (numLayers - 1) - i; #else float curr = i; #endif Out.layer = curr + layerOffset; float2 tile = Tilemap.SampleLevel( PointClamp, float3(In[0].vPos.zw / vTileHeight.yz,(Out.layer + 0.5f)/numRealLayers), 0); uint packedTile = asuint(tile.x); if (packedTile != -1) { if(tile.y != 0.0f) { tile.y += vOffset.z; } else { tile.y = 0.5f - 0.0002f; } Out.vPos.z = 1.0f - (tile.y + curr * 0.0000001f); float2 vTex0 = calculateTileOffsets(packedTile, autotileData) * vTileHeight.wx; vTex0 += vInvTilesetSize.xy; Out.vPos.x = In[0].vPos.x; Out.vPos.y = In[0].vPos.y; Out.vTex0.x = vTex0.x; Out.vTex0.y = vTex0.y; outputStream.Append(Out); Out.vPos.x = In[0].vPos.x + vInvTileSize.x; Out.vPos.y = In[0].vPos.y; Out.vTex0.x = vTex0.x + vInvTileSize.z; Out.vTex0.y = vTex0.y; outputStream.Append(Out); Out.vPos.x = In[0].vPos.x; Out.vPos.y = In[0].vPos.y - vInvTileSize.y; Out.vTex0.x = vTex0.x; Out.vTex0.y = vTex0.y + vInvTileSize.w; outputStream.Append(Out); Out.vPos.x = In[0].vPos.x + vInvTileSize.x; Out.vPos.y = In[0].vPos.y - vInvTileSize.y; Out.vTex0.x = vTex0.x + vInvTileSize.z; Out.vTex0.y = vTex0.y + vInvTileSize.w; outputStream.Append(Out); outputStream.RestartStrip(); } } } } struct PIXEL_OUT { float4 vColor: SV_Target0; }; PIXEL_OUT mainPS(GEOMETRY_OUT In) { PIXEL_OUT Out; Out.vColor = Image.Sample( PointClamp, In.vTex0); Out.vColor.rgb += vColorTone; #if OPAGUE clip((Out.vColor.a <= 0.01f) ? -1 : 1); #endif if(activeLayer != -1.0f && In.layer != activeLayer) { Out.vColor.a = 0.25f; } return Out; } (Its being auto-generated, so sorry for the formatting - but I doubt anybody will be able to fully understand it without further explanation). ---------------------------------------------------------------- So, does anybody see any issue with the code in guestion? Is there something on intel-chips you have to be aware with uint-datatypes and/or bitshifting? The shaders have been working until I introduced that packing, so I guess I must eigther be doing something shady or there is some bug in the intel drivers? From what I can tell, the output that I get on the Intel cards for the function in guestion is int2(1, 0), and is the same across the entire tilemap. Any ideas what going on there?

Graphics and GPU Programming Programming

Started by Juliean November 04, 2022 02:07 PM

21 comments, last by DukeThrust 2 years, 2 months ago

Juliean

Author

7,378

November 06, 2022 11:30 AM

JoeJ said:
This makes clear why you want to use constant ram if you can. A situation where this often is no longer possible is having too many lights or bone matrices, or bindless rendering techniques.

Thanks, that makes things a lot more clear. Then the solution here is just to make the cbuffer as large as supported. The number of animated tiles is never going to become really large. I suppose I could use a shader-switch to change the number of supported elements in increments. Do you happen to also know if there is an overhead in having a large cbuffer, lets say 4096-float4 if only 8 of those are currently in use?

JoeJ said:
Further i guess Load turns texture memory access into the same as a general memory access we see with StructuredBuffer. But yeah, not sure. VRAM memory access, related pipelined execution, caching, etc., is where my knowledge is bad. Otherwise GPU performance is easier to predict and understand than CPU perf. to me, because there is no branch prediction, speculative execution and such black boxes.

Interesting, I find CPU-performance much easier to understand. Especially after learning ASM for my JIT-compiler, but even before :D True, there are systems in the background that you don't control directly, but the basics of whats faster seem much clearer to me - use less memory, access memory in a linear fashion, precompute results (as long as it doesn't violate the former two), cache expensive calls in local variables, etc… For GPU, for me, even if I know whats basically the right thing, I sometimes find it hard to execute - MAD-instructions, as an example. The one class we had on (CUDA) compute-shader, where the guy explained how to optimize the performance of a shader by x128 just be changing the way that memory is accessed and how the batches are executed, I couldn't execute myself :D Maybe with a bit more experience -I never got around to implementing compute-shaders in my engine so far, not really all that important for 2d graphics which I'm working on right now.

Ultraporing

114

November 06, 2022 12:10 PM

Juliean said:
Ultraporing said:
I tried your sample and it worked for my NVIDIA card. I did a bit of googling and found an example for the PointClamp state, and noticed when comparing yours with theirs that you use for AddressW WRAP instead of CLAMP like in their point wrap. No clue if it has something todo with it.
I noticed that too, but I also wouldn't expect that to cause the issue. The calculated z-coordinate would have been (0.0f + 0.5f)/1.0f = 0.5f, which should not be in a range where it eigther needs to be wrapper or clamped. Unless the wrap-calculation causes some issues with the precision of the float that is being read? Idk really :D
Ultraporing said:
Sadly my experience with shader programming is almost non existent so I will stop posting here to not spam the topic too much.
Oh not at all, I appreciate all input.

Since my intuition doesn't leave me alone and had a similar issue like the following which cost me nearly a week of debugging at an old job, I'll just throw this out there.

Did you test some simple division and calculations like in your Broken sample as standalone test with known values and result on both machines? The Intel one and the one it runs fine on, to verify the consistency between the machines and operations used by both hardware's.
Since I had a problem a few years back with purely CPU processed code (20 years old legacy software and my personal hell) of packed floats values into INTs. And subsequent division operations on different hardware. It ran fine on my new laptop with win10 but It would not work on a clients machine and do funky stuff like turning after the division of a value which should have been i.e. 0.8f into 0f. I think I fixed it by removing the division operations and only using multiplication.

My stomach tells me it's some banal incompatibility with unpredictable results due driver/OS/hardware (maybe even firmware) changes in the background like you suggested.

Good luck

“It's a cruel and random world, but the chaos is all so beautiful.”
― Hiromu Arakawa

Juliean

Author

7,378

November 06, 2022 12:16 PM

@ultraporing Yeah, I suppose it would go into that direction. But since my fix works and is an improvement overall for the shader, I don't really feel like investigating it much further. Especially since the test-machine is slow AF. But thanks for all the input, if I ever run into a similar issue (or feel like it), I'll probably investigate into that direction ?

JoeJ

4,369

November 06, 2022 01:19 PM

Juliean said:
Do you happen to also know if there is an overhead in having a large cbuffer, lets say 4096-float4 if only 8 of those are currently in use?

Hmmm, i can only draw conclusions…

Assume multiple different shaders, each having different cbuffers, would run on the same CU.
In this case using large cbuffers would reduce occupancy, because when exceeding the physical limit, multiple shaders could not run on the same CU.

But i have never heard of constant ram being a factor for occupancy. It's just registers and LDS. So i would conclude different shaders never run on the same CU at the same time at all, and thus we can use all ram which is there without worries.

However, this can't be the whole story. Because then all GPUs would have the same exact size of constant ram as defined by API limits, because having more ram would be pointless then. And i doubt that's the case.

So, after all that thinking… i don't know :D But i'm very sure there is no good reason to minimize constant ram usage in general. Much more likely you want to ‘utilize’ it.

Juliean said:
use less memory, access memory in a linear fashion, precompute results

That's the same on GPU too, but there are more related details and it's more complex.

On CPU you're good with reading like so:

a[0]
a[1]
a[2] ...

On GPU, assuming it has only 2 threads executing in lockstep, and writing the threads horizontally and execution order vertically, the ideal pattern is this:

a[0] | a[1]
a[2] | a[3]

But not this, which is worse:

a[0] | a[2]
a[1] | a[3]

Though, you can care about this only with compute shaders. For PS, VS, etc. the second option is ofc. all you can do.

Also, on GPU you want to ‘manually pre cache’ your data if you access it more often than once and from different threads in inner loops.
So you may read a block of data from VRAM to LDS,
then do your processing stuff only using LDS, and you may write your results to another LDS array with random access,
then, when you're done you write back your results from LDS to VRAM with (ideally) linear access.

So there's lots of extra effort to deal with memory most efficiently. And the goal always is to have no VRAM access in inner loops.
Again, that's CS only, and depends on algorithm ofc.

Juliean said:
cache expensive calls in local variables

I have not worked for years on GPU and don't know how compilers have improved, but on GPU you usually want to inline everything, so no little helper functions at all.
Almost nobody does this, but it's true. When optimizing a shader, the first impression can be that having sub functions even helps, but after being done with optimizing inlining all the code always ended up faster to me.
So if hlsl has pragmas not only to roll and unroll, but also to inline, try it out for low effort gains. Vulkan had neither of these when i worked on it.

Caching to local variables itself also is more pain, because it may increase your register count by one, eventually reducing the occupancy tier. So i do a lot of bit packing on GPU, and also use LDS for the cache instead local variables.

Juliean said:
I never got around to implementing compute-shaders in my engine so far, not really all that important for 2d graphics which I'm working on right now.

Probably.
It's a pain in the ass. The code is hard to maintain, HW details affect ideal choices on algorithms, languages are meh, etc.

But i really love the concept of parallel algorithms. The programming process itself becomes interesting and rewarding, even for simple stuff. It's something else and new.

But sadly, current APIs are not cross platform, or outdated, or dying, and low level. So the power of GPUs is not available to the general programmer. I have plenty of experience, but i still shy away from porting any of my (slow) preprocessing tools to compute for those reasons.
It's really a pity, and imo the largest failure of the tech industry to come up with a proper standard still after so many years. : (

Juliean

Author

7,378

November 06, 2022 01:29 PM

JoeJ said:
So if hlsl has pragmas not only to roll and unroll, but also to inline, try it out for low effort gains. Vulkan had neither of these when i worked on it.

From my understanding, there are no functions after compilation of a shader is done. Everything will always be inlined. There is no “call” instruction as there is on the GPU, so inlining functions is a necessity. That to me usually means that having a sub-routine on the GPU is purely a matter of whether it makes the code more readable or not.

JoeJ said:
Probably. It's a pain in the ass. The code is hard to maintain, HW details affect ideal choices on algorithms, languages are meh, etc. But i really love the concept of parallel algorithms. The programming process itself becomes interesting and rewarding, even for simple stuff. It's something else and new. But sadly, current APIs are not cross platform, or outdated, or dying, and low level. So the power of GPUs is not available to the general programmer. I have plenty of experience, but i still shy away from porting any of my (slow) preprocessing tools to compute for those reasons. It's really a pity, and imo the largest failure of the tech industry to come up with a proper standard still after so many years. : (

Yeah, thats also an argument for why I haven't started yet. Since my engine is (theoretically) cross-platform, I would want have a way to only write computer-shaders once, like I do now with my own meta-shader language. But that requires a lot more time than it would even just for getting started, since I'd have to look at the differences between the languages, and I'm not even sure if this is technically possible to get good cross-platform results for compute-shaders; as you suggested its probably not.

Ultraporing

114

November 06, 2022 02:46 PM

Juliean said:
Yeah, thats also an argument for why I haven't started yet. Since my engine is (theoretically) cross-platform, I would want have a way to only write computer-shaders once, like I do now with my own meta-shader language. But that requires a lot more time than it would even just for getting started, since I'd have to look at the differences between the languages, and I'm not even sure if this is technically possible to get good cross-platform results for compute-shaders; as you suggested its probably not.

If you decide to implement Compute Shaders sometime in the future Heterogeneous-Computing Interface for Portability (HIP) looks promising.
With this you would only need to Translate your Meta Language to HIP, and from there you can just use HCC (AMD) or NVCC (NVIDIA) compilers.

“It's a cruel and random world, but the chaos is all so beautiful.”
― Hiromu Arakawa

Juliean

Author

7,378

November 06, 2022 03:35 PM

Ultraporing said:
With this you would only need to Translate your Meta Language to HIP, and from there you can just use HCC (AMD) or NVCC (NVIDIA) compilers.

Oh, the thing with me is, I don't use external libraries at all, unless virtually required (like DX/GL) :D I know this is (objectively) stupid, writing your own implementation for things like the (awesome sounding) library you posted, but thats just how I roll :D After all, my project(s) are just for fun, and I have great fun reinventing the wheel like that, always have. Though I might take a look at the library for some inspiration if I ever end up tackling that feature, so thanks for posting after all ?

JoeJ

4,369

November 06, 2022 04:41 PM

Juliean said:
From my understanding, there are no functions after compilation of a shader is done. Everything will always be inlined. There is no “call” instruction as there is on the GPU, so inlining functions is a necessity. That to me usually means that having a sub-routine on the GPU is purely a matter of whether it makes the code more readable or not.

Oops, agree about the missing call, which makes my expectation on getting an inline pragma pretty stupid, admitted.
I don't know the reason of the problem, never compared any ISA output. In fact it should be no difference, generating the same output no matter if we use functions or not. But i got higher register usage from using functions so often, at some point i stopped to use them almost in general.
I really hope this has been ‘fixed’ since then.

Optimizing for register usage is black art for other reasons too. Slight syntax changes which have no expected effect often do have a big effect.
It went so far, once i have improved performance by a factor of two by including a pointless branch like this: if (localThreadID < 256). The branch is always true on any device, but it still motivated the compiler to reduce register usage by something like 10.
That's almost a compiler bug, producing correct code but bad performance, fixed by writing really bad code.
Again i hope that's much better now than 5 years ago.

Juliean said:
and I'm not even sure if this is technically possible to get good cross-platform results for compute-shaders

For this case i would not expect problems.
Personally i have used the same shader code for OpenCL 1.2 and Vulkan (GLSL). I did not write some transpiler tools myself, but used a regular C preprocessor and some defines to deal with different syntax.
There is little difference between OpenCL 1.2 and compute shaders, and i assume it's even much less for compute shaders across GLSL / HLSL / Metal.
So you would only need to figure out the subtle differences in syntax to make generated code for multiple APIs. It should not be any harder than with pixel shaders.

For general GPU usage i would just prefer a simple API with a C alike language and features such as indirect dispatch and device generated dispatch. Something like OpenCL 2.0 or Cuda.
I don't want to use a bloated gfx API for that. Although, that's currently the best option regarding cross platform.
Maybe SYCL gets wide support, now that Intel builds on it. Otherwise my only hope is future C++.

Juliean

Author

7,378

November 06, 2022 05:00 PM

JoeJ said:
I don't know the reason of the problem, never compared any ISA output. In fact it should be no difference, generating the same output no matter if we use functions or not. But i got higher register usage from using functions so often, at some point i stopped to use them almost in general.

Note sure eigther, maybe there are subtle differences between the code you wrote with the function and without. While the forced-inlining might sound like an advantage, it could turn out to be a disadvantage too, if the function is large, used in many places, maybe inside a loop. This could negatively affect register-usage, and the overhead of a “call” (at least on traditional cpu) could be far less then the increased code-size etc…

JoeJ said:
Optimizing for register usage is black art for other reasons too. Slight syntax changes which have no expected effect often do have a big effect. It went so far, once i have improved performance by a factor of two by including a pointless branch like this: if (localThreadID < 256). The branch is always true on any device, but it still motivated the compiler to reduce register usage by something like 10. That's almost a compiler bug, producing correct code but bad performance, fixed by writing really bad code. Again i hope that's much better now than 5 years ago.

Yeah, exactly, that where I totally lack understand on the GPU so far. I'll have to include a disassembler in my engine sometimes, so that I can start looking at the generated ASM, now that I probably understand a bit better whats going on. Though FWIW, I should also mention that for writing shaders for ingame effects, I tend to prefer using a visual interface (like shader-graph etc…) which leaves less room to optimize in that regard as well :D Only really complicated shaders, like the tilemap from this thread, I write in my cross-platform language.

JoeJ said:
Personally i have used the same shader code for OpenCL 1.2 and Vulkan (GLSL). I did not write some transpiler tools myself, but used a regular C preprocessor and some defines to deal with different syntax.

I'm not even sure where my tool (for shaders currently) would even fit in :D Its definately more complicated than a preprocessor, but I'm not sure if it counts as a transpiler. I've basically defined a JSON-esque interface, where blocks are then translated to the respective language-equivalent. There's probably some old blogpost floating around where I showcase this. In reality, I didn't update the OpenGL-backend in a very long time, since I didn't really need to so far. Only if I ever go for android/iOS-support, I suppose I'd have to deal with that again.

JoeJ said:
For general GPU usage i would just prefer a simple API with a C alike language and features such as indirect dispatch and device generated dispatch. Something like OpenCL 2.0 or Cuda. I don't want to use a bloated gfx API for that. Although, that's currently the best option regarding cross platform. Maybe SYCL gets wide support, now that Intel builds on it. Otherwise my only hope is future C++.

That would be the least of my worries (I hope). I have a pretty well-rounded, platform-agnostic wrapper for GFX-stuff so far, and adding the capabilities to execute compute-shaders that way should (hopefully) be easy. I have to say, I'm getting intrigued trying it out. Though I have to work more on the game right now. I'll add it to the list with technical things that I want to try (like DX12 or Android-support) for the future :D

JoeJ

4,369

November 06, 2022 05:58 PM

Ultraporing said:
If you decide to implement Compute Shaders sometime in the future Heterogeneous-Computing Interface for Portability (HIP) looks promising.

HIP so far is AMD only, and can't run on Windows due to some restrictions of the Windows driver model, afaik. So it's AMD + Linux only, and thus even worse than Cuda.
This also is one reason SYCL is not yet widely supported, as AMD (or somebody else?) implements it on top of HIP, afaik. SYCL is modern C++ for GPUs, so quite interesting…

But it's all vendor specific, even if those vendors declare their standards to be open for others. OpenCL tried to solve this, but NV refused to implement 2.0 on consumer drivers.
NV now does support OpenCL 3.0, but 3.0 is actually a step back, making the advanced features of 2.0 (device side enqueue) optional, which NV still does not implement.
AMD has recently discontinued OpenCL support for their CPUs. So OpenCL is basically a dying API of decreasing support. CL 1.2 still has wide support, but is outdated, lacking both indirect dispatch and device command lists.

Microsofts AMP seems abandoned too, but this was Windows only anyway.

So the options are:
NV: Cuda
Intel: SYCL
AMD: HIP (Linux only)

We can't use any of this. But for games we don't really want to anyway. For games we better use the Compute options of our selected gfx API, because only then we have fine grained control over async execution and robust interop with gfx.
OpenGL or Vulkan has good cross platform support, so our situation isn't bad. Only problem is that the APIs do not evolve. We lack an alternative to device side enqueue or Cudas dynamic parallelism and there is no sign of progress.

Still, for game devs the situation is ok. It only sucks we depend on some API which might be outdated after some years, forcing us to port our code just to keep up.
But for anyone else the burden to deal with complex gfx APIs and using silly ‘shading languages’ to write general purpose code is quite big.

Issue with packed uints in shader on Intel HD-Chips

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Issue with packed uints in shader on Intel HD-Chips

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Reticulating splines