Is an IF always a bad thing?

Started by NikiTo, June 19, 2018 03:18 PM
12 comments, last by JoeJ 6 years, 7 months ago

For example:

[loop]
while (i < thread.x) {
  [branch]
  if (something) {
    // ... a lot, a ton of code here
    // ... not reading any kind of memory, only computations
    // ... write the result of the computations to the destination
  }
  i++;
}

I could remove the IF and add some non-branching logic just before leaving the while block. This logic would mask out the unused result and write only the useful result to the destination.
The result is the same.
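
The two variants described above can be compared in a small CPU-side model. This is only a sketch in Python, not shader code: `heavy_compute`, `with_branch` and `with_mask` are hypothetical names, and the "ton of code" is replaced by a trivial stand-in. It counts work per lane and does not model lockstep execution, so it only shows that both variants write the same values while the masked variant evaluates the body for every lane.

```python
def heavy_compute(x):
    # Stand-in for the large, memory-free computation (assumption).
    return x * x + 1

def with_branch(inputs, conditions, dest):
    """Variant 1: only lanes whose condition holds run the computation."""
    evaluated = 0
    for lane, (x, cond) in enumerate(zip(inputs, conditions)):
        if cond:
            dest[lane] = heavy_compute(x)
            evaluated += 1
    return evaluated

def with_mask(inputs, conditions, dest):
    """Variant 2: every lane computes, then a select keeps or discards it."""
    evaluated = 0
    for lane, (x, cond) in enumerate(zip(inputs, conditions)):
        result = heavy_compute(x)
        evaluated += 1
        # Branch-free select: keep the old destination value when cond is false.
        mask = int(cond)
        dest[lane] = mask * result + (1 - mask) * dest[lane]
    return evaluated

inputs = [1, 2, 3, 4]
conditions = [True, False, False, True]
dest_a = [0, 0, 0, 0]
dest_b = [0, 0, 0, 0]
work_a = with_branch(inputs, conditions, dest_a)  # body evaluated for 2 lanes
work_b = with_mask(inputs, conditions, dest_b)    # body evaluated for all 4 lanes
```

Both destinations end up identical; only the number of body evaluations differs.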

In my code it is very improbable that all the lanes choose the same path at the IF. But when it does happen, even if very rarely, it will skip a ton of code.
And even if it never happens that all the lanes take the same path, I would prefer to stall part of the silicon instead of making it do something completely useless.
Would stalling part of the lanes inside a wave at least save power and heat the GPU less?
What if the skipped block contains instructions that take a different amount of time to compute? AFAIK division and square root can take a different amount of time depending on the data. Isn't it better to avoid those using an IF?

I also read somewhere that something costly, apart from divergence, happens on the GPU with IFs: some kind of stacking of decisions that takes time. This could change the whole picture.

I know the documentation sometimes even recommends re-computing things, but I just don't feel comfortable with the concept of doing something absolutely useless. This is one of the things I cannot get along with in parallel computing yet.

Sometimes, when no big block of code is going to be skipped, one IF seems to take only two operations, while trying hard to avoid branching with bitwise logic takes more than 5 operations. And I don't know how many operations those step() and sign() functions take. Sometimes people read twice from a texture and then use step() to mix the results, only to avoid branching, but that adds an extra fetch from memory.

Doesn't new hardware (built in the last three years) handle IFs in a better way?

2 hours ago, NikiTo said:

Doesn't new hardware (built in the last three years) handle IFs in a better way?

GPU multithreading is not the same as CPU multithreading. It is better described as scalable vectorization than as multithreading. All cores process the same algorithm on different source data, and they do it synchronously: all cores take the same step at the same time. So conceptually a GPU cannot skip branches, because that would break the step synchronization. The GPU computes both branches, but masks the output for the threads in which the program logic skipped that branch.

#define if(a) if((a) && rand()%100)

Whether or not IFs are costly depends on how the architecture handles branches. As Fulcrum.013 has pointed out, the highly parallelized nature of a GPU precludes branch prediction. The GPU is way too tuned and way too busy to really care. The top answer in that link says that with newer hardware it's not something to worry about: the GPU is running so many threads that if a branch fails, the core just goes to some other thread. Not to mention that shader code is rarely 'branchy'. Long branches might be problematic. If you are worried, or the profiler says so, it would be better to organize your rendering ahead of time into discrete operations.

My code example has only one branch. It has no ELSE. I don't see how "the GPU always executes both sides of a branch" can apply here.

I have read that the GPU has an execution mask. On reaching an IF, the GPU pushes onto a stack the entry point of the ELSE statement, if one is present, and the exit point where the whole IF/ELSE ends. Only the lanes of the warp whose mask bit is TRUE execute something; the lanes masked FALSE execute NOPs. Then, if there is an ELSE, the decision mask is inverted (NOT) and the warp executes the ELSE. Finally the execution mask is set back to all TRUE and the GPU pops from the stack the pointer to the place where the IF/ELSE exits.
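
The mask mechanism described above can be sketched as a toy model of a single warp. This is Python, not real hardware behaviour: `run_if_else` and its parameters are hypothetical names, the "stack" holds only the mask to restore, and real GPUs differ in the details. The model also skips a side completely when no lane wants it, which is an assumption about the hardware rather than something stated in the post.

```python
def run_if_else(conditions, then_fn, else_fn, values):
    """Simulate one warp executing an if/else under an execution mask."""
    lanes = len(conditions)
    mask_stack = []
    exec_mask = [True] * lanes          # all lanes active on entry
    out = list(values)
    passes = 0                          # how many sides the warp walks through

    mask_stack.append(exec_mask)        # remember the mask to restore at the exit
    then_mask = [m and c for m, c in zip(exec_mask, conditions)]
    else_mask = [m and not c for m, c in zip(exec_mask, conditions)]

    if any(then_mask):                  # skipped when no lane takes the IF side
        passes += 1
        for lane in range(lanes):
            if then_mask[lane]:         # masked-off lanes do nothing (NOP)
                out[lane] = then_fn(out[lane])

    if else_fn is not None and any(else_mask):
        passes += 1
        for lane in range(lanes):
            if else_mask[lane]:
                out[lane] = else_fn(out[lane])

    exec_mask = mask_stack.pop()        # reconverge: restore the entry mask
    return out, passes

divergent = run_if_else([True, False, True, False],
                        lambda v: v + 10, lambda v: v - 1, [1, 2, 3, 4])
no_else_all_false = run_if_else([False] * 4, lambda v: v + 10, None, [1, 2, 3, 4])
```

In this model a divergent warp with an ELSE walks through two sides, while an IF without ELSE whose condition is false on every lane costs no pass at all.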

For my initial example, the NOPs and the decision mask make a lot of sense. In my example it is logically impossible for a GPU to execute the stalling lanes too and then mask the output, because the GPU has no way to know what the output is. I could have changed some important variable in the middle of the block. It would be complex for a compiler to watch over all this, or at least much more complex than using the execution mask and NOPs.

Writing code where everything is forced to be parallel, like the example I gave of computing everything anyway and then using bitwise logic to discard the result, is in my opinion much worse than an IF. The documentation says that when there is more than one warp per CU, the hardware can execute another warp instead of stalling. For example, it can execute both sides (if there is an ELSE) using two warps if one of the warps is stalled waiting for data to be fetched from slow memory.

I don't think IFs should be avoided at all costs, certainly not at the cost of multiple reads from slow memory.

I could avoid IFs by appending to buffers only the data used by the next pass, or I could compute two passes in one, letting some lanes stall. It is arguable which is better. For my previous pipeline I was using stencils to implement the IF logic, but writing to a stencil, reading it back, and the operations AMD does in shaders (not in hardware) to sort approved from unapproved pixels could take the same time. If the algorithm asks for an IF, I don't see a way to completely remove that IF. Somebody, somewhere in the hardware, has to pay for that IF.

I stalled too long as a programmer trying to completely remove all IFs. I need to move on.

On 6/19/2018 at 5:18 PM, NikiTo said:

This is one of the things I can not get along with in parallel computing yet.

I'll give you a different example. Say we have n-body problems of different sizes between 64 and 512.

The most efficient way to process them would be to use the same algorithm but with different workgroup sizes of 64, 128, 256 and 512.

Then you sort the problems by the proper workgroup size, so a problem of size 200 is processed by a workgroup of size 256. This results in a need to dispatch 4 shaders instead of just one of the largest size (512). With enough work the additional overhead will pay off.

That's all fine, but on average still only 75% of the lanes will have work. There's nothing you can do about it. You have to accept it; it can't be done any better.
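
The binning and the 75% figure can be checked with a few lines of Python. This is a sketch under assumptions that are not in the post: problem sizes are taken to be spread evenly over the range, utilization is measured as used lanes divided by allocated lanes, and `pick_workgroup_size` and `average_utilization` are hypothetical helper names.

```python
WORKGROUP_SIZES = (64, 128, 256, 512)

def pick_workgroup_size(problem_size):
    """Smallest available workgroup size that fits the problem."""
    for size in WORKGROUP_SIZES:
        if problem_size <= size:
            return size
    raise ValueError("problem too large for the available workgroup sizes")

def average_utilization(problem_sizes):
    """Fraction of allocated lanes that actually have a body to process."""
    used = sum(problem_sizes)
    allocated = sum(pick_workgroup_size(n) for n in problem_sizes)
    return used / allocated

size_for_200 = pick_workgroup_size(200)                     # 256, as in the example
utilization = average_utilization(list(range(64, 513)))     # one problem of every size
```

With one problem of every size from 64 to 512, the utilization comes out at roughly three quarters, in line with the figure above. A single dispatch with workgroup size 512 for everything would do worse.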

I've often tried to implement things like a queue inside a workgroup to keep all lanes busy, but it was rarely a win over a simpler algorithm where some lanes work much longer than others, and when it was a win, it was only a fraction of what I had hoped for.

On 6/19/2018 at 5:18 PM, NikiTo said:

I also read somewhere that something costly, apart from divergence, happens on the GPU with IFs: some kind of stacking of decisions that takes time. This could change the whole picture.

I've read this too (I think it's mostly about the register cost of maintaining control flow), but in practice you can't choose anyway.

I've noticed it is definitely worth it to use an IF to skip memory access, even from LDS. (It may have been different a decade ago.)

On 6/19/2018 at 5:18 PM, NikiTo said:

I know the documentation sometimes even recommends re-computing things, but I just don't feel comfortable with the concept of doing something absolutely useless.

If recomputing saves registers, it will be worth it eventually (but often the compiler overrides your decisions).

On 6/19/2018 at 5:18 PM, NikiTo said:

Doesn't new hardware (built in the last three years) handle IFs in a better way?

Let's say you have a workgroup size of 256, but only 180 lanes are active. In this case the last wavefront may be able to skip execution. Whether this truly happens, or whether it is even more fine-grained (thinking of SIMD units processing only 16 lanes at a time), I do not know, but it may work on some (or at least on future) GPUs. So I try to utilize this.
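
To make the 256/180 example concrete, here is a small Python sketch that counts active lanes per wavefront. It assumes the active lanes are the first 180 thread indices and a wavefront size of 64; `wave_activity` is a hypothetical helper, and whether real hardware skips an empty wavefront (or an empty 16-lane slice) is exactly the open question from the post.

```python
def wave_activity(workgroup_size, active_lanes, wave_size):
    """Return, per wave, how many of its lanes have work.

    Assumes lanes 0 .. active_lanes-1 are the active ones.
    """
    counts = []
    for first_lane in range(0, workgroup_size, wave_size):
        last_lane = min(first_lane + wave_size, workgroup_size)
        active = max(0, min(active_lanes, last_lane) - first_lane)
        counts.append(active)
    return counts

per_wave_64 = wave_activity(256, 180, wave_size=64)
per_slice_16 = wave_activity(256, 180, wave_size=16)
empty_slices = per_slice_16.count(0)
```

With 64-wide waves the last wave has no active lane at all, and the third one is only partly filled. At a granularity of 16 lanes, four slices are completely empty.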

Personally I think using IFs to save work is always good, and avoiding IFs never made much sense. Maybe it made more sense on early GPUs, but what is meant by all this is just: be aware that lanes operate in lockstep.

39 minutes ago, NikiTo said:

My code example has only one branch. It has no ELSE. I don't see how "the GPU always executes both sides of a branch" can apply here.

You're right.

39 minutes ago, NikiTo said:

For my previous pipeline, I was using Stencils to fulfil the IF logic

(I skipped some things that were too technical for me to follow quickly, but I don't think they are important anyway.)

So by using the stencil you utilized a higher-level mechanism to pack work together (pixel quads where all stencil values are zero will be skipped, but if only one pixel is nonzero, the other lanes will have no work).

And this is exactly what you should do: try to pack work so that workloads of similar length are likely to end up in the same wavefronts (but also pack it so that nearby threads access nearby memory; the two goals can contradict each other).
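
The effect of packing similar workloads together can be illustrated with a simple lockstep cost model in Python. It assumes that a wavefront takes as long as its slowest lane and ignores memory locality completely; `total_wave_cost` and the numbers are made up for illustration.

```python
def total_wave_cost(work_per_thread, wave_size):
    """Sum over waves of the longest workload in that wave (lockstep model)."""
    total = 0
    for start in range(0, len(work_per_thread), wave_size):
        total += max(work_per_thread[start:start + wave_size])
    return total

# Hypothetical per-thread work lengths: short and long items interleaved.
work = [1, 100, 1, 100, 1, 100, 1, 100]

unsorted_cost = total_wave_cost(work, wave_size=4)        # every wave holds a long item
sorted_cost = total_wave_cost(sorted(work), wave_size=4)  # long items share one wave
```

In this model the interleaved layout costs 200 units and the sorted layout 101, although the total amount of useful work is identical. The sort itself and any loss of memory coherence are not accounted for.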

Avoiding IFs surely has no priority; the advice seems outdated.

1 hour ago, JoeJ said:

There's nothing you can do about it. You have to accept it, it can't be done any better. 

yeah

I remember a related example that puzzled me:

Each thread has to process a number of nodes between 0 and 4 (but this number differs by at most 1 between all threads in a workgroup, e.g. most threads process 4 nodes, but some later threads only 3).

Within this outer loop each node has some common work to process, like loading its data, and some conditional work depending on whether it is a leaf or not (so an if AND an else, but no inner loops, just some instructions).

I've also implemented this algorithm in a different way: one outer loop to process all interior nodes, followed by a second outer loop for the leaves. Notice that in this case there will be more idle threads, the program is almost twice as long, and the previously common work is processed twice.

I expected the first approach to be faster. It was faster on Nvidia Vulkan, but slower on AMD Vulkan. On AMD OpenCL the first approach was also faster.

The difference in performance is at least large enough that it is worth maintaining both branches. (My code is full of such ifdefs and completely unreadable for that reason, but this is how I optimize for different hardware.)

I do not understand how the second approach can be faster although it does twice the work!

Assumptions:

* Distributing the loads from memory may help to prevent bandwidth peaks

* The first approach becomes too complicated, needs more registers, and occupancy decreases (I could check such things only with OpenCL)

* Or the compiler somehow acts suboptimally (yeah, it must be this! It's always this!)

In any case it was just one of many examples that taught me: keeping all threads busy is not as important as you think (and I still refuse to believe this lesson :) )

Also important: you cannot really learn that much from such special cases. In the next shader just the opposite may happen.

On 6/19/2018 at 10:39 AM, Fulcrum.013 said:

GPU multithreading is not the same as CPU multithreading. It is better described as scalable vectorization than as multithreading. All cores process the same algorithm on different source data, and they do it synchronously: all cores take the same step at the same time. So conceptually a GPU cannot skip branches, because that would break the step synchronization. The GPU computes both branches, but masks the output for the threads in which the program logic skipped that branch.

This is not accurate. GPUs can absolutely use true flow control operations, with the caveat that the flow control is coherent across a group of threads that execute in lockstep. Modern GPUs generally use SIMD hardware that's anywhere from 8-wide to 64-wide, and require the branch condition to be uniform across the whole SIMD to be able to actually take the branch. GPUs only have to resort to lane masking and predication when the result of the branch condition differs across a group of threads on the same SIMD unit.

In summary, whether or not a branch/loop actually skips instructions depends on your condition and your grouping of threads. For instance, if you're branching in a pixel shader, you'll want to make sure that the branch condition will be the same across neighboring pixels in the same area of the screen. Or if you branch on a value from a constant buffer that's not dynamically indexed, you can know for sure that all of your threads will take the same path.
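
The dependence on condition and thread grouping can be shown with a tiny Python model of an if/else. It is a sketch under the assumption that a SIMD group issues one code path when its condition is uniform and both paths when it is not; `paths_executed`, the 4-wide group and the two condition patterns are illustrative choices, not real hardware parameters.

```python
def paths_executed(conditions, wave_size):
    """Count code paths issued across all waves for an if/else (lockstep model)."""
    total = 0
    for start in range(0, len(conditions), wave_size):
        wave = conditions[start:start + wave_size]
        takes_if = any(wave)        # at least one lane needs the IF side
        takes_else = not all(wave)  # at least one lane needs the ELSE side
        total += int(takes_if) + int(takes_else)
    return total

coherent = [lane < 8 for lane in range(16)]        # neighbours agree
scattered = [lane % 2 == 0 for lane in range(16)]  # neighbours disagree

coherent_paths = paths_executed(coherent, wave_size=4)
scattered_paths = paths_executed(scattered, wave_size=4)
```

Both patterns have the condition true on exactly half of the lanes, yet the coherent one issues 4 paths and the scattered one 8, because every group in the scattered case has to run both sides.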

On 6/19/2018 at 7:39 PM, Fulcrum.013 said:

GPU multithreading is not same as CPU multithreading.

It's worth mentioning (again) async compute, which is more similar to CPU multithreading.

Here we use multiple queues (instead of 1 or 2 threads per core), and if the GPU supports it the work will execute in parallel.

To join the individual workloads, there are synchronization commands for the various APIs.

The downside is a pretty high cost coming from the overhead of using multiple queues and sync, and the need to divide a single command list into multiple command lists. The cost is higher than a simple memory barrier within a single queue. (This is where i see the most need to improve current APIs / drivers.)

With Vulkan and a Fury X I noticed that only the graphics/compute queue offers full performance. The other 3 compute queues seem to be limited to utilizing only half of the GPU. (This is undocumented, and because I initially used only the latter queues for my tests, I got only disappointing results. That's the reason why I post this again and again...)

In the end I got close to optimal results with my tests (but I am still missing real-world experience). Because all this seems very hardware or API dependent, it's a good reason to have some node-based abstraction on top of the APIs, so it's easy to experiment and find good configurations.

But there is also an easier way to utilize async compute. If you have dispatches that do not depend on each other's results (no memory barriers), you can (and should) execute them in the same queue, and the GPU will run them in parallel automatically without any downsides. The N-body example from above with its 4 dispatches is an example of this.

(I don't know if any of this might work on Nvidia GPUs.)

Related to the IF/ELSE topic is the size of the warp too. On NV it is 32, which in principle should make it handle SIMD-divergent code better than AMD's 64, since fewer lanes have to agree on a branch. Still, for some reason I prefer AMD; it has always been this way for me. I have always bought only AMD/ATI and Intel. I am an "Intel/ATI guy" for some reason.
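
The warp-size remark can be quantified with a rough probabilistic sketch in Python. It rests on a strong assumption that is not in the thread: every lane takes the branch independently with the same probability, which real workloads rarely satisfy. `skip_probability` is a hypothetical helper that gives the chance that no lane in a warp takes the branch, so the warp can skip the block.

```python
def skip_probability(p_lane_takes_branch, wave_size):
    """Chance that an entire wave can skip the block (independent lanes assumed)."""
    return (1.0 - p_lane_takes_branch) ** wave_size

p = 0.01  # assumed per-lane probability of needing the expensive block
skip_32 = skip_probability(p, 32)
skip_64 = skip_probability(p, 64)
```

Under this model the skip chance for 64 lanes is the square of the chance for 32 lanes, so the narrower warp is always at least as likely to skip, but the advantage is not a fixed factor of two; it depends on how often lanes take the branch.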

This topic is closed to new replies.