If you flatten a branch then there will never be any divergence. If you check the compiler output, you will typically see an instruction sequence where both sides of the branch are evaluated, and then the correct result is selected using a conditional move. If you force a branch instead, the compiler will emit actual branching instructions, which could result in divergence within your warp/wave. If there is divergence, then performance and the number of instructions executed will be similar to the flattened case, but it will execute differently on the actual hardware (GPUs typically have a thread mask that they set appropriately at a branch point, which causes instructions from masked threads to have no effect). The main difference is that if all threads in the warp/wave take the same path, then an actual branch can skip over the other side entirely.
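
As a rough illustration, assuming the attributes in question are HLSL's `[flatten]` and `[branch]`, the two forms look something like the sketch below. The buffer names, the `CSMain` entry point, and the `PathA`/`PathB` helpers are all made up for the example, and what actually gets emitted still depends on your compiler and driver.

```hlsl
// Hypothetical resources, named only for this sketch.
StructuredBuffer<float>   Input  : register(t0);
RWStructuredBuffer<float> Output : register(u0);

// Stand-ins for the work done on each side of the branch.
float PathA(float x) { return x * x + 1.0f; }
float PathB(float x) { return x * 0.5f - 1.0f; }

float SelectFlattened(float x)
{
    float result;
    // Flattened: both PathA and PathB are evaluated for every thread,
    // then one value is selected. No divergence is possible.
    [flatten]
    if (x > 0.5f)
        result = PathA(x);
    else
        result = PathB(x);
    return result;
}

float SelectBranched(float x)
{
    float result;
    // Forced branch: real branch instructions are requested. If every
    // thread in the wave agrees on the condition, the other side is
    // skipped. If they disagree, both sides run with inactive threads
    // masked off.
    [branch]
    if (x > 0.5f)
        result = PathA(x);
    else
        result = PathB(x);
    return result;
}

[numthreads(64, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    float x = Input[id.x];
    // Swap in SelectFlattened(x) and compare the compiler output.
    Output[id.x] = SelectBranched(x);
}
```

Both functions return the same value for the same input. The only thing that changes is how the hardware gets there.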
To be honest, the flatten attribute is a bit of a leftover from the earlier days of programmable GPUs. Initially they didn't really have true branching at all, and later on branching tended to be quite slow, so you only wanted to use it in cases where it really made a difference. These days that's not really the case, although there could still be some small perf differences between a flattened branch and a true branch that's divergent, since the latter might still have a bit of overhead from setting up masks and issuing the branch instructions. Flattened branches can also be inefficient if there are many results and side effects from each side of the branch, since each one will require a conditional move to select the correct result.
IMO you probably don't need to worry about it in most cases: you can just write your code with branches and let the compiler sort it out. Mainly you just want to know what to expect in terms of when a branch can and can't save you performance, depending on how divergent the branch is within your warp/wave. For “early out” optimizations it can sometimes make sense to use wave intrinsics to only take the early-out case if all threads in the wave can take it. That way a wave never ends up executing both the early-out and the expensive path.
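
Here is a minimal sketch of that wave-uniform early-out, assuming HLSL with Shader Model 6.0 or later so that `WaveActiveAllTrue` is available. The buffers, the `CSMain` entry point, and `ExpensiveShade` are hypothetical names for illustration, not anything from a real codebase.

```hlsl
// Hypothetical resources, named only for this sketch.
StructuredBuffer<float>   Weights : register(t0);
RWStructuredBuffer<float> Output  : register(u0);

// Stand-in for an expensive computation. Note that it returns 0 when
// weight is 0, which matches what the early-out path writes.
float ExpensiveShade(float weight)
{
    float sum = 0.0f;
    for (uint i = 0; i < 64; ++i)
        sum += sin(weight * (float)i);
    return weight * sum;
}

[numthreads(64, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    float weight = Weights[id.x];
    bool canEarlyOut = (weight == 0.0f);

    // True only if canEarlyOut holds for every active lane in the wave,
    // so the whole wave either takes the early-out or skips it.
    if (WaveActiveAllTrue(canEarlyOut))
    {
        Output[id.x] = 0.0f;
        return;
    }

    // At least one lane needs the full path, so every lane runs it.
    Output[id.x] = ExpensiveShade(weight);
}
```

The catch with this pattern is that the expensive path has to produce the right answer for lanes that could have early-outed on their own, since they will run it whenever any other lane in their wave needs it.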