The general idea is to try to get the final output as close to the top of the output range (typically signed 16-bit) as possible without clipping (which produces harsh distortion). This both maximizes the resolution of the sampled waveform and conveniently means different media tend to play back at comparable volumes.
In a pre-recorded piece of audio (say a 90-minute movie), it is possible to normalize the entire thing exactly, so that the loudest point reaches the maximum of the range and everything else is scaled relative to it (a rough sketch of this peak normalization follows the list below). In live audio, the maximum loudness is hard to predict, so to avoid clipping the choice is normally either:
- have everything artificially quiet, to compensate for random local maxima
- or use dynamic compression
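As a minimal sketch of the offline peak normalization described above (the function name and types are illustrative, not from any particular library): find the loudest sample across the whole recording, then scale everything so that peak lands exactly at full scale of the 16-bit range.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Scale an entire pre-recorded buffer so its loudest sample just reaches
// full scale, then convert to signed 16-bit output.
std::vector<int16_t> normalizeToInt16(const std::vector<float>& samples)
{
    // Find the absolute peak across the whole recording.
    float peak = 0.0f;
    for (float s : samples)
        peak = std::max(peak, std::abs(s));

    // Gain that maps the peak to full scale; leave silence untouched.
    const float gain = (peak > 0.0f) ? (1.0f / peak) : 1.0f;

    std::vector<int16_t> out;
    out.reserve(samples.size());
    for (float s : samples)
        out.push_back(static_cast<int16_t>(std::lround(s * gain * 32767.0f)));
    return out;
}
```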
However, I will say this is subject to change somewhat in the future. There has been a general move towards floating-point processing of audio rather than integer-based, and most music / audio packages already operate on float data. Float has the advantage that it is not subject to digital clipping: intermediate values above full scale are still represented accurately and can be scaled back down later. This could mean that if you are outputting to your OS as float, there is no longer a strict need for dynamic compression on output, and it could be left to the OS / output device / user preferences.
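A rough illustration of that point (names are my own, not any specific API): when mixing in float, the intermediate sum may go above 1.0 without losing information, and a single scale at the end (or by the OS / output device) can bring it back into range, whereas an integer mix would have already clipped.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Mix several float sources; intermediate values above 1.0 are fine and
// nothing is irreversibly clipped.
std::vector<float> mixFloat(const std::vector<std::vector<float>>& sources)
{
    size_t len = 0;
    for (const auto& src : sources)
        len = std::max(len, src.size());

    std::vector<float> mix(len, 0.0f);
    for (const auto& src : sources)
        for (size_t i = 0; i < src.size(); ++i)
            mix[i] += src[i];            // may exceed 1.0 in float; no data lost

    // If the output stage wants [-1, 1], one final scale is enough.
    float peak = 0.0f;
    for (float s : mix)
        peak = std::max(peak, std::abs(s));
    if (peak > 1.0f)
        for (float& s : mix)
            s /= peak;
    return mix;
}
```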
Aside from these points, any medium is free to do what it wants audio-wise.
Typically though, for game audio, individual sounds will be normalized, and there will be a data file or scripts (maybe edited by designers) that determine their relative volume when played in game (and maybe other things like effects). This is combined with some programming (e.g. making a sound louder according to physics) and some kind of 2D/3D audio simulation to give things like pan and the falloff of sound the further it is from the listener.
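As a hedged sketch of that per-voice step (the function, parameters and the specific falloff curve are all illustrative assumptions, not a standard): a designer-supplied volume is multiplied by an inverse-distance falloff, and a simple constant-power pan splits the result between left and right.

```cpp
#include <algorithm>
#include <cmath>

struct StereoGain { float left; float right; };

// designerVolume: relative volume from the data file / script
// distance:       distance from the listener
// minDistance:    full volume inside this radius
// pan:            -1 = hard left, +1 = hard right
StereoGain computeVoiceGain(float designerVolume, float distance,
                            float minDistance, float pan)
{
    // Inverse-distance falloff, clamped so nearby sounds stay at full volume.
    float attenuation = minDistance / std::max(distance, minDistance);

    // Constant-power pan law keeps perceived loudness roughly even across the field.
    const float kPi = 3.14159265f;
    float angle = (pan + 1.0f) * 0.25f * kPi;   // 0 .. pi/2
    float gain  = designerVolume * attenuation;
    return { gain * std::cos(angle), gain * std::sin(angle) };
}
```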
An exception to normalizing sound effects individually is when a number of them are recorded with the same microphone configuration and settings, for example a series of footsteps or voice recordings. In this scenario the whole section of audio will often be normalized together and then split into separate sounds, so that their relative volumes stay correct. They could be normalized individually instead, but that would leave someone with the extra, unnecessary job of balancing their relative volumes in game.
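A short sketch of that batch approach, under the same illustrative assumptions as before: one gain is computed from the loudest sample across the whole batch and applied to every clip, which preserves their relative levels.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Normalize a set of related recordings (e.g. footsteps from one mic setup)
// with a single shared gain so their relative volumes are preserved.
void normalizeBatch(std::vector<std::vector<float>>& clips)
{
    // Global peak across all clips, not per clip.
    float peak = 0.0f;
    for (const auto& clip : clips)
        for (float s : clip)
            peak = std::max(peak, std::abs(s));

    if (peak <= 0.0f) return;           // all silence, nothing to do

    const float gain = 1.0f / peak;     // same gain everywhere keeps relative volumes
    for (auto& clip : clips)
        for (float& s : clip)
            s *= gain;
}
```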