It can be rather complicated as there are many options.
The behavior is different based on the system. Different hardware, different compilers, and different compiler settings all make a difference.
The DirectX libraries themselves have their own calling conventions, which the compiler supports precisely and which differ between the 32-bit and 64-bit versions. Your own functions can use those same calling conventions, but they can also end up compiled with lower-performance or higher-performance options.
The linked story talks of both SSE and AVX values. The two instruction sets differ both in the operations they offer and in how their values are passed.
Slightly older compilers would pass them on the stack. Newer compilers (MSVC 2015, for example) can potentially pass them through the XMM registers, with 64-bit code providing better support than 32-bit code because more registers are guaranteed on 64-bit processors.
Additionally, compilers that can target older chips often used the x87 FPU for floating point values, but newer versions use SIMD operations in the XMM registers for floating point values. 64-bit versions guarantee SSE and SSE2 features, which the compilers can rely on, where the 32-bit versions do not.
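To see which of those features the compiler has been told it may assume, you can check its predefined macros. This is only a rough sketch: the function name is made up for illustration, and it reports what the build settings allow, not what the CPU running the program actually has.

```cpp
#include <string>

// Hypothetical helper: reports the highest SIMD level the compiler was
// allowed to assume when this file was built. It reflects compiler
// options only, not the CPU the program eventually runs on.
inline std::string compiled_simd_level() {
#if defined(__AVX2__)
    return "AVX2";   // e.g. MSVC /arch:AVX2 or GCC/Clang -mavx2
#elif defined(__AVX__)
    return "AVX";    // e.g. MSVC /arch:AVX or GCC/Clang -mavx
#elif defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    return "SSE2";   // the baseline for 64-bit x86 builds
#elif defined(__SSE__) || (defined(_M_IX86_FP) && _M_IX86_FP == 1)
    return "SSE";
#else
    return "none";   // x87-only 32-bit build, or a non-x86 target
#endif
}
```

A default 64-bit x86 build should report at least SSE2, while a 32-bit build may report less depending on the options used.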
The concern about those switches is that you need to guarantee the features are available. By default 64-bit code assumes a processor built after about 2001, 17 years ago. If you use compiler options like /arch:AVX or /arch:AVX2 that guarantee the presence of additional features, like the YMM registers (ZMM registers need AVX-512 on top of that), the compiler can make different choices. So if you wanted to take advantage of AVX2 and its improved instructions and registers, you would be limited to Haswell-like (and later) processors, and your program would crash on older CPUs. On the other side, it is possible in 32-bit code to disable XMM, disable MMX, or to require the x87 FPU, effectively generating code that could run on CPUs from the 1990s. Those choices are entirely up to you as compiler options, useful if you can guarantee things about the target computer.
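If you cannot guarantee the target CPU, one common alternative is to build the baseline code normally and test for the feature at run time before taking a faster path. Here is a minimal sketch of that idea, assuming GCC, Clang, or MSVC on an x86 target. The names cpu_has_avx2, CodePath, and pick_code_path are invented for this example, and any other target simply reports no AVX2.

```cpp
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
  #include <intrin.h>     // __cpuid, __cpuidex
  #include <immintrin.h>  // _xgetbv
#endif

// Hypothetical helper: true only if the CPU reports AVX2 and the OS
// saves the wider register state across context switches.
inline bool cpu_has_avx2() {
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int regs[4];
    __cpuid(regs, 0);
    if (regs[0] < 7) return false;                  // CPUID leaf 7 missing
    __cpuid(regs, 1);
    const bool osxsave = (regs[2] & (1 << 27)) != 0;
    const bool avx     = (regs[2] & (1 << 28)) != 0;
    if (!osxsave || !avx) return false;
    if ((_xgetbv(0) & 0x6) != 0x6) return false;    // XMM and YMM state enabled
    __cpuidex(regs, 7, 0);
    return (regs[1] & (1 << 5)) != 0;               // EBX bit 5 = AVX2
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx2") != 0;     // GCC/Clang builtin
#else
    return false;                                   // unknown target: assume no
#endif
}

enum class CodePath { Baseline, Avx2 };

// Choose once at startup, then call the matching implementation.
inline CodePath pick_code_path() {
    return cpu_has_avx2() ? CodePath::Avx2 : CodePath::Baseline;
}
```

The AVX2 implementation itself would still need to live in a source file compiled with the AVX2 option, kept separate from the baseline code so the compiler does not emit those instructions on the fallback path.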
Certain calling conventions, such as __vectorcall (rather than __fastcall), can also make a difference, as can the ordering of parameters. 32-bit code can potentially pass up to six values in those registers, but again it all comes down to details like those listed above. You would need to specify those in your code.
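As a rough illustration of specifying the convention in your own code, the sketch below marks a function with __vectorcall when building with MSVC for x86 or x64 and falls back to the default convention elsewhere. EXAMPLE_VECCALL, scale_sum, and lane0 are made-up names for this example, and it assumes an x86 or x64 target where the SSE intrinsics header is available.

```cpp
#include <xmmintrin.h>  // __m128 and SSE intrinsics; assumes an x86/x64 target

// __vectorcall is a Microsoft extension. On other compilers the macro
// expands to nothing and the platform's default convention is used.
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
  #define EXAMPLE_VECCALL __vectorcall
#else
  #define EXAMPLE_VECCALL
#endif

// Vector parameters are listed first and the scalar last, since parameter
// order can affect which arguments end up in registers.
inline __m128 EXAMPLE_VECCALL scale_sum(__m128 a, __m128 b, float s) {
    return _mm_mul_ps(_mm_add_ps(a, b), _mm_set1_ps(s));
}

// Reads the lowest lane back as a plain float for inspection.
inline float lane0(__m128 v) {
    return _mm_cvtss_f32(v);
}
```

For example, scale_sum(_mm_set1_ps(1.0f), _mm_set1_ps(2.0f), 2.0f) yields 6.0f in every lane. Whether the arguments really travel in XMM registers still depends on the compiler, the target, and the options discussed above, so it is worth checking the generated assembly.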