Glad that it helped. Using #define _XM_NO_INTRINSICS_ will just make your Release builds work the same as Debug builds with aligned XM types. It's not wrong per se, but if you're doing a lot of matrix or vector operations, it's going to cost a significant amount of extra CPU time.
If you are doing a fair amount of vector/matrix operations (say 50 or 100+ per frame) and don't want to build for an x64 target, I'd strongly recommend using the XMLoad and XMStore functions with temporary XMVECTOR/XMMATRIX local variables. They're really easy to use and you'll see a noticeable drop in CPU usage versus _XM_NO_INTRINSICS_, despite the Load/Store overhead.
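For the vector side, the load/compute/store pattern looks like this. A minimal sketch, assuming DirectXMath's XMLoadFloat3/XMStoreFloat3; the Projectile struct and member names are just made-up examples:

```cpp
#include <DirectXMath.h>
using namespace DirectX;

// Hypothetical example: members stored as unaligned XMFLOAT3,
// loaded into XMVECTOR locals only while doing the math.
struct Projectile
{
    XMFLOAT3 position; // storage type, no alignment requirement
    XMFLOAT3 velocity;

    void Integrate(float dt)
    {
        // XMVECTOR locals are automatically aligned on the stack
        XMVECTOR pos = XMLoadFloat3(&position);
        XMVECTOR vel = XMLoadFloat3(&velocity);

        // position += velocity * dt, computed with SSE intrinsics
        pos = XMVectorAdd(pos, XMVectorScale(vel, dt));

        // write the result back to the unaligned member
        XMStoreFloat3(&position, pos);
    }
};
```

The Load/Store pair is cheap relative to doing every component operation in scalar code, which is what _XM_NO_INTRINSICS_ forces.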
Aligning the allocations on the stack and heap is quite a bit more involved, and I can understand why you'd like to avoid that for now.
On that note, I just wanted to say thanks to RobTheBloke and Alessio1989, because I wasn't aware of the details of heap alignment myself either. At some point I want to use AVX instructions and I'll need to align to 32-byte boundaries, so the additional info about _aligned_malloc is appreciated.
quick edit:
Just for example's sake, assuming you store the world matrix for each object/mesh, this is how you'd use XMLoad when transposing a world matrix for setting constant buffers:
// change your functions to take XMFLOAT4X4 arguments instead of XMMATRIX ones
void UpdateObjectConstantBuffer(ID3D11DeviceContext* context, const XMFLOAT4X4& worldMatrix)
{
    // XMLoadFloat4x4 takes a pointer, so pass the address of worldMatrix;
    // the loaded XMMATRIX feeds straight into XMMatrixTranspose
    XMMATRIX transposedWorld = XMMatrixTranspose(XMLoadFloat4x4(&worldMatrix));
    // now map the constant buffer, copy over the transposedWorld matrix, unmap the buffer, etc., however you did it before
    ...
}
So you'd use XMFLOAT4X4 instead of XMMATRIX for the storage type in your header files (e.g. class/struct members), then use function-local XMMATRIX variables only when you need them for DirectXMath library functions. The local transposedWorld is automatically aligned on the stack, so it's pretty simple. More importantly, it'll let you drop your _XM_NO_INTRINSICS_ define and take advantage of SSE intrinsic performance without worrying about manual alignment.