For reference, see the Intel® Advanced Vector Extensions Programming Reference:
[table]
[th='2']Table 2-4. Instructions Requiring Explicitly Aligned Memory[/th]
[tr][td]Require 16-byte alignment[/td][td]Require 32-byte alignment[/td][/tr]
[tr][td](V)MOVDQA xmm, m128 [/td][td]VMOVDQA ymm, m256[/td][/tr]
[tr][td](V)MOVDQA m128, xmm [/td][td]VMOVDQA m256, ymm[/td][/tr]
[tr][td](V)MOVAPS xmm, m128 [/td][td]VMOVAPS ymm, m256[/td][/tr]
[tr][td](V)MOVAPS m128, xmm [/td][td]VMOVAPS m256, ymm[/td][/tr]
[tr][td](V)MOVAPD xmm, m128 [/td][td]VMOVAPD ymm, m256[/td][/tr]
[tr][td](V)MOVAPD m128, xmm [/td][td]VMOVAPD m256, ymm[/td][/tr]
[tr][td](V)MOVNTPS m128, xmm [/td][td]VMOVNTPS m256, ymm[/td][/tr]
[tr][td](V)MOVNTPD m128, xmm [/td][td]VMOVNTPD m256, ymm[/td][/tr]
[tr][td](V)MOVNTDQ m128, xmm [/td][td]VMOVNTDQ m256, ymm[/td][/tr]
[tr][td](V)MOVNTDQA xmm, m128 [/td][td]VMOVNTDQA ymm, m256[/td][/tr]
[/table]
[table]
[th]Table 2-5. Instructions Not Requiring Explicit Memory Alignment[/th]
[tr][td](V)MOVDQU xmm, m128[/td][/tr]
[tr][td](V)MOVDQU m128, xmm[/td][/tr]
[tr][td](V)MOVUPS xmm, m128[/td][/tr]
[tr][td](V)MOVUPS m128, xmm[/td][/tr]
[tr][td](V)MOVUPD xmm, m128[/td][/tr]
[tr][td](V)MOVUPD m128, xmm[/td][/tr]
[tr][td]VMOVDQU ymm, m256[/td][/tr]
[tr][td]VMOVDQU m256, ymm[/td][/tr]
[tr][td]VMOVUPS ymm, m256[/td][/tr]
[tr][td]VMOVUPS m256, ymm[/td][/tr]
[tr][td]VMOVUPD ymm, m256[/td][/tr]
[tr][td]VMOVUPD m256, ymm[/td][/tr]
[/table]
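As a rough illustration of what the two tables mean in practice, here is a minimal C sketch that checks whether an address satisfies the 16-byte or 32-byte requirement of the aligned forms in Table 2-4. The helper names (`is_aligned`, `can_use_aligned_xmm`, `can_use_aligned_ymm`) are made up for this example and do not come from the Intel reference.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if p is a multiple of `alignment`, 0 otherwise.
   `alignment` must be a non-zero power of two. */
static int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Hypothetical helpers: is the address usable with the aligned forms
   from Table 2-4, e.g. (V)MOVAPS xmm, m128 or VMOVAPS ymm, m256? */
static int can_use_aligned_xmm(const void *p) { return is_aligned(p, 16); }
static int can_use_aligned_ymm(const void *p) { return is_aligned(p, 32); }
```

If such a check fails, the unaligned forms from Table 2-5 (for example `VMOVUPS`) accept the same address.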
Section 3.6.4 of the Intel® 64 and IA-32 Architectures Optimization Reference Manual (http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf) states:
Misaligned data access can incur significant performance penalties. This is particularly true for cache line
splits. The size of a cache line is 64 bytes in the Pentium 4 and other recent Intel processors, including
processors based on Intel Core microarchitecture.
An access to data unaligned on 64-byte boundary leads to two memory accesses and requires several
µops to be executed (instead of one). Accesses that span 64-byte boundaries are likely to incur a large
performance penalty, the cost of each stall generally are greater on machines with longer pipelines.
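
To make the cache line split described in the quote concrete, the following C sketch tests whether an access of `size` bytes starting at `p` touches two cache lines. It assumes the 64-byte line size mentioned in the quote; `CACHE_LINE_SIZE` and `crosses_cache_line` are names invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, as stated in the quoted section. */
#define CACHE_LINE_SIZE 64u

/* Returns 1 if the byte range [p, p + size) spans a cache line boundary,
   0 if it fits inside a single line (or is empty). */
static int crosses_cache_line(const void *p, size_t size)
{
    uintptr_t first_line, last_line;

    if (size == 0)
        return 0;

    first_line = (uintptr_t)p / CACHE_LINE_SIZE;
    last_line  = ((uintptr_t)p + size - 1) / CACHE_LINE_SIZE;
    return first_line != last_line;
}
```

For example, a 32-byte access at offset 48 within a line covers bytes 48 to 79 and therefore splits, while the same access at offset 32 does not. A 16-byte or 32-byte access that is aligned to its own size can never split a 64-byte line, because 64 is a multiple of both sizes.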