Once upon a time, there was an article published on gamedev.net [Roy2013], which described a way (mostly taken, as the author says, from the Quake2 engine) to deal with little-endian/big-endian issues in games. While this approach is mostly sound ("mostly" because of the unaligned-access issues which will be discussed below), it is not the most efficient one. Better (simpler, faster, and more general) approaches do exist, and they will be discussed below.
What is Endianness
Endianness itself has been described in many different works, including [Roy2013] and [WikipediaEndianness]. Basically, it is the way a CPU stores multi-byte data in memory: little-endian systems store the least significant byte first, and big-endian ones store the most significant byte first. So, if you have
uint16_t x = 1234;
then x will look as {0xD2, 0x04} on a little-endian system, and as {0x04, 0xD2} on a big-endian system. As a result, code such as
send(socket,&x,2,0);
will send different data over the wire depending on system endianness (little-endian systems will send {0xD2,0x04}, and big-endian ones will send {0x04,0xD2}).
It is important to note that endianness effects cannot be observed unless there is some kind of cast between pointers to data of different sizes (in the example above, there was an implicit cast from &x, which is uint16_t*, to void*, which is actually treated as a byte pointer by the send() function). In other words, as long as we stay within arithmetic and bitwise operations and keep away from pointer casts, the result is always the same regardless of endianness. 2+3 is always 5, and (((uint16_t)0xAA)<<3)^0xCCCC is always 0xC99C, regardless of the system where our code is running. Let's name such calculations endianness-agnostic.
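As a quick illustration, here is a minimal self-contained sketch (the printout and the variable names are mine):

#include <cstdint>
#include <cstdio>

int main() {
  // endianness-agnostic: arithmetic and bitwise operations only, no pointer casts
  uint16_t y = (uint16_t)((((uint16_t)0xAA) << 3) ^ 0xCCCC); // 0xC99C on any system

  // endianness becomes observable only once we look at the bytes via a pointer cast
  const uint8_t* p = (const uint8_t*)&y;
  printf("value = 0x%04X, bytes in memory = {0x%02X, 0x%02X}\n", y, p[0], p[1]);
  // prints {0x9C, 0xC9} on little-endian systems and {0xC9, 0x9C} on big-endian ones
  return 0;
}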
Scope
First of all, where do we need to deal with little-endian/big-endian issues? In fact, there are only two scenarios I know of where it matters. The first one is reading files (which might have been written on a different machine), and the other one is network communication. From our perspective, both of these cases are essentially the same: we're transferring data from one machine to another.

Serialization/Marshalling
One thing which should be noted for both of these data-transfer-between-machines scenarios is that you should never transfer data as a raw C structure; instead, you should serialize/marshal it. Putting a C structure into a file (which may be read on another machine) or sending it over the network is a Really Bad Idea for several reasons. First of all, when writing/reading a C structure to external storage, you become a hostage to the implicit alignment rules of the compiler you're using. In general, when you have a structure such as
struct X {
  uint8_t a;
  uint32_t b;
};
then sizeof(X) won't be 5 bytes as some might expect; in many cases sizeof(X) will be 8 bytes (1 byte of a, then 3 unused bytes of padding just to make b aligned on a 4-byte boundary, and then 4 bytes of b), but this is not guaranteed at all. To make things worse, the amount of alignment is not specified by the standard, so when you're switching from one compiler to another, it may change (not to mention switching between CPUs); to make things even worse, it can be affected by compiler switches and, on a struct-by-struct basis, by things such as #pragma pack.
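Here is a minimal sketch to observe this effect with your own compiler (the exact numbers printed are implementation-specific, which is precisely the point):

#include <cstddef>
#include <cstdint>
#include <cstdio>

struct X {
  uint8_t a;
  uint32_t b;
};

int main() {
  // typically prints sizeof(X) = 8 and offsetof(X, b) = 4, due to the 3 padding
  // bytes after 'a'; neither value is guaranteed, and both may change with the
  // compiler, the target CPU, compiler switches, or #pragma pack
  printf("sizeof(X) = %zu, offsetof(X, b) = %zu\n", sizeof(X), offsetof(X, b));
  return 0;
}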
If you are using types such as int or long (rather than guaranteed-size types such as uint8_t and uint32_t), things get even worse (yes, this is possible) - due to the different sizes of these types on different platforms. Oh, and don't forget that variable-length strings and C++ containers are clearly off-limits.
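As an illustration of the int/long problem, here is another minimal sketch (the classic case being 64-bit Windows vs 64-bit Linux, i.e. LLP64 vs LP64 data models):

#include <cstdint>
#include <cstdio>

int main() {
  // sizeof(long) is 4 on 64-bit Windows (LLP64) but 8 on 64-bit Linux (LP64),
  // so any structure containing 'long' has different layouts on these two
  // platforms; sizeof(int32_t) is 4 everywhere
  printf("sizeof(long) = %zu, sizeof(int32_t) = %zu\n", sizeof(long), sizeof(int32_t));
  return 0;
}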
There are other (rather minor) reasons for avoiding writing C structures directly: you'll write more data than necessary, the data written will include garbage (which will hurt the ability to compress it), and so on. However, the most important issue is the (lack of) inter-platform and inter-compiler compatibility mentioned above.
These issues are so important, that in the networking world sending C structures over the network is universally considered a Big No-No.
So, what should you do when you need to send a C structure over the network (or save it to a file)? You should serialize it first (in the networking world, the term "marshalling" is generally preferred, though it is essentially the same thing).
Implementing Serialization
The idea behind serialization is simple: for the struct X above, you write one byte of a and 4 bytes of b, avoiding alignment issues altogether. In fact, you can go further and use, for example, VLQ [WikipediaVLQ] variable-length encoding, or put null-terminated strings into your serialized data. One way of serializing data (the one I prefer) is to have serialize/deserialize functions such as
void serialize_uint16(DataBlock&, uint16_t);//DataBlock should grow as the data is serialized
uint16_t deserialize_uint16(Parser&);//there is constructor Parser(DataBlock&)
When we have these functions implemented, serializing our struct X will look like
DataBlock data;
serialize_uint8(data,x.a);
serialize_uint32(data,x.b);
(and deserializing will look similar).
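For illustration, the matching deserialization might look as follows (a sketch, assuming that deserialize_uint8()/deserialize_uint32() counterparts exist alongside the deserialize_uint16() declared above):

Parser parser(data);
X x2;
x2.a = deserialize_uint8(parser);//fields MUST be read in exactly the same order as they were written
x2.b = deserialize_uint32(parser);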
So far so good; now let's see how we can implement our serialize_uint16() function. If we implement it along the lines of [Roy2013], it would look like this:
void serialize_uint16(DataBlock& data,uint16_t u16) {
  void* ptr = data.grow(2);//add 2 bytes to the end of data, and return pointer to these 2 bytes
  u16 = LittleShort(u16);//calling ShortSwap on big-endian systems, and ShortNoSwap on little-endian systems
  *(uint16_t*)ptr = u16; //(*)
}
This would work fine on x86 and x86-64, but on other platforms the line marked as (*) may run into problems. The problem is that our ptr might be either even or odd; and if it is odd, some CPUs will refuse to perform a 2-byte access at that address (they will also usually refuse 4-byte accesses unless the address is a multiple of 4, and so on). This never happens on x86/x86-64, but it does happen on SPARC, and may or may not happen on ARM (unless we specify the __packed qualifier for uint16_t*, which is not universally available).
Another Popular Alternative
Another popular alternative (thanks to Servant of the Lord for reminding me of it) is based on LITTLE_ENDIAN and BIG_ENDIAN macros. In some sense, it can be seen as the same serialize_uint16() as above, but with a different implementation for LittleShort(), BigShort(), etc.:
//code courtesy of Servant of the Lord
#ifdef LITTLE_ENDIAN
#define BigShort(x) ShortSwap(x)
#define LittleShort(x) (x) //Do nothing, just 'return' the same variable.
#define BigLong(x) LongSwap(x)
#define LittleLong(x) (x)
#define BigFloat(x) FloatSwap(x)
#define LittleFloat(x) (x)
#elif defined(BIG_ENDIAN)
#define BigShort(x) (x)
#define LittleShort(x) ShortSwap(x)
#define BigLong(x) (x)
#define LittleLong(x) LongSwap(x)
#define BigFloat(x) FloatSwap(x)
#define LittleFloat(x) (x)
#else
#error No idea about endianness
#endif
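The ShortSwap()/LongSwap()/FloatSwap() helpers used by these macros are not shown above; a minimal sketch of what such byte-swapping functions typically look like (the names follow the Quake2-style convention used in the macros, but the bodies below are my own):

#include <cstdint>
#include <cstring>

uint16_t ShortSwap(uint16_t x) {
  return (uint16_t)((x << 8) | (x >> 8));//swap the two bytes
}

uint32_t LongSwap(uint32_t x) {
  return (x << 24) | ((x << 8) & 0x00FF0000u) | ((x >> 8) & 0x0000FF00u) | (x >> 24);
}

float FloatSwap(float f) {
  // swap the 4 bytes of the float's object representation;
  // memcpy avoids strict-aliasing and unaligned-access issues
  uint32_t u;
  memcpy(&u, &f, sizeof(u));
  u = LongSwap(u);
  memcpy(&f, &u, sizeof(f));
  return f;
}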
While this macro-based approach is faster and less bulky than the previous one (see the "Performance Analysis" section below), it has the same problem with unaligned reads/writes on non-x86 platforms :-(. In other words, for serialization purposes it won't work, for example, on SPARC (and it is not guaranteed to work on ARM).
What Is to be Done?
What is to be done?
-- the title of a novel by Nikolay Chernyshevsky, 1863 --

The answer is quite simple, actually. Instead of writing two bytes as one chunk, we can always write them byte by byte:
void serialize_uint16(DataBlock& data,uint16_t u16) {
  uint8_t* ptr = data.grow(2);
  *ptr++ = (uint8_t)u16;
  *ptr = (uint8_t)(u16 >> 8);
}//deserializing is very similar
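For example, the matching byte-by-byte deserializer might look as follows (a sketch; the Parser::consume() member used here, which returns a pointer to the next 2 bytes of the parsed data and advances the parser, is my assumption about the Parser interface):

uint16_t deserialize_uint16(Parser& parser) {
  const uint8_t* ptr = parser.consume(2);//hypothetical member: returns the next 2 bytes and advances the parser
  uint16_t low = ptr[0];
  uint16_t high = ptr[1];
  return (uint16_t)(low | (high << 8));//all reads are exactly 1 byte in size, so again endianness-agnostic
}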
This byte-by-byte technique is well-known (see, for example, [Pike2012], though the idea had been known for years before that); a Really Good Thing about it is that we don't need to care about endianness at all(!). This works because the code above doesn't perform any pointer casts and calculates all the bytes in a completely endianness-agnostic manner (see the "What is Endianness" section above); and all writes are exactly 1 byte in size, so there is no chance for endianness to manifest itself.
While [Pike2012] argues that all the other marshalling methods represent a fallacy, I'm not so sure about it, and will describe an improvement over byte-by-byte marshalling in a moment.
Further Optimization
When I really care about performance (which I usually do, as server-side handling of billions of network messages per day is quite a significant load), I often add special handling for the platforms of most interest, for example (taken from [NoBugs2015]):
void serialize_uint16(DataBlock& data, uint16_t u16) { //assuming little-endian order on the wire
  uint8_t* ptr = data.grow(2);
#if defined(__i386) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64)
  *(uint16_t*)ptr = u16; // safe and fast, as x86/x64 are ok with unaligned writes
#else
  *ptr++ = (uint8_t)u16;
  *ptr = (uint8_t)(u16 >> 8);
#endif
}
With this approach, we have the best of both worlds: (a) a universal version (the one under #else) which works everywhere, and (b) an optimized version for those specific platforms where we know for sure that it works efficiently.
One thing I would emphasize is the importance of using unsigned types when doing these kinds of integer operations. In particular, deserializing from a signed byte array may give weird results whenever any of the bytes happens to be negative.
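For example, here is a minimal sketch showing this kind of trouble (and how going through uint8_t avoids it):

#include <cstdint>
#include <cstdio>

int main() {
  const char buf[2] = { (char)0xFF, 0x01 };//wire bytes of the value 0x01FF

  // WRONG if char is signed: buf[0] converts to int -1 (sign extension),
  // so the whole expression is -1 and 'bad' ends up as 0xFFFF, not 0x01FF
  uint16_t bad = (uint16_t)(buf[0] | (buf[1] << 8));

  // OK: go through uint8_t first, so each byte stays within 0..255
  uint16_t good = (uint16_t)((uint8_t)buf[0] | ((uint8_t)buf[1] << 8));

  printf("bad = 0x%04X, good = 0x%04X\n", bad, good);
  return 0;
}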