While there are often general-purpose file formats or data representations available to achieve a given goal, in cases where encoded size is significant it may make sense to create something a little more specialized.
I will attempt to address some of the relative tradeoffs here, based on my own experiences designing and working with various file-formats.
I will not address the matter of ASCII Text vs Binary file-formats here, and will assume the case where the data in question could not be viably represented in an ASCII format.
Similarly, I will focus more here on formats that can be processed "reasonably efficiently", rather than those focusing on maximizing compression. The primary use cases here are assumed to be for things like network communication and storage of game assets.
Note: I am not presently certain whether this is a "good" topic, or if it is presented adequately. Maybe it should be split up into multiple parts? (As-is, it is a little long).
Byte-Oriented Formats
These are generally much simpler than bit-oriented formats.
In the vast majority of cases, a byte-oriented format is likely to be sufficient for general data storage needs, with bit-oriented formats left for cases where either the data is overly bulky, or there is a significant need to minimize storage.
Generally, handling byte-oriented data tends to be faster than handling bit-oriented data.
Within the confines of bytes there are still a few things that can be done to reduce storage requirements:
- Variable-Length Numbers;
- Encoding more items in fewer bytes.
Additionally, these are more fundamental in the sense that most bit-oriented formats will still usually require some sort of byte-oriented packaging format (such as using a TLV format to hold things like headers and similar).
Additional issues to consider:
- Flexibly representing data structures and allowing for future expansion;
- Allowing for seeking or resynchronization in stream oriented formats;
- Allowing for things like random access to (potentially compressed) data.
Variable Length Numbers
One simple trick for saving bytes is based on a pattern:
Small numbers tend to be more common than big numbers.
So, rather than always using 2, 4, or 8 bytes for a number, it may instead make sense to use fewer bytes when the value is small enough to allow it.
A fairly common scheme is to have small values represented as a single byte, with the high-order bit(s) used to indicate when the value is larger than a single byte.
For Example:
0-127 -> 0xxxxxxx
128-16383 -> 10xxxxxx xxxxxxxx
16384-2M -> 110xxxxx xxxxxxxx xxxxxxxx
...
Another variant is using the high bit in each byte to indicate that the following byte contains more bits:
0-127 -> 0xxxxxxx
128-16383 -> 1xxxxxxx 0xxxxxxx
16384-2M -> 1xxxxxxx 1xxxxxxx 0xxxxxxx
...
With it then being specified whether the bytes come in high-low or low-high order.
The advantage of the former scheme is that the size of the value can be seen directly from the first byte, which can be used for a more efficient decoder, though with the tradeoff that the encoding/decoding logic tends to be more complex.
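As a rough sketch, a decoder for the first scheme (limited to the three sizes shown above, and reading from an in-memory buffer rather than a stream, which is an assumption made here for illustration) might look like:

```c
#include <assert.h>

/* Sketch: decode a length-prefixed variable number (first scheme),
 * reading from a byte buffer and advancing *pos. */
static unsigned int DecodePrefixVLN(const unsigned char *buf, int *pos)
{
    unsigned int i = buf[(*pos)++], val;
    if (!(i & 0x80))            /* 0xxxxxxx: value in 0..127 */
        return i;
    if (!(i & 0x40)) {          /* 10xxxxxx xxxxxxxx: up to 16383 */
        val = (i & 0x3F) << 8;
        return val | buf[(*pos)++];
    }
    /* 110xxxxx xxxxxxxx xxxxxxxx: up to 2M */
    val = (i & 0x1F) << 8;
    val |= buf[(*pos)++];
    return (val << 8) | buf[(*pos)++];
}
```

Note that the first byte alone tells the decoder how many bytes follow, which is what allows the branchier but potentially faster decoding mentioned above.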
The advantage of the second is that the decoding logic is fairly simple, for example:
i=ReadByte(); val=0;
while(i&0x80)
{
    val=(val<<7)|(i&0x7F);
    i=ReadByte();
}
val=(val<<7)|i;
The encoding logic is usually a little more involved, though the level of complexity is fairly similar with either strategy (typically an if/else tree or a match/encode loop).
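As an example of the match/encode-loop style, an encoder for the second (continuation-bit) scheme, emitting groups in high-to-low order to match the decoder above and writing to an in-memory buffer (an assumption for illustration), might be sketched as:

```c
#include <assert.h>

/* Sketch: encode 'val' in the continuation-bit scheme, high-to-low
 * group order; returns the number of bytes written to 'buf'. */
static int EncodeVLN(unsigned int val, unsigned char *buf)
{
    unsigned char tmp[5];
    int i, n = 0;
    tmp[n++] = val & 0x7F;        /* last group: high bit clear */
    val >>= 7;
    while (val) {                 /* remaining groups: high bit set */
        tmp[n++] = 0x80 | (val & 0x7F);
        val >>= 7;
    }
    for (i = 0; i < n; i++)       /* emit groups in high-to-low order */
        buf[i] = tmp[n - 1 - i];
    return n;
}
```

The groups are gathered low-first and then emitted in reverse, which avoids needing to know the encoded length up front.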
Another issue is how to best handle encoding signed numbers.
Typically, a few common options here are:
- Sign-extending the most significant bit of the number;
- Folding the sign into the least significant bit.
Sign extension has the advantage that small positive signed and unsigned numbers look equivalent, but it requires the sign to be dealt with directly when encoding and decoding the numbers. This may in turn lead to a bit more logic where both signed and unsigned variations are needed (since the encoding and decoding logic may need to be repeated for each variation), or to less efficient encoding if all unsigned numbers are simply encoded as if they were signed (only 0-63 may fit in a single byte, regardless of whether the value represents a signed or unsigned quantity).
Sign folding has the advantage that it allows reusing the same number encoding/decoding logic for both signed and unsigned numbers by mapping all signed integers to unsigned equivalents. For example, we may map things such that the unsigned quantities map to signed quantities as:
0, -1, 1, -2, 2, -3, 3, ...
This may be done fairly straightforwardly:
//to encode:
uval=(sval>=0)?(sval<<1):(((-sval)<<1)-1);
//to decode:
sval=(uval&1)?(-((uval+1)>>1)):(uval>>1);
//or, expressed alternatively:
if(sval>=0) { uval=sval<<1; } else { uval=((-sval)<<1)-1; }
if(uval&1) { sval=-((uval+1)>>1); } else { sval=uval>>1; }
The downside is, granted, that signed and unsigned quantities naturally have different representations.
A few downsides of variable-length numbers are that they are typically slightly more expensive to encode and decode, and that a value generally needs to be known before it is encoded. They are less amenable to the strategy of emitting a placeholder and patching the value later, since this requires leaving space for the largest possible (or largest reasonable) value.
Packing more values into fewer bytes
This is basically using one or more of these techniques:
- Using different bits within a byte or number to represent different values;
- Assigning special meanings to various value-ranges, and switching to alternate representations based on these value ranges;
- Optionally including or omitting things based on flags or value-ranges.
A very simple and common example of this sort of thing is a "flags" field, where different bits are assigned meanings, and may be set or cleared, or potentially small multi-bit fields are encoded within a flags field.
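As a minimal sketch, a hypothetical flags byte might pack two single-bit flags along with a small 2-bit mode field (all names and bit assignments here are invented purely for illustration):

```c
#include <assert.h>

/* Sketch: a hypothetical flags byte with single-bit flags and
 * a small 2-bit mode field packed into bits 2..3. */
#define FLAG_COMPRESSED  0x01   /* bit 0 */
#define FLAG_HAS_NAME    0x02   /* bit 1 */
#define MODE_SHIFT       2
#define MODE_MASK        0x0C   /* bits 2..3: mode values 0..3 */

static unsigned char PackFlags(int compressed, int has_name, int mode)
{
    unsigned char f = 0;
    if (compressed) f |= FLAG_COMPRESSED;
    if (has_name)   f |= FLAG_HAS_NAME;
    f |= (mode << MODE_SHIFT) & MODE_MASK;
    return f;
}
```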
A simple example of using some of these would be encoding a sparse series of typically small integers as bytes:
High 4 bits gives the number of preceding zeroes;
Low 4 bits gives the numeric value.
So, we may encode the series of numbers 1,0,0,0,3,0,0,0,0,0,9,... as 0x01,0x33,0x59,...
However, it may not escape notice that the maximum run-length and value-range are very limited, and we may also want to be able to encode the end of a list.
The end-of-list case is easy: 0x00. We simply disallow using a run of zero zeroes to encode a zero value, freeing that byte to mark the end of the list.
The other case can be handled via reserving part of the value range for trap-encodings, for example: 0-14=value is encoded directly, 15=value is encoded as a variable-length number. Now, suddenly, we can moderately efficiently encode a sparse collection of arbitrary sized numbers, and without wasting space for long runs of zeroes.
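A decoder for this scheme might be sketched as follows (assuming the continuation-bit variable-length format from earlier for the escape case, and input from an in-memory buffer):

```c
#include <assert.h>

/* Sketch: decode the sparse-number scheme into 'out'.
 * High nibble = run of preceding zeroes, low nibble = value,
 * 0x00 = end of list, low nibble 15 = escape to a
 * continuation-bit variable-length value. */
static int DecodeSparse(const unsigned char *buf, unsigned int *out)
{
    int n = 0, i, p = 0;
    unsigned int b, val;
    while ((b = buf[p++]) != 0x00) {
        for (i = 0; i < (int)(b >> 4); i++)
            out[n++] = 0;                 /* emit the run of zeroes */
        val = b & 0x0F;
        if (val == 15) {                  /* escape: variable-length value */
            val = 0;
            do { b = buf[p++]; val = (val << 7) | (b & 0x7F); } while (b & 0x80);
        }
        out[n++] = val;
    }
    return n;                             /* number of values decoded */
}
```

Feeding it the example bytes from above (0x01, 0x33, 0x59, terminated by 0x00) should recover the series 1,0,0,0,3,0,0,0,0,0,9.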
Another possible variant is that, rather than the high bits giving a count of zeroes, they indicate how many times to repeat a given value. This slightly increases the cost of runs of zeroes (since zero is now an explicit value, and we also lose the nice EOL special case); however, it makes runs of most other values significantly cheaper.
For example: 1,1,1,1,1,1,1,2,2,2,2,3,3,3,3,3,0,0,4,63,63,63,...
Could be encoded as: 0x61,0x32,0x43,0x10,0x04,0x2F,0x3F,...
Tagged Data and Escape Coding
Tagged Data is basically data where some form of tagging-scheme is intermixed with the data. Commonly this is accomplished by designating one or more byte-values as "special", and not allowing them to appear in normal data, or if they are, it is typically via an escape coding.
For example, we could consider 0xFF to be special and not allow it within ordinary data (example: JPEG). Whenever an 0xFF would appear, we may instead replace it with, say, 0xFF followed by 0x00, with 0xFF followed by other values being used to indicate various markers or tags.
Often, these formats rely on the ability to, when in doubt, scan forwards to the next marker.
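A byte-stuffing encoder in this style (using the 0xFF -> 0xFF 0x00 replacement described above, and writing to an in-memory buffer) might be sketched as:

```c
#include <assert.h>

/* Sketch: JPEG-style byte stuffing; copies 'src' to 'dst',
 * replacing each 0xFF data byte with 0xFF 0x00.
 * Returns the output length ('dst' may need up to 2*len bytes). */
static int EscapeBytes(const unsigned char *src, int len, unsigned char *dst)
{
    int i, n = 0;
    for (i = 0; i < len; i++) {
        dst[n++] = src[i];
        if (src[i] == 0xFF)
            dst[n++] = 0x00;      /* stuffed byte: "this 0xFF was data" */
    }
    return n;
}
```

A scanner looking for markers can then treat any 0xFF followed by a nonzero byte as a marker, and 0xFF 0x00 as an escaped data byte.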
This escape-coding case comes at a cost though, namely in that it typically results in a slight expansion of the encoded data, and in some performance-sensitive applications, may incur additional costs related to detecting and handling the escape-byte. It may also potentially hinder the use of fixed-size structures or file offsets, or in some cases result in additional use of temporary buffers.
It is sometimes an option to relax this requirement though, such as relying on heuristic measures to identify markers (example: ZIP, BZ2). In these cases, usually the marker will be longer, and may include check-values in order to help reduce the chances of a false positive.
These types of formats may also allow for optional extension markers, where if such a marker is encountered and is not recognized, the marker's data will be skipped over.
TLV Formats
TLV formats are another common strategy, where TLV stands for Tag Length Value. The Tag is a value that indicates what type of data is contained, the Length tells how big it is, and the Value is all the data for the tag. This construction is often referred to as a Lump or Chunk.
TLV formats often have an advantage in that they can be fairly readily extended without breaking existing implementations, and can easily hold a wide range of data. Very often, TLV formats are used for building the packaging for another file-format.
Depending on the format, unknown lumps may either be ignored or may potentially result in the data being rejected, and some formats will indicate whether or not the decoder should reject unknown lumps on a lump-by-lump basis.
A common example of a TLV format is RIFF, which is used to implement formats such as WAV and AVI (among others).
PNG and MKV are examples of formats which make use of TLV for their packaging (MKV being notable in that it uses variable-length encodings for both tags and lengths).
It is also possible to use both escape-coded markers and TLV, such as to allow for things like resynchronization.
For example, RIFF has a structure like:
tag:FOURCC, length:DWORD, value:BYTE[(length+1)&(~1)]
Where, some chunks (namely: 'RIFF' and 'LIST') may be nested, containing another FOURCC, followed by any number of other chunks. Most other chunks will simply contain data.
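As a sketch, walking RIFF chunk headers (assuming the file is fully in memory, and reading the DWORD length as little-endian, as RIFF specifies) might look like:

```c
#include <assert.h>

/* Sketch: read a RIFF chunk header at 'pos' (FOURCC tag plus
 * little-endian DWORD length); returns the position of the next
 * chunk, with the payload padded to an even size as RIFF requires. */
static int NextRiffChunk(const unsigned char *buf, int pos,
                         unsigned int *tag, unsigned int *len)
{
    *tag = buf[pos] | (buf[pos+1] << 8) |
           (buf[pos+2] << 16) | ((unsigned int)buf[pos+3] << 24);
    *len = buf[pos+4] | (buf[pos+5] << 8) |
           (buf[pos+6] << 16) | ((unsigned int)buf[pos+7] << 24);
    return pos + 8 + (int)((*len + 1) & ~1u);   /* pad to even size */
}
```

Repeatedly calling this (and recursing into 'RIFF'/'LIST' chunks) is enough to enumerate the contents of a simple RIFF file.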
Another possible TLV format is:
0xFF, 0xEC, length:WORD, tag:FOURCC, value:BYTE[length-6]
Where FF EC is a magic escape-sequence, and the data uses an escape-coding scheme.
We might use an escape-case for the length, say:
0xFF, 0xEC, 0:WORD, tag:FOURCC, length:DWORD, value:BYTE[length-10]
Or, alternately, allow tag to be 0, at which point the lumps will be limited to 64k, but are appended together to encode a larger payload.
A few other possible variants:
length:WORD, tag:TWOCC, value:BYTE[length-4]
Where chunk headers are a little more compact.
tag:VLI, length:VLI, value:BYTE[length]
Where tag and length are variable-length. This allows very compact headers in some cases, at the cost that they are more difficult to work with.
Random Access to Data
Generally, random access to data is achieved via the use of some sort of index structure, which may give the offset of each item within the file.
For example, in a video container format, we might encode the offset of every frame in the video file (such as AVI), allowing a video player to easily jump to any position in the file as is desired.
Something like a region-file in a voxel-engine might instead have an index table holding the offset of every chunk in the region, allowing those parts of the world to be decoded as they come into view.
In other cases, we may rely on the size of the data to calculate where to look. An example is in an array of fixed-size structures, where one can easily calculate where each member is, making an explicit index unnecessary.
Another example is a CBR format. In this case, we may have a format, such as for an audio or video file, and may calculate the offset within the file based on its time-value. Like with the array, if every frame (or group of frames) is encoded within a fixed-size box, then we can easily calculate where the frame is at based on the desired time-value and a known reference point.
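The seek calculation for such a CBR layout is just arithmetic; a minimal sketch (where the header size, frame size, and frame duration are all invented constants, not taken from any real format):

```c
#include <assert.h>

/* Sketch: seek offset in a hypothetical CBR stream where every
 * frame occupies FRAME_SIZE bytes and covers FRAME_MS milliseconds,
 * with frame data starting at DATA_OFS (all values are assumptions). */
#define DATA_OFS   64      /* header bytes before the first frame */
#define FRAME_SIZE 512     /* bytes per frame */
#define FRAME_MS   20      /* milliseconds covered by each frame */

static long OffsetForTime(long time_ms)
{
    long frame = time_ms / FRAME_MS;           /* which frame holds this time */
    return DATA_OFS + frame * (long)FRAME_SIZE;
}
```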
In either case, to allow for such random access, it is important that the data can be decoded without reliance on preceding data. This may come at a (typically minor) cost to the compression efficiency, since data within the preceding chunk can't be used by the chunk in question, and as a result will need to be repeated.
Packages with Named Members
Sometimes, it may also make sense for a format to use names or keys to refer to data members.
Historical examples of these include the Doom WAD and Quake WAD2 formats, which used a flat directory of lumps with 8 or 16 character names.
A related use-case is the Quake PACK format, which was structurally similar to a WAD file, with the primary practical difference of having a much larger name field (56 characters), and storing a file-path (rather than a simple lump name).
Some formats also use a ZIP file to package up their contents, which has the advantage of being a fairly standardized container which supports both named members and Deflate. Examples: ODT, DOCX, and ORA (OpenRaster).
The advantage is that these allow dynamically referring to members via names or paths, which can be useful when storing data that is accessed by a name or path.
Sometimes it may also make sense to combine the use of named members with a TLV format, where certain lumps or chunks may be named, rather than by using an explicit directory. This may make sense when the number of lumps is small and the data is all closely related.
Data Compression via Deflate
This has been moved mostly to a new draft.
One simple and common option I will mention here is that of using something like Deflate (for example, via the common zlib library) to compress the data. This can significantly reduce the size for many types of data, and the data can be decoded reasonably quickly.
Many file formats (including, for example, PNG) have used this strategy with good success.
Conclusion
Designing a file format can be useful, but like anything, there are pitfalls involved: the time and effort required, or ending up with a format that no other software can read or write. So, it may make sense to balance these sorts of things against the needs of the project.
Most of these topics can be found in more detail elsewhere, such as on Wikipedia.
For those who want to know in-depth how various file formats work, it might be interesting to find and read the various file-format specifications.
Also note that file-format specifications may often be a source of nifty tips or tricks, or possible design ideas, ...
While this is far from a complete (or even adequate) coverage of the topic, maybe it is enough to hopefully at least be informative.
Article Update Log
27 Aug 2014: Writing Article
28 Aug 2014: Still writing Article
11 Sep 2014: Split article, data compression moved to new draft.