
Compression

Started by April 19, 2000 09:35 AM
5 comments, last by MadKeithV 24 years, 8 months ago
In an attempt to move the discussions about compression out of the way-too-bloated and closed MP3 thread, I'm posting a new one. Anything on compression is good in here, as long as it is thought through, or an on-topic question.

My first question is this: isn't Huffman encoding generally the best way to compress data losslessly, if you find a way of reorganising the data so the patterns are consecutive? I believe it has been mathematically proven that Huffman encoding generates the shortest file possible if you only look at consecutive patterns. I saw the proof in college - it reduces bit patterns in size by indexing them. The standard approach would fail for a type of file, a .wav file perhaps, where one part of a byte/word changes with a higher frequency than the rest. With a reconstructable reordering, Huffman also works very well on these files.

But - the generalisation my brain wants to make, and one I'm not sure is correct - does that mean that, given an optimal reconstructable reordering of the data, Huffman will always be the best post-reordering choice for compression?

#pragma DWIM // Do What I Mean!
~ Mad Keith ~
**I use Software Mode**
It's only funny 'till someone gets hurt. And then it's just hilarious. Unless it's you.
Hmmmm, from your question, it doesn't really sound like Huffman coding that you're describing. Huffman doesn't care where things are in relation to each other in the file; it only deals with the relative frequencies of each "character", where you pick how many bits make a "character". So, the standard technique is to use 8-bit bytes. First, you scan the entire file and count how often every byte appears. Then, you use variable-length bit sequences to represent the bytes, with shorter sequences for more common bytes and longer sequences for less common bytes.
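Something like this, as a rough sketch (C++-style; names like buildHuffmanCodes are just mine for illustration, it skips error handling and the degenerate one-symbol case, and it leaks the tree):

#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// Sketch only: count byte frequencies, build the Huffman tree by
// repeatedly merging the two least frequent nodes, then read the
// variable-length codes off the tree.
struct Node {
    long freq;
    int symbol;      // -1 marks an internal node
    Node* left;
    Node* right;
};

struct ByFreq {
    bool operator()(const Node* a, const Node* b) const { return a->freq > b->freq; }
};

// Walk the tree: going left appends a '0', going right appends a '1'.
static void assignCodes(const Node* n, const std::string& prefix, std::string codes[256]) {
    if (!n->left && !n->right) { codes[n->symbol] = prefix; return; }
    assignCodes(n->left,  prefix + "0", codes);
    assignCodes(n->right, prefix + "1", codes);
}

void buildHuffmanCodes(const unsigned char* data, size_t len, std::string codes[256]) {
    long freq[256] = {0};
    for (size_t i = 0; i < len; ++i) ++freq[data[i]];          // pass 1: count every byte

    std::priority_queue<Node*, std::vector<Node*>, ByFreq> pq;
    for (int s = 0; s < 256; ++s)
        if (freq[s] > 0) pq.push(new Node{freq[s], s, 0, 0});

    while (pq.size() > 1) {                                    // merge until one tree remains
        Node* a = pq.top(); pq.pop();
        Node* b = pq.top(); pq.pop();
        pq.push(new Node{a->freq + b->freq, -1, a, b});
    }
    if (!pq.empty()) assignCodes(pq.top(), "", codes);
}

The second pass over the file then just replaces each byte with its code string and packs the bits.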

-Brian
That's what I meant by "consecutive" - it detects patterns per byte, i.e. 8 consecutive bits. But it is my impression that any pattern, even ones that don't line up with consecutive bits, can be reorganised so the relevant bits are consecutive after all - for instance by ordering the file with all the high-order bits first, then the second bits, then the third, and so on.
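Roughly what I have in mind, as a quick sketch (the name reorderBitPlanes and the whole thing are just my illustration, not tested code):

#include <cstddef>
#include <vector>

// Reversible reordering sketch: emit bit plane 7 of every byte first,
// then plane 6, and so on down to plane 0. The output contains exactly
// the same bits, just regrouped, so the original file can be rebuilt by
// running the loops in reverse.
std::vector<unsigned char> reorderBitPlanes(const std::vector<unsigned char>& in) {
    std::vector<unsigned char> out(in.size(), 0);  // same number of bits, so same number of bytes
    size_t outBit = 0;
    for (int plane = 7; plane >= 0; --plane) {     // high-order bits first
        for (size_t i = 0; i < in.size(); ++i) {
            int bit = (in[i] >> plane) & 1;
            out[outBit / 8] |= (unsigned char)(bit << (7 - (outBit % 8)));
            ++outBit;
        }
    }
    return out;
}

If the high bits change slowly from sample to sample, the first planes come out as long runs of nearly identical bytes, which is exactly the kind of frequency skew a byte-oriented Huffman pass can exploit.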

I think it's the reorganisation power that will determine the quality of your compression strategy, as you cannot improve on Huffman encoding after that?


#pragma DWIM // Do What I Mean!
~ Mad Keith ~
**I use Software Mode**
It's only funny 'till someone gets hurt. And then it's just hilarious. Unless it's you.
Yes. Huffman is optimal among lossless codes that assign one fixed bit pattern per symbol (prefix codes) for a given set of symbol frequencies, and this is mathematically proven. Did the proof in college.

Programmers generally write Huffman using a symbol set of 256 symbols (i.e. 1-byte words). However, the classic way of presenting Huffman is to use a 26-symbol set (i.e. the alphabet, all one case, either upper or lower). The resultant file is a bit stream, as one would expect. 'E', being the most popular letter, gets something like 3 bits: 101. 'Z', being hated, gets something like 7 bits: 0110101. The bit patterns for the symbol set can be stored in a binary tree (0 to the left, 1 to the right) in which each leaf is a symbol. The leaves are not all at the same depth, which is what handles the different bit lengths.

The tree is built from the actual frequency of the symbols in the file (this is the OPTIMAL compression strategy), or you can use a tree built from fixed statistics (occurrence in the English language, counted from a set of documents). ZIP files use LZ plus Huffman (DEFLATE, I think), which can use either a preconfigured encoding table based on a lot of previous data analysis or a table built from the data actually being compressed. (THAT LAST SENTENCE COULD BE COMPLETELY WRONG. MEMORY FAILURE IMMINENT. I AM 34 AFTER ALL.)

Two example compressions are:
EZ - 1010110101
ZE - 0110101101

They can be decompressed properly by starting from the beginning and only by starting from the beginning.

Decompression is a trace of the tree, directed by the bit stream of the file, and it produces a decoded symbol each time a leaf is reached. Once a symbol is decoded, return to the root node of the tree and continue tracing.
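In rough C++ terms, the trace looks something like this (just a sketch with my own names, assuming a node layout where -1 marks internal nodes):

#include <cstddef>
#include <vector>

struct HuffNode {
    int symbol;          // -1 for internal nodes
    HuffNode* left;      // followed on a 0 bit
    HuffNode* right;     // followed on a 1 bit
};

// Follow one bit at a time, emit the symbol at each leaf, then jump
// back to the root and keep going.
std::vector<unsigned char> huffmanDecode(const HuffNode* root,
                                         const std::vector<bool>& bits,
                                         size_t symbolCount) {   // stop after this many symbols,
    std::vector<unsigned char> out;                              // since the bit stream may be padded
    const HuffNode* n = root;
    for (size_t i = 0; i < bits.size() && out.size() < symbolCount; ++i) {
        n = bits[i] ? n->right : n->left;          // 0 = left, 1 = right, as described above
        if (!n->left && !n->right) {               // reached a leaf: one symbol decoded
            out.push_back((unsigned char)n->symbol);
            n = root;                              // return to the root and continue tracing
        }
    }
    return out;
}

On the 'EZ' example above, the first three bits 101 land on the 'E' leaf, the walk restarts at the root, and the remaining 0110101 lands on 'Z'.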

Damn cool algorithm if you ask me.

Mike Roberts
aka milo
mlbobs@telocity.com
NOTE: for files containing runs such as 0 1 2 3 0 1 2 3 0 1 2 3 ..., Huffman can't do much (at least in the file's given state): all four values are equally frequent, so the best it can manage is a fixed 2 bits per symbol, and it cannot exploit the repetition of the sequence itself. This is where LZ methods are strong.
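To make the contrast concrete, here is a toy sketch of the LZ match search (my own illustration with made-up names, not the actual ZIP/DEFLATE code):

#include <cstddef>

// At position 'pos', search back up to 'window' bytes for the longest
// match with the bytes that follow. A run like 0 1 2 3 0 1 2 3 ... then
// collapses into the literals 0 1 2 3 followed by back-references of
// the form (distance = 4, length = ...).
void longestMatch(const unsigned char* data, size_t len, size_t pos,
                  size_t window, size_t* bestDist, size_t* bestLen) {
    *bestDist = 0;
    *bestLen  = 0;
    size_t start = (pos > window) ? pos - window : 0;
    for (size_t cand = start; cand < pos; ++cand) {
        size_t l = 0;
        while (pos + l < len && data[cand + l] == data[pos + l]) ++l;   // extend the match
        if (l > *bestLen) { *bestLen = l; *bestDist = pos - cand; }
    }
}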
Yes, Anonymous poster - that's exactly it.
That is a longer pattern that could be encoded if the algorithm moved from standard 1-byte Huffman symbols to (counting the characters) 4-byte symbols instead.

However, then I could be more annoying, and produce a pattern like this:
1 2 3 / 1 2 3 / 4 5 6 7 / 4 5 6 7
Now there are two different pattern lengths...
And it doesn't take a genius to complicate the picture a lot more.

In principle, I don't think Huffman has a problem with that - it just links codes to patterns, whatever their length may be. But the pattern recognition is the really hard part, right?
I think fractal compression goes into that more deeply; does anyone know anything about it?





#pragma DWIM // Do What I Mean!
~ Mad Keith ~
**I use Software Mode**
It's only funny 'till someone gets hurt. And then it's just hilarious. Unless it's you.

This topic is closed to new replies.
