Huffman encoding
I've been reading an article recently that describes Huffman and arithmetic encoding. What I don't get is how you store the data as a serial stream and then know how to deserialize it afterwards.
For example, I realise that in a sequence a certain token (string etc.) may appear more often than other tokens. From that you can use fewer bits for the more frequently used tokens.
However, once you pack them all together, how do you know which tokens are which? Does anyone know of any REALLY simple source code I can read that shows how this is accomplished?
The reason I'm asking is that a few months ago I was a typical newbie who thought that applying a zip-like compression algorithm to my network stream would be cool.
Today I think that some fields would benefit from a hand-crafted compression solution. Each end would use dynamic frequencies to stay in synchronisation. But the basic data unpacking still eludes me...
gimp-o
Chris Brodie
There's a "dictionary" of the Huffman code before the 'serial stream'. That is what makes the stream decodable.
A good C++ algorithms book should explain it pretty well. Sedgewick's Algorithms in C/C++ comes to mind.
On the web, do a search for "+Huffman +Source".
I found this page right away; with source code:
http://kobe.cool.ne.jp/leona/code.htm
Hope that helps
// CHRIS [win32mfc]
Huffman encoding is normally performed by counting the occurrences of each character in the input stream. After the counting is done you have a table of characters and their frequencies in the stream. You then sort the table by frequency. Once you have a sorted table, you build a tree from the two characters with the lowest frequencies. You assign the sum of those two frequencies to the root of the new tree, place it back in the table, and re-sort. You repeat this process until all of the characters are in a single tree. You then walk the tree to each leaf and build an encoding table: start at the root, and every time you travel down the left side of a node you assign a zero, and every time you travel down the right side you assign a one, until you reach a leaf. The character assigned to that leaf is now encoded with the bit pattern created by walking the tree to its position. Characters near the top of the tree have a greater frequency than characters farther down, so they are encoded with fewer bits.
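To make that concrete, here is a rough C++ sketch of the build step just described. The names (Node, buildTree, buildCodes) are made up for this sketch, a priority queue stands in for the repeated re-sorting of the table, and nodes are leaked for brevity:

#include <cstdio>
#include <map>
#include <queue>
#include <string>
#include <vector>

struct Node {
    long freq;
    int  symbol;                       // -1 for internal nodes
    Node *left = nullptr, *right = nullptr;
};

struct ByFreq {                        // priority_queue is a max-heap, so invert the compare
    bool operator()(const Node *a, const Node *b) const { return a->freq > b->freq; }
};

Node *buildTree(const std::map<int, long> &freq) {
    std::priority_queue<Node *, std::vector<Node *>, ByFreq> pq;
    for (auto &f : freq) pq.push(new Node{f.second, f.first});
    while (pq.size() > 1) {            // merge the two least frequent nodes
        Node *a = pq.top(); pq.pop();
        Node *b = pq.top(); pq.pop();
        pq.push(new Node{a->freq + b->freq, -1, a, b});
    }
    return pq.top();
}

void buildCodes(const Node *n, const std::string &prefix,
                std::map<int, std::string> &codes) {
    if (!n->left && !n->right) { codes[n->symbol] = prefix; return; }
    buildCodes(n->left,  prefix + "0", codes);   // left edge  = 0
    buildCodes(n->right, prefix + "1", codes);   // right edge = 1
}

int main() {
    std::string text = "this is an example of huffman encoding";
    std::map<int, long> freq;
    for (unsigned char c : text) ++freq[c];      // count occurrences

    std::map<int, std::string> codes;
    buildCodes(buildTree(freq), "", codes);
    for (auto &c : codes)
        std::printf("'%c' -> %s\n", c.first, c.second.c_str());
}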
In order to decode the resulting compressed bit stream you have to include the encoding table with the compressed stream. Using the encoding table you can rebuild the tree, and then use the bit stream to walk the tree, starting at the root and going left if the bit is zero and right if the bit is one. When you reach a leaf, you output the character associated with that leaf and walk the tree again starting at the root.
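Continuing the sketch above, decoding is exactly that tree walk. Here the "bit stream" is just a string of '0'/'1' characters to keep it readable; a real coder would pack the bits into bytes:

// Walk the tree from the root for each bit; emit a symbol at every leaf.
std::string decode(const Node *root, const std::string &bits) {
    std::string out;
    const Node *n = root;
    for (char b : bits) {
        n = (b == '0') ? n->left : n->right;   // 0 = left, 1 = right
        if (!n->left && !n->right) {           // reached a leaf
            out.push_back(static_cast<char>(n->symbol));
            n = root;                          // restart at the root
        }
    }
    return out;
}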
Huffman compression is great for plain text and can give you about 40% to 50% compression. The fewer unique characters in the input stream, and the less evenly distributed their frequencies, the more compression you get. However, as the character frequencies become more evenly distributed and the number of unique characters grows, the algorithm becomes less efficient.
If your goal is to compress small amounts of binary data for transmission over a network, you might want to look at other methods of compression. Huffman is very bad at encoding binary data, which tends to have both fairly evenly distributed character frequencies and a large number of unique characters.
A good visual look at Huffman encoding is available at http://swww.ee.uwa.edu.au/~plsd210/ds/huffman.html. There is a Java applet on this page that will visually show you how Huffman encoding works.
Hope this helps and is not too confusing. (I tend to explain things poorly, so bear with me.)
Ahhhh... I think I'm starting to get it now. I was planning on using a dynamic method where the frequency tables are generated at runtime at both ends of the transmission. That way I'll never have to transmit the frequency table.
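Roughly the shape I have in mind, as a sketch (reusing the hypothetical buildTree/buildCodes helpers from the sketch above; proper adaptive Huffman coders such as FGK or Vitter update the tree incrementally rather than rebuilding it each time):

struct AdaptiveModel {
    std::map<int, long> freq;

    AdaptiveModel() {
        for (int c = 0; c < 256; ++c) freq[c] = 1;   // every byte starts at count 1
    }
    // Encode one byte with the current tree, then update the count.
    // The receiver mirrors this: decode with its identical tree, then bump
    // the same count, so both ends stay in sync without sending the table.
    std::string encodeSymbol(unsigned char c) {
        std::map<int, std::string> codes;
        buildCodes(buildTree(freq), "", codes);
        std::string bits = codes[c];
        ++freq[c];                                   // update AFTER coding
        return bits;
    }
};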
What would be a better choice for compressing binary data?
Thanks
gimp
Chris Brodie
It depends on how much data you want to compress and what type of data it is.
If you are sending large amounts of binary data, LHA is a good choice, and I don't think it is encumbered by any patents like LZW (used in the GIF image format). You can also find LHA source code on the net.
If you need to compress smaller data structures being transmitted across a network, your best choice is probably compression by exclusion. Design your data structures in such a way that you only need to send the changes to the structure across the network. Eliminate any fields in your structure that have not changed since the last time the structure was sent. Why send data that the client already has? A rough sketch of the idea follows.
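As a very rough sketch of compression by exclusion (the struct and field names here are invented for illustration): send a one-byte mask saying which fields changed since the last update, followed only by those fields.

#include <cstddef>
#include <cstdint>
#include <vector>

struct PlayerState {
    float    x = 0, y = 0;
    uint16_t health = 100;
};

// Append only the fields of `now` that differ from `prev`, preceded by a flag byte.
void packDelta(const PlayerState &prev, const PlayerState &now,
               std::vector<uint8_t> &out) {
    uint8_t mask = 0;
    if (now.x != prev.x)           mask |= 1;
    if (now.y != prev.y)           mask |= 2;
    if (now.health != prev.health) mask |= 4;

    out.push_back(mask);                                 // 1 byte of change flags
    auto put = [&](const void *p, std::size_t n) {       // raw byte copy of a field
        const uint8_t *b = static_cast<const uint8_t *>(p);
        out.insert(out.end(), b, b + n);                 // assumes same layout/endianness at both ends
    };
    if (mask & 1) put(&now.x,      sizeof now.x);
    if (mask & 2) put(&now.y,      sizeof now.y);
    if (mask & 4) put(&now.health, sizeof now.health);
}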