
Hash tables - how do they work?

Started by September 18, 2000 07:43 PM
6 comments, last by Houdini 24 years, 4 months ago
I was looking at some code on the net, and I saw someone using hash tables, but I had no clue what they were. After doing some research I was able to find out that they make retrieving data from large linked lists very quick, but I don't know how. Does anyone have an idea of how they work?
- Houdini
I'm so glad you asked that!

I am currently working on a server that uses a large hash table. I have a list of users, each with a username (just a string) and a unique identifier. The actual table is a flat array of user structures indexed by the unique identifier, which is simply the position of that user's data in the array. Obviously, if I know the UID of the person I wish to retrieve data about, then it is very fast; however, if all I have is a username, then ordinarily I would have to search the list to find that person's details. A hash table sorts this problem out.
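To make it concrete, here's a rough sketch of the two kinds of lookup (made-up names, not my actual server code):

// Rough sketch, not my real server code - just the two kinds of lookup.
#include <string>
#include <vector>

struct User {
    std::string name;          // the username
    // ... whatever other per-user data you keep ...
};

std::vector<User> users;       // flat array; a user's UID is simply their index

// Fast: known UID, just index the array.
User& byUID(size_t uid) { return users[uid]; }

// Slow without a hash table: linear search over every user by name.
int byName(const std::string& name) {
    for (size_t i = 0; i < users.size(); ++i)
        if (users[i].name == name) return (int)i;   // return the UID
    return -1;                                      // not found
}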

A hash table is like a multi-dimensional linked list, where each node in the first list references the start of another list. Lists organised like this are called buckets, and a bucket may have any number of sub-buckets, and so on. To use the table you have what is called a 'hashing algorithm'. For example, the hash table that I use adopts an algorithm that, based on the characters in the username, returns a reference to a particular bucket.

If your user names are alphanumeric, i.e. each character is one of 36 possibilities (ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789, ignoring lower case), then you would have 36 lists to start off with. Each separate list could then contain all the users with a user name starting with the character assigned to that particular bucket, effectively partitioning the data 36 ways and reducing any search time by a factor of 36:

Hash table:
Bucket 1: All users with the first character == 'A'
Bucket 2: All users with the first character == 'B'
Bucket 3: All users with the first character == 'C'
...

So, as you can see, you have a list of buckets. Then, if you take it to the next level, you can make every bucket in turn reference another 36 buckets each, and be able to get data organised by the first and second characters of the user name - which reduces search time by a factor of 36 * 36 (roughly, assuming names are spread evenly), which is quite a lot faster than a linear search, as you can imagine. You can nest buckets as far as RAM will permit, and each time the search cost drops by a factor of the product of the number of buckets at each depth. Nice!
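If it helps, here's a quick sketch of the one-level version of that (untested, names made up):

// One-level bucketing by first character (a sketch, assumes alphanumeric names).
#include <cctype>
#include <list>
#include <string>

std::list<int> buckets[36];    // one list of UIDs per leading character

// Map the first character of a name to one of the 36 buckets.
int bucketFor(const std::string& name) {
    char c = (char)std::toupper((unsigned char)name[0]);
    if (c >= 'A' && c <= 'Z') return c - 'A';   // letters -> 0..25
    return 26 + (c - '0');                      // digits  -> 26..35
}

void addUser(const std::string& name, int uid) {
    buckets[bucketFor(name)].push_back(uid);
}
// A lookup only has to walk the one bucket the name maps to,
// instead of the whole user list.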

Ok, that was fairly intense, I think I'll go lie down now

Oh, and btw, the first time I asked that question, I was utterly confused as well!
Shit, sorry, that was me above - forgot to log in (DUH!)
I'm in a pedantic mood today, so I will just point out that what you are describing is actually called a radix tree. In a hash table you usually try to have every bucket hold fewer than some number n of collisions (usually around 16). If you get more than that, you have to rehash and copy everything over to a new hash table with more buckets. Radix trees are often simpler and have reasonable space requirements. There is just a lot of prejudice in favour of either hash tables or balanced trees. You still see a lot of radix trees used for things like string searching, partial matching and so forth. These are usually slight variations on a suffix tree (almost the same thing as what you are doing).
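For what it's worth, the rehash-and-grow idea looks roughly like this (a sketch; the threshold, growth factor and hash are arbitrary choices of mine):

// Separate chaining with rehash-on-overflow (sketch, arbitrary constants).
#include <list>
#include <string>
#include <vector>

const size_t MAX_CHAIN = 16;   // the "fewer than n collisions" rule of thumb

size_t hashString(const std::string& s, size_t tableSize) {
    size_t h = 0;
    for (size_t i = 0; i < s.size(); ++i)
        h = h * 31 + (unsigned char)s[i];
    return h % tableSize;
}

void insert(std::vector< std::list<std::string> >& table, const std::string& key) {
    std::list<std::string>& chain = table[hashString(key, table.size())];
    chain.push_back(key);
    if (chain.size() <= MAX_CHAIN) return;

    // A bucket got too crowded: rehash everything into a table twice as big.
    std::vector< std::list<std::string> > bigger(table.size() * 2);
    for (size_t b = 0; b < table.size(); ++b)
        for (std::list<std::string>::iterator it = table[b].begin(); it != table[b].end(); ++it)
            bigger[hashString(*it, bigger.size())].push_back(*it);
    table.swap(bigger);
}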
Actually, IIRC 'hash tables' just refer to tables that are referenced by a hash function; the process of resizing your tables/rehashing isn't necessarily a part of it.

So what's a hash function? Essentially, a mapping (many-to-one) from some more complex type (a string, a vector, a structure, etc.) onto (generally) the range [0..n] (for some n). The bucketing described by the anonymous poster (at least, the first level of it) really _is_ a hash function of sorts; it hashes strings by the algorithm 'take the first letter and convert it to a number 0..35'. This definitely works for hashing strings, but it's generally a poor choice (especially for something like user names) because you won't get even buckets. There are going to be a lot more users whose names start with 'S', for instance, than with 'V'. There are a lot of more effective string hashes available; one easy one is 'sum the ASCII codes of every character in the string and reduce mod n', where n is the table size you want -- note that this one only works well for values of n less than a few hundred or so. Still, it should be better than the simple bucketing mentioned.
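In code, that 'sum the characters' hash is just something like this (a sketch):

// 'Sum the ASCII codes and reduce mod n' - simple, and fine for small tables.
#include <string>

unsigned int hashSum(const std::string& s, unsigned int n) {
    unsigned int sum = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i)
        sum += (unsigned char)s[i];    // add each character's code
    return sum % n;                    // bucket index in [0, n)
}
// For larger tables you want a hash that also mixes in character positions
// (e.g. multiply the running value by a constant each step).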

If you want more information on hashing -- including some formal analysis -- check Knuth's _The Art of Computer Programming_ or any good algorithms text; you should find something there.
I say you go to sparknotes.com and read the nice tutorial they have there... I had never seen this hash table thing in 12 years of computing, and it's damn easy.
-----------------------------Sancte Isidore ora pro nobis !
Hey, thanks a lot guys. I think I understand it pretty well now. The only thing I'm trying to figure out is how I could create, say, a template class for hashing that would do everything behind the scenes to make them much easier to use for everyday stuff. Maybe I'd just have to point the template to a hashing callback function for each instance type I create...

Thanks again!

Edited by - Houdini on September 19, 2000 8:24:04 AM
- Houdini
Have a look at hash_map; you can find a good implementation in STLport.

Basically, what you will do is pass a (possibly templated) function object as a template parameter. You can either require that your hash function is uniform over size_t, or you can have a hash that takes two parameters (key, range) and hashes uniformly across that range. There is very rarely a need for a callback, unless you want to change the hash function at runtime.
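Something like this, for instance (rough and untested; loosely in the spirit of hash_map's interface, not its actual code):

// Sketch: the hash is a function object passed as a template parameter.
#include <list>
#include <string>
#include <utility>
#include <vector>

struct StringHash {
    size_t operator()(const std::string& s) const {
        size_t h = 0;
        for (size_t i = 0; i < s.size(); ++i)
            h = h * 31 + (unsigned char)s[i];
        return h;                      // uniform over size_t
    }
};

template <class Key, class Value, class Hash = StringHash>
class HashTable {
    typedef std::pair<Key, Value> Entry;
    std::vector< std::list<Entry> > buckets;
    Hash hasher;                       // baked in at compile time, no callback needed
public:
    explicit HashTable(size_t n = 257) : buckets(n) {}

    void insert(const Key& k, const Value& v) {
        buckets[hasher(k) % buckets.size()].push_back(Entry(k, v));
    }

    Value* find(const Key& k) {
        std::list<Entry>& chain = buckets[hasher(k) % buckets.size()];
        for (typename std::list<Entry>::iterator it = chain.begin(); it != chain.end(); ++it)
            if (it->first == k) return &it->second;
        return 0;                      // not found
    }
};

// Usage: HashTable<std::string, int> uids;  uids.insert("Houdini", 42);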

