
Duplicates in Hash Table

Started by CoiN, March 27, 2001 05:03 AM
3 comments, last by CoiN
Lo All, Can anyone think of a solution to this problem? I've loaded words from a text file into a hash table and it all works fine, except that I need to get rid of any duplicate words in the file. I've been trying all morning and my head is hurting too much to find a solution. The code which loads the words into the hash table is shown below:
  	
        for(count = 0; count < number_of_words; count++)
	{
		inner_count = 0;
		
		// Get Word from File

		fscanf(fp, "%s", word_buffer);

		// Get table position by checking for any collisions

		while (hash_table[(int)word_buffer[0] + inner_count].used_flag == 'y')
			inner_count++;

		// Copy word into table and set used_flag

		strcpy(hash_table[(int)word_buffer[0] + inner_count].word, word_buffer);
		hash_table[(int)word_buffer[0] + inner_count].used_flag = 'y';
	}
	
  
I wanna try and get rid of the duplicates somewhere in that table. Anyone got any ideas? Thanks in advance... CoiN
Well, first of all you should find yourself a better hash function. Here's some pseudo-code for insertion into a closed hash table, using linear probing and not inserting duplicates:
    void hash_table::insert(key, data)
    {
        int pos = hash_func(key);
        int tmp = pos;

        // Probe forward until we hit an empty slot or the same key
        while (hash_t[tmp].is_occupied && hash_t[tmp].key != key)
        {
            tmp = (tmp + 1) % hash_size;
            if (tmp == pos) // so have we wrapped around the whole table?
            {
                resize_hash_table(); // we need a bigger hash table
                // (resizing rehashes everything, so restart the probe
                // from the key's new home position)
                tmp = pos = hash_func(key);
            }
        }

        // Only fill the slot if it's empty; if the probe stopped on a
        // matching key, the word is a duplicate and we skip it
        if (!hash_t[tmp].is_occupied)
        {
            hash_t[tmp].data = data;
            hash_t[tmp].key = key;
            hash_t[tmp].is_occupied = true;
        }
    }

A warning, this code comes right out of my a** so it probably won't work, but the idea is what's important.

Edited by - amag on March 27, 2001 12:37:34 PM
Perhaps I'm not thinking too clearly this morning, but I have a hard time seeing that as a hash table. You should calculate a hash value off of the key and use that hash value as the subscript into the hash table. Since many values can hash to the same location, the hash table normally just holds a pointer to a list. You usually keep that list ordered to speed the search for a specific entry. So duplicates would hash to the same list, and the search for where to place the new entry in that list would identify the duplicate. As an example, if the expected number of hash collisions was very low you might use a linked list: search the linked list for an entry that is greater than or equal to the entry being added, and when you stop searching, check whether it is equal, which tells you whether it is a duplicate or not. A sketch of that idea follows.
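Here is a minimal sketch of that chained approach in C, keeping each list sorted so a duplicate turns up exactly at its insertion point. The struct and names are just illustrative, not from CoiN's code, and the hash is assumed to be computed elsewhere:

    #include <stdlib.h>
    #include <string.h>

    #define TABLE_SIZE 101

    typedef struct node {
        char word[64];
        struct node *next;
    } node;

    node *table[TABLE_SIZE]; /* each slot heads a sorted linked list */

    /* returns 1 if the word was inserted, 0 if it was a duplicate */
    int insert_word(const char *word, unsigned hash)
    {
        node **link = &table[hash % TABLE_SIZE];
        int cmp = -1;

        /* walk the sorted list until we pass the insertion point */
        while (*link && (cmp = strcmp((*link)->word, word)) < 0)
            link = &(*link)->next;

        if (*link && cmp == 0)
            return 0; /* already in the table -- a duplicate */

        node *n = malloc(sizeof *n);
        strncpy(n->word, word, sizeof n->word - 1);
        n->word[sizeof n->word - 1] = '\0';
        n->next = *link; /* splice in, keeping the list sorted */
        *link = n;
        return 1;
    }

A word is a duplicate exactly when insert_word returns 0, so the loading loop can simply skip those.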

Coming up with a good hashing algorithm can be quite challenging. Ideally you want each hash table slot to have a list whose length is the total number of entries divided by the number of slots. Many ways of hashing, such as taking the first letter of a string, skew the distribution heavily. A hash table with 26 slots generally does just as well as one with 52 or 256 if the first letter is usually an alphabetic character; the larger tables just waste space without reducing the search time. Under ideal circumstances, doubling the size of a hash table cuts the search time in half because it cuts the length of each list in half. A simple hashing algorithm is adding up the characters in a string and then doing a modular divide by the number of slots in the hash table. That doesn't distribute great, but it at least does better than the first character at spreading words over an arbitrary number of slots. A quick sketch is below.
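As a rough sketch of that add-the-characters approach (the function name and table size parameter are just placeholders):

    /* sums the character codes and folds the total into the table range */
    unsigned sum_hash(const char *s, unsigned table_size)
    {
        unsigned sum = 0;
        while (*s)
            sum += (unsigned char)*s++;
        return sum % table_size;
    }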
Keys to success: Ability, ambition and opportunity.
Here's a fairly comprehensive page I found:
http://swww.ee.uwa.edu.au/~plsd210/ds/hash_tables.html

Apparently, you're trying to resolve collisions by linear probing--skipping to the next hash value in the table. This requires knowledge beforehand of exactly how long your table will be. If you don't have this knowledge, you will run into problems.

A common hashing technique for strings is to make a polynomial value out of the string. Ideally, if you had the string "CAT", you could make a unique value from it in this way:
key = 'C'*26^2 + 'A'*26 + 'T'
A common implementation is here:
    int hash_string(const char *stg)
    {
        int value = 0;
        while (*stg)
        {
            value += *stg + (value << 5); /* i.e. value = value*33 + character */
            ++stg;
        }
        return value;
    }

This is different from the ideal because the multiplier is 33 instead of 26, and there's no accounting for the way ASCII gives non-zero values to letters and such, but it's a decent and fairly fast method. It will overflow fairly quickly, but a given word will always hash to the same value. Just take this key and modulo it with the size of your table.
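One caution on that final modulo: since hash_string returns a signed int, overflow can leave a negative value, and in C a negative value modulo the table size is still negative. Casting to unsigned first sidesteps that (TABLE_SIZE here is just a placeholder for your table's length):

    unsigned index = (unsigned)hash_string(word) % TABLE_SIZE;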

Finally, I'll address your question of searching for duplicates. If you're using linear probing to resolve collisions, you have two related problems:
1) If you want to avoid duplicates, you need a way to check whether a key already sitting at your hashed position was put there as a probe of a prior key, or is the same value you're inserting.
2) Once your hash table is put together and you're searching for data, you'll have to verify that the value you hashed to is actually the value you're looking for; it might be there from a linear probe of another value. Furthermore, how do you know when to stop looking for your value?

The solution will have to do with checking the value and hash key of the data already in the table. I'm really not sure what the best way to do it is; I personally always use chaining (an array of linked lists) when I do hash tables. Hope that gives you some places to look. A probing lookup along those lines is sketched below.
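For what it's worth, here is a rough sketch of the probing-based answer to both problems: store the whole word in each slot, compare strings while probing, and stop at the first never-used slot. The struct mirrors CoiN's, but the names and sizes are illustrative:

    #include <string.h>

    #define TABLE_SIZE 256

    struct entry {
        char word[64];
        char used_flag; /* 'y' once a word has been stored here */
    };

    struct entry hash_table[TABLE_SIZE];

    /* returns the slot holding word, or -1 if it isn't in the table */
    int find_word(const char *word, unsigned hash)
    {
        unsigned start = hash % TABLE_SIZE;
        unsigned pos = start;

        /* an unused slot means the probe chain ends here: not found */
        while (hash_table[pos].used_flag == 'y')
        {
            if (strcmp(hash_table[pos].word, word) == 0)
                return (int)pos; /* same string, not just the same hash */
            pos = (pos + 1) % TABLE_SIZE;
            if (pos == start)
                break; /* wrapped all the way around: table is full */
        }
        return -1;
    }

Calling find_word before each insertion and only storing the word when it returns -1 is the one check that keeps the duplicates out.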

Yes, listen to stoffel and use buckets to store duplicate hashes.

