Advertisement

What no one writes tutorials on...

Started by December 17, 2000 02:30 AM
6 comments, last by RyuKage 24 years ago
Seems like there''s at least one tutorial or article on any topic anyone could possibly want to know about, except of course for dinky little things that are incredibly trivial, yet quite enigmatic to some of us. Now I''m playing around with a little general purpose 2D RPG engine, something along the lines of Verge2 (but hopefully more robust). One of the things I want to do with it is to allow the scripts to request resources (sounds, sprites, maps, etc.) by filename, but not reload them if they''re already in memory. I would think that with long filenames and possibly having to dig through 10 or 20 entries in the resource manager, string comparisons would take too long. So I thought if I could hash the filename into an unsigned int, I could index loaded resources by that integer handle. However, I can''t think of an algorithm to hash the strings to unique integer values, and no one seems to publish that sort of thing on sites like this. Anyone have an algorithm like that (preferably case-insensitive), or a suggestion for a better way to keep track of what''s already loaded?
If you want a decent hash, why not try CRC32? You should be able to find implementations with your favorite search engine...
Advertisement
Hashing algorithms do not produce unique values. That would be a translation algorithm such as translating a string of character digits to an integer representation. Hashing algorithms produce an index of an entry in a hash table. The hash table is basically a pointer to a list of all values that have hashed to that index. A linked list or array could be viewed as a hash table with one entry. The main thing is to choose a size for the hash table. Then your goal is to develop an algorithm that will evenly distribute the entries across that table. A simple algorithm is a table with 26 entries and using the first character of the string. That assumes the first character has to be alphabetic. Since it is a hashing algorithm you can hash both upper and lower case to the same entry. Now if all of your strings start with ''A'' then that isn''t a good algorithm. Generally a better algorithm is to treat the string as an array of integers, sum them and do a modulus divide by the number of entries in the table.
Keys to success: Ability, ambition and opportunity.
Well, thanks for the run-down of hashing algorithms, but if you read my problem description, that''s not what I need. I guess I was slightly mistaken on the meaning of "hash".

Well, anyway, I need to convert a string containing a legal Win9x/2000/ME filename to something just as unique but that can be compared more quickly during a search of loaded resources (preferably an unsigned int), assuming that one run of the algorithm takes less time than 10 or so stricmp()s of strings greater than 10 characters in length.

I''ll look for CRC32, but that might be overkill for this situation...
quote: Original post by LilBudyWizer
Hashing algorithms do not produce unique values.


Yes and no. You can''t pick some random hash algorithm off the street and expect it to work on your data set without collisions. However, knowing your data set, you can come up with perfect hashes where no collisions occur.

quote:
Generally a better algorithm is to treat the string as an array of integers, sum them and do a modulus divide by the number of entries in the table.


Even better is something like CRC32 or similar. Generally, CRC32 values (32-bit integers) are used to verify that the contents of some data are intact (CRC== cyclic redundancy check). Hand it a chunk of data and get back an integer. Do it again later (say, after passing it across a network) and, if it has the same integer value, it is intact.

Because the integer is 32-bits, you''ve got a range of 4 billion plus ids. Hashing does not require storing a lookup table at all if there are no collisions, since the objective is to turn your string into an integer and not necessarily remember the strings. Only if there are collisions do you need some way to resolve the conflict. This is VERY unlikely with CRC32 (which is better than a simple sum and modulus). Even if it happened with CRC32, you can do better than a simple table of linked lists to resolve collisions.




---- --- -- -
Blue programmer needs food badly. Blue programmer is about to die!
---- --- -- -
New York. New York. New York. Texas. Texas. New York. New York. Canada.
---- --- -- -
quote: Original post by RyuKage

Well, thanks for the run-down of hashing algorithms, but if you read my problem description, that''s not what I need. I guess I was slightly mistaken on the meaning of "hash".

Well, anyway, I need to convert a string containing a legal Win9x/2000/ME filename to something just as unique but that can be compared more quickly during a search of loaded resources (preferably an unsigned int), assuming that one run of the algorithm takes less time than 10 or so stricmp()s of strings greater than 10 characters in length.

I''ll look for CRC32, but that might be overkill for this situation...



But you are asking for a hashing algorithm: to turn a string into a unique integer is the most common kind of hash. And there is no need to store huge tables or lists of strings when checking for collisions if you do things right.

CRC32 is not that difficult. Actually quite simple. It is widely used because it works very well; the only thing "complex" about it is its data table, which is very easily generated once with a separate program.

If you decide against CRC32, then you should take a moment to analyze the types of strings you are sending in and come up with a hash to fit the patterns you find. For example, you''re sending in filenames. Do they have paths? It might make sense to have a dual hash: one hash algorithm for the path (16-bits), a separate hash function for the filename (16-bits), and the two are joined together to make a single 32-bit int. I haven''t tested this particular hash algorithm, so you might find something better.

But for pre-existing, well-tested, widely-used hashing functions (although it''s normally called a redundancy check), CRC32 is pretty decent.

---- --- -- -
Blue programmer needs food badly. Blue programmer is about to die!
---- --- -- -
New York. New York. New York. Texas. Texas. New York. New York. Canada.
---- --- -- -
Advertisement
Binary Linked Tries.
quote: Original post by mossmoss
But you are asking for a hashing algorithm: to turn a string into a unique integer is the most common kind of hash. And there is no need to store huge tables or lists of strings when checking for collisions if you do things right.

CRC32 is not that difficult. Actually quite simple. It is widely used because it works very well; the only thing "complex" about it is its data table, which is very easily generated once with a separate program.

But for pre-existing, well-tested, widely-used hashing functions (although it''s normally called a redundancy check), CRC32 is pretty decent.


Well, I thought I was looking for a hashing algorithm, but LilBudyWizer implied otherwise. At any rate, thanks for actually answering the question I asked, and with enough detail for me to be sure you know what you''re talking about. It sounds like CRC32 will probably work.

This topic is closed to new replies.

Advertisement