Building a search engine
I'm currently making a search engine, and I'm wondering how to implement it.
Starting with the spider... I think I've got most of it down.
First off, here are the things I need.
I was thinking of a webindex file, which stores URLs indexed by urlid (which is a GUID for URLs), as well as a second file which stores urlids indexed by the URL (so I can very quickly go from url -> urlid -> url).
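A minimal in-memory sketch of that two-way mapping, assuming the urlid is a GUID (the names url_to_id / id_to_url are just stand-ins for your two files):

```python
import uuid

# In-memory stand-ins for the two on-disk files:
# one maps url -> urlid, the other maps urlid -> url.
url_to_id = {}
id_to_url = {}

def get_or_create_urlid(url):
    """Return the existing urlid for a url, or mint a new GUID for it."""
    if url in url_to_id:
        return url_to_id[url]
    urlid = uuid.uuid4()
    url_to_id[url] = urlid
    id_to_url[urlid] = url
    return urlid

urlid = get_or_create_urlid("http://example.com/")
print(urlid, id_to_url[urlid])
```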
I would then have another file, which stores a structure holding all the urlids that link to a particular urlid, as well as that site's current PageRank (for PageRanking).
The problem is designing a file format that can store any number of links while still letting me search it efficiently (using binary search, so for example with 2^32 records, or about 4 billion URLs, it takes only 32 iterations to find a particular record, which is about the only way you can deal with data of the volumes that things like this have).
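Here's a rough sketch of the fixed-size-record idea, assuming (my assumption, not your format) each record is a 16-byte urlid key followed by a fixed-length payload and the file is kept sorted by key:

```python
import os

KEY_SIZE = 16        # e.g. a GUID / urlid
PAYLOAD_SIZE = 48    # fixed-length payload (links, pagerank, ...)
RECORD_SIZE = KEY_SIZE + PAYLOAD_SIZE

def find_record(f, key):
    """Binary-search a file of sorted fixed-size records for `key`.

    With 2**32 records this reads at most ~32 records. `f` must be
    opened in binary mode ('rb').
    """
    f.seek(0, os.SEEK_END)
    lo, hi = 0, f.tell() // RECORD_SIZE
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid * RECORD_SIZE)
        record = f.read(RECORD_SIZE)
        record_key = record[:KEY_SIZE]
        if record_key == key:
            return record[KEY_SIZE:]   # found: return the payload
        elif record_key < key:
            lo = mid + 1
        else:
            hi = mid
    return None                        # not found
```

The catch with fixed-size records is the "any number of links" part; one common workaround is to store an overflow pointer in the payload, but that's a design choice on your side.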
Now for the "search" part, I need to figure out which keywords correspond to which urlids (and by how much they correspond).
That has two parts: you need to find which words are "keywords" for a particular urlid, and you need to create an index of them, along with some relevant data.
What I was thinking of doing is creating an index with the number of times a particular word has been seen in total, as well as a list of all the urlids that contain this particular word and by "how much" they are relevant to this keyword. (This suffers from the same problem as the PageRank index.)
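One way that entry could be laid out (just a sketch, field names are made up):

```python
from dataclasses import dataclass, field

@dataclass
class KeywordEntry:
    total_count: int = 0                          # times the word has been seen overall
    postings: dict = field(default_factory=dict)  # urlid -> relevance score

keyword_index = {}  # word -> KeywordEntry

def add_occurrence(word, urlid, relevance):
    """Record one page's contribution to a keyword's entry."""
    entry = keyword_index.setdefault(word, KeywordEntry())
    entry.total_count += 1
    entry.postings[urlid] = entry.postings.get(urlid, 0.0) + relevance
```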
I was thinking of adding a "related words" field to the index, so that I can find words that appear in the same sentences as a keyword. Maybe adding "relatedness" as part of the heuristic used to determine the keywords would help here?
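If it helps, co-occurrence within a sentence can be counted with something as crude as this (whitespace tokenisation is an assumption; you'd plug in your own sentence splitter):

```python
from collections import defaultdict
from itertools import combinations

# word -> {other word -> number of sentences they appeared in together}
related = defaultdict(lambda: defaultdict(int))

def add_sentence(sentence):
    words = set(sentence.lower().split())
    for a, b in combinations(sorted(words), 2):
        related[a][b] += 1
        related[b][a] += 1

add_sentence("the quick brown fox jumps over the lazy dog")
print(dict(related["fox"]))
```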
The reason I want to know how many times a word has been seen is that if I take that total and divide it by the total number of web pages seen, I get the average number of times the word appears per web page. I can then take the number of times it was seen on a particular page, subtract the mean, and divide by the mean, which gives something similar-ish to a standard deviation (more precisely, a relative deviation from the mean). I'm not sure about it, but it seems to work on paper.
From that number I can pick the best few words from the website to add to the index.
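Written out, that calculation and the "best few words" selection would look roughly like this (a sketch of the formula you described; totals and k are names I've made up):

```python
def word_score(count_on_page, total_count, total_pages_seen):
    """(count_on_page - mean) / mean, where mean is the word's average
    count per page over everything crawled so far."""
    mean = total_count / total_pages_seen
    return (count_on_page - mean) / mean

def best_keywords(page_word_counts, totals, total_pages_seen, k=10):
    """Pick the k words on this page that are most over-represented
    compared to their average frequency across all pages seen."""
    scored = {
        word: word_score(count, totals[word], total_pages_seen)
        for word, count in page_word_counts.items()
        if word in totals
    }
    return sorted(scored, key=scored.get, reverse=True)[:k]
```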
So, basically a normal spider run would go like this (see the sketch after this list):
Grab the first link from the list of links that haven't been visited yet.
Get the website from the current link.
Get a new list of links from the page.
Add new records to the webindex files for the new links.
Add the links to the PageRank index.
Scan the page to determine what words are on it.
Look up the index to find the statistics of those words.
Find the best keywords using the index.
Store the urlid under those keywords.
Go grab another URL from the list of links that haven't been visited yet.
Loop until enough sites have been crawled or the time runs out.
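Strung together (and reusing the helpers sketched above), one pass of that loop could look roughly like this. fetch_page, extract_links and extract_words are deliberately naive placeholders, not a real crawler:

```python
import re
import urllib.request
from collections import Counter, defaultdict, deque

incoming_links = defaultdict(set)   # urlid -> set of urlids that link to it
totals = Counter()                  # word -> times seen across all pages
frontier = deque(["http://example.com/"])
seen = set()

def fetch_page(url):
    # Very naive fetch; a real spider needs robots.txt, politeness delays, etc.
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def extract_links(html):
    return re.findall(r'href="(http[^"#]+)"', html)

def extract_words(html):
    text = re.sub(r"<[^>]+>", " ", html)   # crude tag stripping
    return re.findall(r"[a-z]+", text.lower())

def crawl(max_pages):
    pages_crawled = 0
    while frontier and pages_crawled < max_pages:
        url = frontier.popleft()                  # grab an unvisited link
        if url in seen:
            continue
        seen.add(url)
        urlid = get_or_create_urlid(url)
        try:
            page = fetch_page(url)                # get the website
        except OSError:
            continue
        for link in extract_links(page):          # new list of links
            link_id = get_or_create_urlid(link)   # new webindex records
            incoming_links[link_id].add(urlid)    # feed the pagerank index
            if link not in seen:
                frontier.append(link)
        counts = Counter(extract_words(page))     # words on the page
        totals.update(counts)                     # word statistics
        for word in best_keywords(counts, totals, len(seen)):
            add_occurrence(word, urlid, counts[word])  # store the urlid per keyword
        pages_crawled += 1                        # loop until done
```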
Then I need to somehow sort a gigabyte or more of data so that the database is quickly searchable. Radix sort seems to be the choice for this.
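For data that doesn't fit in memory you'd radix-sort chunks and merge them, but the in-memory LSD pass itself is short; a sketch for fixed-width byte keys (e.g. the leading urlid bytes of each record):

```python
def radix_sort_records(records, key_size):
    """LSD radix sort of records keyed on their first `key_size` bytes.

    O(key_size * n) rather than O(n log n), which is why it's attractive
    for very large record files. Every record must be >= key_size bytes.
    """
    for byte_pos in reversed(range(key_size)):
        buckets = [[] for _ in range(256)]
        for record in records:
            buckets[record[byte_pos]].append(record)
        records = [r for bucket in buckets for r in bucket]
    return records

data = [b"cab", b"abc", b"bca", b"aaa"]
print(radix_sort_records(data, key_size=3))  # [b'aaa', b'abc', b'bca', b'cab']
```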
Now, search is the part I don't really know how to improve. The problem is that there are millions of results per keyword, and you have to search through them all. I was thinking of a hash table?
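A hash table gets you from a keyword to its postings list in one lookup; the expensive part is the millions of postings themselves, which is why they're usually stored pre-sorted by score so you only read the best ones. Reusing the keyword_index sketch from above, a multi-word query could be merged like this:

```python
def search(query, keyword_index, top_n=20):
    """Look up each query word in the keyword -> postings hash table and
    sum the relevance scores, returning the best urlids first."""
    scores = {}
    for word in query.lower().split():
        entry = keyword_index.get(word)          # O(1) hash lookup
        if entry is None:
            continue
        for urlid, relevance in entry.postings.items():
            scores[urlid] = scores.get(urlid, 0.0) + relevance
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```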
Comments, suggestions?