Robot Webcrawler
For my Systems Programming class I made a webcrawler that, given a list of links, crawls each one, extracts the anchor and image references, and writes those out to a new file. It isn't recursive, but if I copy the output file over the input file I can run it multiple times. I'm still ironing out some things, such as handling "301 page moved" responses and references that use JavaScript, and I haven't coded it to respect robots.txt files yet either.
Has anyone else coded this type of program? Did you run into any frustrating roadblocks? I'm trying to work through them one at a time. It is also annoying when I think my program is malfunctioning because I'm getting references with email addresses in them, and then I track it down to someone not coding the reference correctly (<a href="webmaster@thissite.com">email me</a>, forgetting the mailto:). If I don't catch these, the crawler gets lots of 404 Not Found pages when it tries to look up the bogus links (a rough filter for this is sketched at the end of this post).
The only restriction my professor put on the project is that it has to use shared objects and work as a pipeline (this is the actual objective of the project). This restriction makes it hard to make it recursive, because I can't feed the output files directly back into the input (it doesn't fit with his model).
Also, did you add any useful features onto yours? Thinking about it, I should probably add some extra output about bad URL references, but then I'd have to save more information about which page referenced what page :(
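A minimal sketch of one way to filter out those forgotten-mailto references before they get queued as URLs (the helper below is illustrative, not part of the actual crawler):

```cpp
#include <string>

// Heuristic: an href with no scheme whose host part contains '@' is almost
// certainly an email address that is missing its "mailto:" prefix.
bool looksLikeBareEmail(const std::string& href)
{
    // If there is an explicit scheme ("http:", "mailto:", ...), trust it.
    std::string::size_type colon = href.find(':');
    std::string::size_type slash = href.find('/');
    if (colon != std::string::npos &&
        (slash == std::string::npos || colon < slash))
        return false;

    // No scheme: an '@' appearing before any '/' looks like user@host.
    std::string::size_type at = href.find('@');
    return at != std::string::npos &&
           (slash == std::string::npos || at < slash);
}

// e.g. looksLikeBareEmail("webmaster@thissite.com") == true,
//      looksLikeBareEmail("http://thissite.com/contact.html") == false
```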
1. Relative addresses are evil!
2. Look for things that shouldn't exist (i.e. links that don't have .htm, .html, etc.) or that you shouldn't look up (.asp files, for example).
3. Be sure to see that the domains exist before adding them as emails/links. (Have a DNS server handy. Just keep one list of "pending" domain names and another for "confirmed" domain names. You send off a request for each pending domain as you put it on the list, and when the answer comes back you move it to the other list. There's a sketch of this just after the list.)
4. Look out for mangled HTML!
5. Remove any error pages! (Look at the title, and you can work it out from there.)
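A minimal sketch of the pending/confirmed check from point 3, using a blocking getaddrinfo lookup (a real spider would run these lookups in a worker thread so the crawl doesn't stall on DNS):

```cpp
#include <netdb.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <cstring>
#include <set>
#include <string>

// Returns true if the host name resolves at all.
bool domainExists(const std::string& host)
{
    addrinfo hints;
    std::memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;        // IPv4 or IPv6
    hints.ai_socktype = SOCK_STREAM;

    addrinfo* result = nullptr;
    int rc = getaddrinfo(host.c_str(), "80", &hints, &result);
    if (rc == 0)
        freeaddrinfo(result);
    return rc == 0;
}

// Cache confirmed domains so each one is resolved only once.
std::set<std::string> confirmed;

bool checkDomain(const std::string& host)
{
    if (confirmed.count(host))
        return true;
    if (!domainExists(host))
        return false;
    confirmed.insert(host);
    return true;
}
```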
That's about it, for now. (I made a spider ages ago, though.)
From,
Nice coder
There is absolutely no rule that HTML pages should end in .htm, .html or indeed have a . in their name at all.
This is simply a convention that is often used. Determine whether something is an HTML page by examining its "Content-Type" header (there's a sketch of this after these points).
Remember that URIs are in most cases case sensitive.
A lot of web robots ignore links to certain types of filenames because they're not usually HTML pages (e.g. .png, .jpg, .zip, etc.).
Not respecting robots.txt is considered rude.
Be careful that you don't spider any single site too quickly - anything more than 2 hits per minute would be generally considered rude.
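A minimal sketch of the Content-Type check mentioned above, assuming you already have the raw response headers in a string (the helper is illustrative, not from any particular library):

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Returns true if the response headers declare an HTML body.
// `headers` is assumed to hold the raw header block, one "Name: value" per line.
bool isHtmlResponse(const std::string& headers)
{
    std::string lower(headers);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return std::tolower(c); });

    std::string::size_type pos = lower.find("content-type:");
    if (pos == std::string::npos)
        return false;                          // no header: don't assume HTML

    std::string::size_type eol = lower.find('\n', pos);
    std::string value = lower.substr(pos + 13,
        eol == std::string::npos ? std::string::npos : eol - (pos + 13));

    return value.find("text/html") != std::string::npos ||
           value.find("application/xhtml+xml") != std::string::npos;
}
```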
---
What HTML parser are you using? What language are you using?
In the past I've used libxml2 (which, despite its name, contains an HTML parser). This seemed good.
Don't write your own HTML parser. HTML is MUCH harder to parse than XML, due to it usually not being well-formed and containing all sorts of other errors. Also bear in mind that some elements (mostly "script") need to be parsed completely differently. libxml2's HTML parser ignores the contents of script elements and doesn't return them in text nodes.
This is because, even if a script element contains all sorts of other elements, they should not be created as nodes in the document.
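For illustration, a minimal sketch of pulling href attributes out of a page with libxml2's HTML parser (error handling trimmed; the recovery flags are what let it cope with sloppy markup):

```cpp
#include <libxml/HTMLparser.h>
#include <libxml/tree.h>
#include <cstdio>
#include <cstring>

// Walk the parsed tree and print every <a href="..."> value.
static void printLinks(xmlNode* node)
{
    for (xmlNode* cur = node; cur != nullptr; cur = cur->next) {
        if (cur->type == XML_ELEMENT_NODE &&
            xmlStrcasecmp(cur->name, BAD_CAST "a") == 0) {
            xmlChar* href = xmlGetProp(cur, BAD_CAST "href");
            if (href) {
                std::printf("%s\n", reinterpret_cast<char*>(href));
                xmlFree(href);
            }
        }
        printLinks(cur->children);
    }
}

int main()
{
    const char* page = "<html><body><a href='/foo.html'>foo</a></body></html>";

    // HTML_PARSE_RECOVER lets libxml2 build a tree even from mangled markup.
    htmlDocPtr doc = htmlReadMemory(page, static_cast<int>(std::strlen(page)),
                                    "http://example.com/", nullptr,
                                    HTML_PARSE_RECOVER | HTML_PARSE_NOERROR |
                                    HTML_PARSE_NOWARNING);
    if (doc) {
        printLinks(xmlDocGetRootElement(doc));
        xmlFreeDoc(doc);
    }
    return 0;
}
```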
---
Always be mindful of the encoding of the web page you're loading. Not all pages on a site may be in the same encoding. You can check the headers to determine this.
I've used libxml2, and it automatically determines the encoding from the headers and returns all data in utf8 regardless of the original encoding. This is a GOOD THING :)
Mark
One entirely different approach would be to make a web spider as a Mozilla application, i.e. using Mozilla's platform.
This is probably the only reasonably sane way you could get it to run JavaScript, because you'd be able to get Mozilla to create an in-memory instance of a Mozilla HTML DOM object for the page to allow the JS to run.
Mark
Quote:
Original post by Nice Coder
1. Relative addresses are evil!
Your opinion perhaps, but they are still widely used so the spider must support them. Correctly.
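A minimal sketch of resolving a relative href against the URL of the page it appeared on. It only covers the common cases (absolute URLs, root-relative and path-relative links) and ignores ../ segments, query strings, and fragments; a production crawler should follow RFC 3986:

```cpp
#include <string>

// Resolve `href` against `base` (the URL of the page the link appeared on).
std::string resolveUrl(const std::string& base, const std::string& href)
{
    // Already absolute?
    if (href.find("://") != std::string::npos)
        return href;

    // Locate the end of "scheme://host" in the base URL.
    std::string::size_type hostStart = base.find("://");
    std::string::size_type pathStart =
        (hostStart == std::string::npos) ? std::string::npos
                                         : base.find('/', hostStart + 3);

    if (!href.empty() && href[0] == '/') {
        // Root-relative: keep scheme + host, replace the whole path.
        std::string origin = (pathStart == std::string::npos)
                                 ? base : base.substr(0, pathStart);
        return origin + href;
    }

    // Path-relative: strip everything after the last '/' in the base path.
    std::string::size_type lastSlash = base.rfind('/');
    if (hostStart != std::string::npos && lastSlash <= hostStart + 2)
        return base + "/" + href;          // base had no path component at all
    return base.substr(0, lastSlash + 1) + href;
}
```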
Quote:
2. Look for things that shouldn't exist (i.e. links that don't have .htm, .html, etc.) or that you shouldn't look up (.asp files, for example).
You can't possibly tell whether something will exist or not based on its URI. Different web sites have different naming conventions. My site might call all its pages "something.mr"
Quote:
3. Be sure to see that the domains exist before adding them as emails/links. (Have a DNS server handy. Just keep one list of "pending" domain names and another for "confirmed" domain names. You send off a request for each pending domain as you put it on the list, and when the answer comes back you move it to the other list.)
mailto: links should not be spidered. Specifically, your spider should understand protocols and know what protocols not to try to spider.
Needless to say, even if you tried to spider a mailto:, your URL fetcher should refuse to try and fetch it (as it's impossible).
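A minimal sketch of that protocol check: only hand a URL to the fetcher if its scheme is one the crawler actually knows how to fetch, so mailto:, javascript:, ftp: and so on get dropped before any socket work happens.

```cpp
#include <algorithm>
#include <cctype>
#include <set>
#include <string>

// Returns true only for schemes the fetcher can actually retrieve.
bool isFetchableScheme(const std::string& url)
{
    static const std::set<std::string> fetchable = { "http", "https" };

    std::string::size_type colon = url.find(':');
    if (colon == std::string::npos)
        return false;                 // relative URL: resolve it to absolute first

    std::string scheme = url.substr(0, colon);
    std::transform(scheme.begin(), scheme.end(), scheme.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    return fetchable.count(scheme) != 0;
}
```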
Quote:
4. Look out for mangled HTML!
Use a decent HTML parser, don't write your own.
Quote:
5. Remove any error pages! (Look at the title, and you can work it out from there.)
Nonsense. The title of the page should not be how you determine it's an error.
Use the HTTP status to determine whether something is an error page or not.
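A minimal sketch of reading the status code off the status line before trusting the body, assuming the raw response starts with something like "HTTP/1.1 200 OK":

```cpp
#include <cstdlib>
#include <string>

// Parse the status code out of the first line of a raw HTTP response.
// Returns 0 if the data doesn't look like an HTTP status line at all.
int httpStatus(const std::string& response)
{
    if (response.compare(0, 5, "HTTP/") != 0)
        return 0;
    std::string::size_type space = response.find(' ');
    if (space == std::string::npos)
        return 0;
    return std::atoi(response.c_str() + space + 1);
}

// Only 2xx responses should be saved or parsed for links; 3xx means "follow
// the Location header instead", and 4xx/5xx pages should be discarded.
bool isSuccess(int status) { return status >= 200 && status < 300; }
```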
Mark
Thanks for the replies.
I will respect the robots.txt once I finish up other things.
I throw out mailto: addresses, but I noticed some bad links where they forget the mailto: part. Is an @ not allowed in filenames?
Next on my list is to not save 404 errors (I just handled the 301 error, which I'd get from requesting a directory but leaving out the final '/', which a lot of links do).
For the time being, I'm using my own HTML parser. I throw out any links that don't parse well, which are usually ones that have JavaScript in them. It has a lot of error checking, and I think I have been lucky with the pages it has loaded so far, because none of them caused my program to crash (due to bad HTML).
More than 2 hits per minute is really considered rude? I would think that more than one a second would kind of be slamming the server (though with my local host it can run a couple hundred in a second). I guess I could limit the threading of that part to only have 1 or 2 fetching pages at a time.
I was in my car working out how to handle relative links; I will implement it tomorrow night and see how well it works out. If I don't handle them correctly, then it will recursively get worse each run [wow].
Oh, I am using C++ with sockets, in vi, on Linux. That is how my school classes are... hardcore (though C# would probably make the coding so much easier).
Right now I fetch files like zip/rar/jpg, so it almost creates a mirror of the site (though server-run scripts make a true mirror impossible).
The encoding of pages isn't set in the actual HTML but in the header? Since it is saving the page in binary, it shouldn't matter, right? The browser would have to determine the encoding. I will have to look into this part more as well.
Quote:
Original post by nprz
More than 2 hits per minute is really considered rude? I would think that more than one a second would kind of be slamming the server (though with my local host it can run a couple hundred in a second). I guess I could limit the threading of that part to only have 1 or 2 fetching pages at a time.
Remember that a web server is a shared resource which will probably have better things to do than serving your robot.
Some pages might take as much as 2 seconds of CPU time to generate if they use a lot of server-side logic (think of the gamedev.net "recent posts" page).
You don't want to significantly slow down the server.
Spiders such as Googlebot and Slurp (Yahoo) only send a couple of requests per minute to an individual site (but frequently continue to do so for hours at a time).
Of course Googlebot has a few million other sites to be spidering at the same time... :)
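A minimal sketch of per-host politeness: remember when each host was last fetched and wait until a fixed delay has passed before hitting it again. The 30-second delay below is just an example matching the couple-of-requests-per-minute guideline, not a standard value, and the class isn't thread-safe as written:

```cpp
#include <chrono>
#include <map>
#include <string>
#include <thread>

// Tracks the last fetch time per host and sleeps until `delay` has elapsed.
class PolitenessGate
{
public:
    explicit PolitenessGate(std::chrono::seconds delay) : delay_(delay) {}

    void waitForTurn(const std::string& host)
    {
        using clock = std::chrono::steady_clock;
        auto it = lastFetch_.find(host);
        if (it != lastFetch_.end()) {
            auto ready = it->second + delay_;
            if (ready > clock::now())
                std::this_thread::sleep_until(ready);
        }
        lastFetch_[host] = clock::now();
    }

private:
    std::chrono::seconds delay_;
    std::map<std::string, std::chrono::steady_clock::time_point> lastFetch_;
};

// Usage: PolitenessGate gate(std::chrono::seconds(30));
//        gate.waitForTurn("example.com");   // call before each request to that host
```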
Quote:
I was in my car working out how to handle relative links; I will implement it tomorrow night and see how well it works out. If I don't handle them correctly, then it will recursively get worse each run [wow].
A lot of lame robots handle relative links incorrectly. These are the kind which are designed mostly for malicious purposes (for example email spam-address scrapers). They don't respect robots.txt or robots meta either. I've seen these bots generating lots of 404 errors on my sites.
Quote:
Oh, I am using C++ with sockets, in vi, on Linux. That is how my school classes are... hardcore (though C# would probably make the coding so much easier).
I don't think your choice of editor makes any difference to the program.
Ensure that you're not using C-strings anywhere in your code if possible - be aware that HTML pages can contain embedded nulls and that that's theoretically valid.
Also, if you have not converted the page to utf8, it might be in an encoding which is multibyte or can contain embedded nulls.
A good test is to have your spider look at a site which is in UTF16.
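A minimal sketch of keeping page data length-delimited rather than NUL-terminated: read from the socket into a std::string by explicit byte count, so an embedded zero byte can't silently truncate the page the way strlen/strcpy handling would:

```cpp
#include <string>
#include <unistd.h>     // read()

// Read the whole response from a socket into a std::string.
// std::string stores an explicit length, so embedded '\0' bytes are preserved.
std::string readAll(int sockfd)
{
    std::string data;
    char buf[4096];
    ssize_t n;
    while ((n = read(sockfd, buf, sizeof(buf))) > 0)
        data.append(buf, static_cast<std::string::size_type>(n));  // length-based append
    return data;
}
```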
A bot could choose to ignore pages that are in an encoding it doesn't understand. If you're only spidering pages written in English or Western European languages, you can probably ignore all pages written in encodings other than
- US-ASCII
- utf8
- iso-8859-1 and closely related ones
- other aliases for encodings very similar to iso-8859-1
So anything in 8-bit encodings for Eastern European, Russian, Arabic, etc., or any of the East Asian multibyte encodings, would be ignored.
Quote:
The encoding of pages isn't set in the actual HTML but in the header?
In some cases it's sent in the header. In other cases, it's in the HTML as a META HTTP-EQUIV tag.
Your engine should support both, and have a policy for handling situations where they disagree.
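A minimal sketch of picking the charset up from both places, with the simple policy of preferring the HTTP header when the two disagree (the helpers are illustrative):

```cpp
#include <algorithm>
#include <cctype>
#include <string>

static std::string toLower(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    return s;
}

// Extract "charset=..." from a blob of text (HTTP headers or HTML source).
// Returns an empty string if no charset declaration is found.
static std::string findCharset(const std::string& text)
{
    std::string lower = toLower(text);
    std::string::size_type pos = lower.find("charset=");
    if (pos == std::string::npos)
        return "";
    pos += 8;
    if (pos < lower.size() && (lower[pos] == '"' || lower[pos] == '\''))
        ++pos;                              // skip an optional opening quote
    std::string::size_type end = pos;
    while (end < lower.size() &&
           (std::isalnum(static_cast<unsigned char>(lower[end])) ||
            lower[end] == '-' || lower[end] == '_'))
        ++end;
    return lower.substr(pos, end - pos);
}

// Policy: trust the HTTP header first, fall back to the META tag in the body.
std::string pageCharset(const std::string& httpHeaders, const std::string& htmlBody)
{
    std::string fromHeader = findCharset(httpHeaders);
    if (!fromHeader.empty())
        return fromHeader;
    return findCharset(htmlBody);           // may still be empty: caller must guess
}
```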
Browsers have some logic in them which does something like this:
- Does page appear to be in the encoding it says it's in? If so, use that encoding
- Otherwise, try to guess what encoding it's really in, and use that instead.
"Try to guess" is a very dangerous thing to do. I believe they look for strings of bytes which commonly occur in pages written in particular languages/encodings, and try to make sense accordingly.
The problem is that a lot of authors of non-western pages make pages which say they're in one encoding but are actually in another (An example is Russian).
Quote:
Since it is saving the page in binary, it shouldn't matter right? The browser would have to determine the encoding. I will have to look into this part more as well.
Depends what the purpose of your bot is.
If you're trying to read the text on the page, you absolutely need to know what encoding it's in, otherwise it will come out as gobbledegook.
Even if you just want to spider the site and read links, you'll still need to handle (for example) UTF-16 correctly.
Mark