Did your scraper examine only these metrics, or did you keep all the data, allowing you to examine it offline for other interesting statistics?
I find it hard to believe that 11.1% of links on gamedev.net point to this user's profile (edit: that user is a marketplace seller, so maybe that explains it).
We had a code freeze last Thursday, so we didn't have time to implement saving the crawler's data, which would have let us either resume crawling later or do more post-processing on it. If we continue developing the crawler, that's one feature I'd really like to add.
And it wasn't 11.1% of all GameDev links pointing to Arteria's profile; it was 11.1% of the top 10 links (which we probably could've made clearer in the output report), so really it's 11.1% of some much smaller percentage. We later added a feature to report what share of all links the top 10 account for, but that was after this crawl of GameDev. For example, Amazon's top 10 links account for 4.6% of all its links, and I imagine GameDev's figure is similar, so it'd be more like 11.1% of ~5%.
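Just to make that arithmetic concrete (using Amazon's 4.6% figure as a stand-in, since we don't have GameDev's actual top-10 share):

    top10_share = 0.046      # Amazon's top 10 links as a fraction of all its links (stand-in for GameDev)
    arteria_share = 0.111    # Arteria's share of those top 10 links
    print(arteria_share * top10_share)   # ~0.005, i.e. roughly 0.5% of all links on the site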
But I'd like to see how this was measured - was it checked against a fixed word list, or against a spell checker? If the former, were unknown words simply ignored? If the latter, how were things like proper nouns handled (or words simply not in the dictionary)? And I hope this wasn't biased towards US English - restricting the crawl to English Wikipedia isn't sufficient, since many pages use British English. The fact that the BBC comes out so badly makes me even more suspicious that this was an issue...
ETA: I also note the percentages. The BBC has 30% of words misspelled? And Wikipedia 1 in 5? Sorry - regardless of which site comes out better, there's no way the number of misspellings is anywhere near that high on any of those sites.
Yeah, the spellings are by far the least accurate results. Our professor gave us the dictionary that's attached. This is what we did to determine whether a word was correct: convert it to lower case and look it up in the dictionary (which we stored in a hash map). If it wasn't there and the word ended in apostrophe-S, take the root of the word (without the apostrophe-S) and look that up. If the root wasn't there either, we marked the word as misspelled. Every word was also stored in another hash map with a counter for how many times it appeared (two hash maps, actually: one for correct words and one for misspelled words).
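Roughly, the check looked like this (a simplified Python sketch, not our actual code; the file path and function names are made up):

    from collections import Counter

    def load_dictionary(path):
        # The professor's word list, one word per line, kept in a set for O(1) lookups.
        with open(path) as f:
            return {line.strip().lower() for line in f if line.strip()}

    def check_words(words, dictionary):
        correct = Counter()
        misspelled = Counter()
        for word in words:
            w = word.lower()
            if w in dictionary:
                correct[w] += 1
            elif w.endswith("'s") and w[:-2] in dictionary:
                # Fall back to the root word when the apostrophe-S form isn't listed.
                correct[w] += 1
            else:
                misspelled[w] += 1
        return correct, misspelled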
Crawling was done in a breadth-first manner. All URLs were visited, but we checked the header before downloading, and if the Content-Type wasn't "text/html" we aborted the download and visited the next URL. URL fragment IDs were removed from URLs to avoid visiting the same page twice.
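For anyone curious, the crawl loop was conceptually something like this (a Python sketch using requests/BeautifulSoup purely for illustration; our crawler wasn't actually written this way):

    from collections import deque
    from urllib.parse import urldefrag, urljoin

    import requests
    from bs4 import BeautifulSoup

    def crawl(start_url, max_pages=1000):
        # Breadth-first: a FIFO queue of URLs plus a set of URLs already seen.
        queue = deque([start_url])
        seen = {start_url}
        fetched = 0
        while queue and fetched < max_pages:
            url = queue.popleft()
            # Streamed GET so we can inspect the headers first and abort the
            # body download if the response isn't HTML.
            resp = requests.get(url, stream=True, timeout=10)
            if "text/html" not in resp.headers.get("Content-Type", ""):
                resp.close()
                continue
            html = resp.text
            fetched += 1
            yield url, html
            for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
                # Strip the fragment ID so page.html#top and page.html count as one URL.
                next_url, _ = urldefrag(urljoin(url, a["href"]))
                if next_url not in seen:
                    seen.add(next_url)
                    queue.append(next_url)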
[edit]
And for what it's worth, I think about 17% of the class said they accidentally DOS'd someone at some point...