
Some GameDev Stats...

Started by February 29, 2012 02:12 AM
29 comments, last by _the_phantom_ 12 years, 8 months ago

Did your scraper examine only these metrics, or did you keep all the data, allowing you to examine it offline for other interesting statistics?

I find it hard to believe that 11.1% of links on gamedev.net point to this user's profile (edit: that user is a marketplace seller, so maybe that explains it).

We had a code freeze last Thursday, so we didn't have time to implement saving data from the crawler to either resume crawling later or perform more post processing on the data. If we continue to develop the crawler further, this is one feature that I'd really like to add.

And it wasn't 11.1% of all GameDev links pointing to Arteria's profile, it was 11.1% of the top 10 links (which we probably could've made more clear in the output report), which is actually more like 11.1% of some smaller percentage (we later added a feature to tell us what percentage of links the top 10 links account for, but that was after this crawl of GameDev). For example, Amazon's top 10 links account for 4.6% of all their links, so I imagine GameDev's is something similar (so it'd be more like 11.1% of ~5%).
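Rough arithmetic for that point, for illustration only (the 4.6% is the measured Amazon figure mentioned above; applying a similar share to GameDev is the poster's guess, not a measured value):

```python
# "11.1% of the top 10 links" as a share of ALL links, assuming the
# top 10 account for roughly the same fraction as on Amazon (4.6%).
top10_share = 0.046            # top 10 links as a share of all links (Amazon)
profile_share_of_top10 = 0.111 # Arteria's profile within the top 10
overall = profile_share_of_top10 * top10_share
print(f"{overall:.2%}")  # prints 0.51%
```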


But I'd like to see how this was measured - were a set of words used, or was it against a spell checker? If the former, were unknown words presumably ignored? If the latter, how were things like proper nouns handled (or simply words not in the dictionary)? And I hope that this wasn't biased towards US English - restricting to English Wikipedia isn't sufficient, as many pages use British English. The fact that the BBC comes out so bad makes me even more suspicious that this was an issue...

ETA: I also note the percentage. The BBC has 30% of words misspelled? And Wikipedia 1 in 5? Sorry - no matter which site is better or not, there's no way that the number of misspellings is anywhere near that high on any of those sites.

Yeah, the spellings are by far the least accurate results. Our professor gave us the dictionary that's attached. Here's how we determined whether a word was correct: convert it to lower case and look it up in the dictionary (which we stored in a hash map). If it wasn't there and the word ended in apostrophe-S, take the root of the word (without the apostrophe-S) and look that up. If the root wasn't there either, we marked the word as incorrect. Every word was also stored in another hash map with a counter tracking how many times it appeared (we had two hash maps, one for correct words and one for misspelled words).
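For illustration, that lookup procedure might look like this in Python (the project's actual language isn't stated in the thread, and `load_dictionary`/`check_words` are invented names):

```python
# Sketch of the lookup described above. "dictionary" is the
# professor-supplied word list loaded into a set (hash-based lookup).
from collections import Counter

def load_dictionary(path):
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def check_words(words, dictionary):
    correct, misspelled = Counter(), Counter()
    for raw in words:
        word = raw.lower()          # convert to lower case first
        ok = word in dictionary
        # If not found and the word ends in apostrophe-S, retry the root.
        if not ok and word.endswith("'s"):
            ok = word[:-2] in dictionary
        # Two maps with counters: one for correct, one for misspelled.
        (correct if ok else misspelled)[word] += 1
    return correct, misspelled
```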



Crawling was done in a breadth-first manner. All URLs were visited, but we checked the header before downloading, and if the Content-Type wasn't "text/html" we aborted the download and visited the next URL. URL fragment IDs were removed from URLs to avoid visiting the same page twice.
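A minimal sketch of that crawl loop, assuming Python, with `head`, `fetch_html`, and `extract_links` as hypothetical stand-ins for the real HTTP and parsing code:

```python
# Breadth-first crawl: FIFO queue, Content-Type checked before the
# body is downloaded, fragment IDs stripped to deduplicate URLs.
from collections import deque
from urllib.parse import urldefrag

def normalize(url):
    # Drop the fragment so page.html#top and page.html#bottom
    # count as the same page.
    return urldefrag(url).url

def crawl(start_url, head, fetch_html, extract_links, limit=10000):
    queue = deque([normalize(start_url)])
    seen = {queue[0]}
    while queue and len(seen) <= limit:
        url = queue.popleft()
        # Check the Content-Type header first; skip non-HTML URLs.
        if not head(url).startswith("text/html"):
            continue
        for link in extract_links(fetch_html(url)):
            link = normalize(link)
            if link not in seen:
                seen.add(link)
                queue.append(link)   # breadth-first: append to the tail
    return seen
```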

[edit]

And for what it's worth, I think about 17% of the class said they accidentally DoS'd someone at some point...
[ I was ninja'd 71 times before I stopped counting a long time ago ] [ f.k.a. MikeTacular ] [ My Blog ] [ SWFer: Gaplessly looped MP3s in your Flash games ]
I guess there's the question of whether this is something your professor told you to do with that dictionary (in which case, you've done the technical achievement of actually implementing it, and that it was a poor way of doing it is another matter, and not your issue anyway), or if you were set the more general task of judging misspellings.

Some improvements (aside from starting with a better dictionary:)) I can think of:

* Manually check the top misspellings for each site, add any actually correct spellings to the dictionary, then repeat (perhaps several times). There's a risk of bias with this method, I think, but it will at least get you something more likely to be meaningful.

* Ignore one letter words.

* Keep clear of any sites that might be using non-US English.

* Judging by http, co, org, it looks like you're not filtering out text in HTML/XML tags? That would seem worth doing (and is something that would seem part of the programming exercise).
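The first two suggestions above could be sketched roughly like this (illustrative Python; `refine` and its arguments are invented names):

```python
# Re-filter the misspelling counts with a manually vetted whitelist
# merged into the dictionary, and skip one-letter tokens entirely.
def refine(misspelled_counts, dictionary, whitelist):
    dictionary = dictionary | whitelist   # manually verified additions
    return {
        word: count
        for word, count in misspelled_counts.items()
        if len(word) > 1                  # ignore one-letter "words"
        and word not in dictionary
    }
```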

http://erebusrpg.sourceforge.net/ - Erebus, Open Source RPG for Windows/Linux/Android
http://conquests.sourceforge.net/ - Conquests, Open Source Civ-like Game for Windows/Linux


I guess there's the question of whether this is something your professor told you to do with that dictionary (in which case, you've done the technical achievement of actually implementing it, and that it was a poor way of doing it is another matter, and not your issue anyway), or if you were set the more general task of judging misspellings.

Well, our previous assignment was to spellcheck a document and suggest corrections, and while there wasn't a strict method we had to use to determine if a word was correct or not, the dictionary he gave needed to form the foundation of that. So we used just about the same spell checking method as on the last assignment.


* Judging by http, co, org, it looks like you're not filtering out text in HTML/XML tags? That would seem worth doing (and is something that would seem part of the programming exercise).

We filtered out everything in script and style tags and HTML comments. Once those were all removed, we stripped out any remaining HTML tags and HTML special characters (i.e. everything between < and > and things like &nbsp;). Then we used what was left and assumed everything matching the pattern "\s*([A-z]+('[A-z]+)?)\s*" was a word. So "something.else" would show up as two words, which is why http, co, org, com, www, etc show up so much.
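That pipeline might be sketched as follows (Python for illustration). One caveat worth noting: in ASCII, the character class [A-z] also matches the punctuation between 'Z' and 'a' ([, \, ], ^, _, `), so the sketch uses [A-Za-z] instead:

```python
# Sketch of the stripping/tokenizing pipeline described above.
import re

def extract_words(html):
    # Remove script/style blocks and HTML comments first.
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    html = re.sub(r"(?s)<!--.*?-->", " ", html)
    # Strip remaining tags and entities like &nbsp;.
    html = re.sub(r"<[^>]*>", " ", html)
    html = re.sub(r"&\w+;", " ", html)
    # "something.else" splits into two words, which is why http, www,
    # org, com, etc. ranked so high.
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", html)
```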

But yes, I really like your ideas and would definitely implement them if we develop the crawler further. I just about added the word "a" to the dictionary because it's almost always the most "misspelled" word, but I figured I'd leave it and just note in our project report that the dictionary was probably the weakest part of the project.

a: 294011

How does one misspell letter 'a'?
Well, 's' is pretty close on the keyboard...

:D

We had a code freeze last Thursday, so we didn't have time to implement saving data from the crawler to either resume crawling later or perform more post processing on the data. If we continue to develop the crawler further, this is one feature that I'd really like to add.

Ok, cool. Any idea how long it took to scrape the 10910 pages listed on Gamedev?


And it wasn't 11.1% of all GameDev links pointing to Arteria's profile, it was 11.1% of the top 10 links (which we probably could've made more clear in the output report), which is actually more like 11.1% of some smaller percentage (we later added a feature to tell us what percentage of links the top 10 links account for, but that was after this crawl of GameDev). For example, Amazon's top 10 links account for 4.6% of all their links, so I imagine GameDev's is something similar (so it'd be more like 11.1% of ~5%).
That makes sense, I should have guessed that.

I actually did. We found the top 100 correctly spelled and misspelled words. From crawling 7,438 English Wikipedia pages, we found the following to be the top "misspelled" words (I say "misspelled" because, like I said, we weren't given a very good dictionary, so a number of them are actually correct words).

If you filter out words that were mistakenly converted to lowercase (February, American), names (Robert, Wikipedia), individual letters (a, b, c...), acronyms and abbreviations (ISBN, UTC, Inc...), URLs and files (http, www, org, pdf...), and Wikipedia-related words (wikiproject, oldid, sharealike), I'm not sure it contains any misspellings at all, to be honest. :)

Ok, cool. Any idea how long it took to scrape the 10910 pages listed on Gamedev?

Hmmm... I'm trying to remember (that's another thing I'd like to add to the report: running time)... Well, I know crawling Amazon with 4 threads visiting 41,632 pages took 6 hours. I think GameDev took ~7 hours with one thread (I accidentally just cleared my bash history, so I can't be exact about that).

@Wan: Yeah, for most of the sites their "top misspelled words" are really just "the most common words not in our sub-standard dictionary"


a: 294011


How does one misspell letter 'a'?
Just like that, apparently.

Well, 's' is pretty close on the keyboard...

:D

Sure, but how does one determine it was 'a' instead of 's', and not 'q' or 'b'?

Or how does one distinguish between 'then' and 'than'?

A dictionary lookup in a hash map isn't a valid criterion: it's not really checking spelling, merely testing against a known set of words.
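To illustrate the distinction: a real spell checker typically ranks candidate corrections, for example by Levenshtein edit distance over the dictionary. This is a generic sketch, not anything from the assignment:

```python
# Classic Levenshtein distance via row-by-row dynamic programming.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def suggest(word, dictionary):
    # Return the dictionary entries closest to the input word. Ties
    # (like 'a' vs 's' vs 'q' at distance 1) are exactly why distance
    # alone still can't pick the intended word.
    best = min(edit_distance(word, w) for w in dictionary)
    return sorted(w for w in dictionary if edit_distance(word, w) == best)
```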

This topic is closed to new replies.
