
Some GameDev Stats...

Started by February 29, 2012 02:12 AM
29 comments, last by _the_phantom_ 12 years, 8 months ago

Can you find out which are the most misspelled words on Wikipedia?


I too would find that interesting.

Obviously we are going to have fewer misspellings than blogs or Wikipedia since the unintelligent and uneducated masses generally contribute to those sites a lot more.


And with comments like this you wonder why anyone fails to take you even remotely seriously....?

I can't speak for others but I've taken to looking at posts with you involved just to see what utterly stupid thing you come out with next... and you never fail to disappoint...
This clearly proves you are an ignorant troll trying to do nothing more than provoke a response from me.[/quote]

This 'clearly proves' nothing.

I'm not looking to provoke a reply, I'm looking to satisfy my compulsion to correct falsehoods. If I can help you learn that projecting intellectual elitism and then spouting "hard truths" without any directly supporting evidence makes you sound like a poser, well, so much the better.
I am a good speller by asociation!

PROBLEM ENGLISH?

Obviously we are going to have fewer misspellings than blogs or Wikipedia since the unintelligent and uneducated masses generally contribute to those sites a lot more. This all seems intuitively apparent to me.

Out of curiosity, have you tried to post anything to Wikipedia? It's not as easy as people think to get stuff approved.

[quote name='SteveDeFacto' timestamp='1330484649' post='4917602']
Obviously we are going to have fewer misspellings than blogs or Wikipedia since the unintelligent and uneducated masses generally contribute to those sites a lot more. This all seems intuitively apparent to me.

Out of curiosity, have you tried to post anything to Wikipedia? It's not as easy as people think to get stuff approved.
[/quote]

I'm actually a regular contributor. The information is generally accurate but spelling and grammar errors go largely unfixed in some of the less popular topics.
Obviously we are going to have fewer misspellings than ... Wikipedia since the unintelligent and uneducated masses generally contribute to those sites a lot more. This all seems intuitively apparent to me.
What planet do you live on? GD has a bunch of kids who use txt spk, Wikipedia has a bunch of people who correct errors.

Based on the relatively high number of misspellings on the BBC, I doubt the validity of the spell-checking anyway.

Can you find out which are the most misspelled words on Wikipedia?

I actually did. We found the top 100 correctly spelled and misspelled words. From crawling 7438 English Wikipedia pages, we found the following to be the top "misspelled" (I say "misspelled" because, like I said, we weren't given a very good dictionary, so there are a number of actually correct words):



a: 294011
wikipedia: 75202
http: 57955
s: 48640
p: 33781
e: 33692
c: 33411
www: 32385
b: 30649
com: 29800
d: 28831
org: 28003
i: 27804
th: 20211
n: 19956
t: 19104
february: 18775
isbn: 18139
american: 18139
m: 16454
v: 15754
l: 14945
r: 14578
w: 14422
g: 14269
wikimedia: 13841
u: 13779
january: 12967
non: 12894
f: 12623
html: 12121
pdf: 11552
british: 11166
j: 10553
utc: 10485
uk: 10331
york: 10146
december: 9938
april: 9930
october: 9848
september: 9471
pp: 9470
h: 9392
july: 9378
june: 9359
k: 9202
php: 9114
november: 8976
x: 8439
european: 7991
co: 7980
wiki: 7899
wikiproject: 7833
st: 7744
inc: 7270
o: 7076
europe: 6925
london: 6624
america: 6568
africa: 6281
chinese: 6119
latin: 6111
bbc: 6003
google: 5821
internet: 5651
namespaces: 5600
oldid: 5532
sharealike: 5412
canada: 5354
australia: 5299
spanish: 5216
india: 5189
england: 4937
ii: 4727
y: 4718
france: 4715
african: 4654
australian: 4637
germany: 4554
david: 4512
contribs: 4458
gov: 4418
california: 4280
htm: 4214
fran: 4203
cambridge: 4141
william: 4053
italian: 3921
asia: 3911
san: 3814
ireland: 3763
james: 3728
indian: 3727
zealand: 3699
espa: 3585
britain: 3563
norsk: 3557
q: 3540
robert: 3523
ol: 3498
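
For anyone wondering how a list like that can come out of a spell check, here is a minimal sketch. It is an assumption about the general approach (lowercase everything, split on non-letters, look each token up in a word list), not the actual crawler code, and `DICTIONARY`, `top_unknown_words` and the sample sentence are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny word list; the dictionary actually used
# by the crawler is not given in the thread.
DICTIONARY = {"map", "of", "is", "at"}

def top_unknown_words(text, dictionary, n=10):
    """Count lowercase alphabetic tokens that are missing from the dictionary."""
    # Lowercasing and splitting on anything that is not a-z turns
    # "David's" into "david" + "s" and a URL into "http", "www", "co", ...
    tokens = re.findall(r"[a-z]+", text.lower())
    unknown = Counter(t for t in tokens if t not in dictionary)
    return unknown.most_common(n)

sample = "David's map of London is at http://www.bbc.co.uk"
for word, count in top_unknown_words(sample, DICTIONARY):
    print(f"{word}: {count}")
```

With a tokeniser like this, possessives, URL fragments and proper nouns all land in the "misspelled" bucket, which would be consistent with entries such as `s`, `http`, `www`, `david` and `london` in the list above.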


I've attached the results output from our crawler for several of the sites we crawled. The Amazon one is kinda cool, because we also tracked customer reviews.
[ I was ninja'd 71 times before I stopped counting a long time ago ] [ f.k.a. MikeTacular ] [ My Blog ] [ SWFer: Gaplessly looped MP3s in your Flash games ]
Did your scraper examine only these metrics, or did you keep all the data, allowing you to examine it offline for other interesting statistics?

I find it hard to believe that 11.1% of links on gamedev.net point to this user's profile (edit: that user is a marketplace seller, so maybe that explains it).

All of this data seems obvious to me. The word "the" is probably the most commonly used word in the English language. Gamedev.net obviously would score lower than the other sites for words per image.
Yes, in particular every comment having a profile image.

Obviously we are going to have fewer misspellings than blogs or Wikipedia since the unintelligent and uneducated masses generally contribute to those sites a lot more. This all seems intuitively apparent to me.[/quote]
I'm not sure that's obvious at all. All three of these sites are edited by ordinary people, and it's not clear why some sites would obviously attract more misspellings than others. Maybe programmers are better at spelling than average - but whilst that's nice to hear, and not unreasonable to me, it goes against an obvious stereotype (since people would usually associate spelling with language, not geeky subjects). But the thing with Wikipedia is that misspellings can be corrected - so if it were really true, it would be interesting if Gamedev as a whole (including forums) does better than Wikipedia.

But I'd like to see how this was measured - were a set of words used, or was it against a spell checker? If the former, were unknown words presumably ignored? If the latter, how were things like proper nouns handled (or simply words not in the dictionary)? And I hope that this wasn't biased towards US English - restricting to English Wikipedia isn't sufficient, as many pages use British English. The fact that the BBC comes out so bad makes me even more suspicious that this was an issue...

ETA: I also note the percentage. The BBC has 30% of words misspelled? And Wikipedia 1 in 5? Sorry - no matter which site is better or not, there's no way that the number of misspellings is anywhere near that high on any of those sites.

ETA2: AFAICT, not a *single* one of those 100 top "misspellings" is actually a misspelling (even the ones which aren't proper words still aren't misspellings - they're abbreviations, names, letters and so on, presumably perfectly valid in their context). I think that must be the worst spelling analysis I have ever seen :) (Not getting at you Cornstalks - the other stats were still interesting. And if anything, it was worth it just to see SteveDeFacto jump to his "This evidence supports my prejudice" claims again ;)) What were the top 100 for the BBC, out of interest?

As a discussion point, what would be a good way to measure spellings/misspellings? And is there one that isn't hopelessly influenced by bias in the choice of test words or dictionary?
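
One possible starting point, as a rough sketch only: strip URLs, skip tokens that are unlikely to be misspellings (single letters, all-caps acronyms, capitalised words in mid-sentence), and accept a word if it appears in any of several dictionaries so that British and American spellings both pass. The word lists `US_WORDS` and `UK_WORDS`, the function name `misspelling_rate` and the sample text are hypothetical, and the capitalisation rule is a crude heuristic for proper nouns rather than a real solution.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")

def misspelling_rate(text, dictionaries):
    """Return (unknown_words, checked_count) after filtering unlikely candidates."""
    known = set().union(*dictionaries)   # a word is fine if ANY dictionary has it
    text = URL_RE.sub(" ", text)         # URLs are not prose
    unknown, checked = [], 0
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for i, match in enumerate(WORD_RE.finditer(sentence)):
            word = match.group()
            if len(word) == 1:                # stray letters, initials
                continue
            if word.isupper():                # acronyms such as BBC
                continue
            if i > 0 and word[0].isupper():   # capitalised mid-sentence: likely a name
                continue
            checked += 1
            if word.lower() not in known:
                unknown.append(word.lower())
    return unknown, checked

# Hypothetical toy word lists, just large enough for the sample text.
US_WORDS = {"the", "color", "of", "logo", "is", "received", "well",
            "in", "see", "for", "details"}
UK_WORDS = {"colour"}

sample = ("The colour of the BBC logo is recieved well in London. "
          "See http://www.bbc.co.uk for details.")

unknown, checked = misspelling_rate(sample, [US_WORDS, UK_WORDS])
print(unknown, f"{len(unknown)}/{checked} words flagged")
```

Passing only `[US_WORDS]` would flag "colour" as well, which is the US-English bias in miniature. The result still depends entirely on which dictionaries you pick, so this reduces the noise but does not answer the bias question.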

http://erebusrpg.sourceforge.net/ - Erebus, Open Source RPG for Windows/Linux/Android
http://conquests.sourceforge.net/ - Conquests, Open Source Civ-like Game for Windows/Linux

This topic is closed to new replies.
