Words on Wikipedia
33,535,571 × | the the the the the the the the the the the |
17,093,829 × | of of of of of of |
13,738,373 × | in in in in in |
13,188,068 × | and and and and |
10,759,896 × | a a a a |
10,443,624 × | to to to |
5,826,723 × | is is is is is is is is is is is is is is is is is is is |
5,008,425 × | was was was was was was was was was was was was was was was was was |
4,357,494 × | on on on on on on on on on on on on on on on |
4,337,447 × | for for for for for for for for for for for for for for |
Notes
What are the top 10 words on Wikipedia? How many times does the word "word" appear? Which 13-letter word is used 4 times more often than any other? Which of the words who, when, where, what and why is seen most frequently?
Wikipedia is an intriguing source of data. It's often quirky: the most common 9-letter word on Wikipedia is Wikipedia. It's often messy: of the top 10 words beginning with x, one is XML, one is Xbox, and another four are Roman numerals. But for sheer quantity of open data, it's irresistible.
Scale
Each word represents 3,000,000 words
Each word represents 300,000 words
Each word represents 30,000 words
Each word represents 3,000 words
Each word represents 300 words
Each word represents 30 words
Each word represents 3 words
Each word represents 1 word
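As a rough illustration of how a scale maps a count to the number of copies drawn, the sketch below rounds the count divided by the scale to the nearest whole number. The rounding rule is an assumption inferred from the top 10 rows above (for example, 11 copies of "the" at 3,000,000 words each); the function name `copies` is hypothetical and not taken from the original visualization.

```python
def copies(count: int, scale: int) -> int:
    """Return how many copies of a word to draw at a given scale.

    Assumption: count / scale is rounded to the nearest integer,
    done here with integer arithmetic to avoid float rounding quirks.
    """
    return (count + scale // 2) // scale


# Counts taken from the top 10 rows above.
print(copies(33_535_571, 3_000_000))  # "the" at 3,000,000 words per copy
print(copies(5_826_723, 300_000))     # "is" at 300,000 words per copy
```

With the counts listed above, this gives 11 copies for "the" and 19 for "is", matching the rows shown.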
Sources
I counted words on Wikipedia as follows:
1. I downloaded a snapshot of English-language Wikipedia from 1 April 2017.
2. I considered only articles, templates, media/file descriptions and primary meta-pages. I omitted edit history and discussion.
3. I analyzed only the core text. I did not consider headings or captions, since these would have skewed the results. For example, the words "further", "reading", "external" and "links" would have been massively overrepresented had I included headings such as "Further reading" and "External links".
4. I stripped HTML, Wiki markup and other non-English-language elements as far as possible.
5. I excluded words with accents. It's a shame to have lost words like café and résumé, but it simplified the task significantly.
6. I treated different forms of a word as different words (e.g. think, thinks, thinking, thought and thoughts are each counted separately).
7. I split hyphenated words into separate words (e.g. up-to-date into three separate words, up, to and date), but kept words with apostrophes as single words (e.g. don't), stripping any apostrophe or apostrophe-s from the end of a word (e.g. reducing Einstein's to Einstein).
8. I made no attempt to remove non-words. For example, the words st (from the ordinal 1st), utc (from the abbreviation UTC), www (from URLs) and e (from e. e. cummings and other initials) are all in there.
9. I analyzed a huge amount of data, not all of it clean, so there are many peculiarities in the results.
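The word-splitting rules in steps 5 to 8 can be sketched in a few lines of Python. This is a minimal illustration, not the code actually used: it assumes the text has already been stripped of markup (step 4), and it assumes lowercasing, which the list does not state outright but which the "utc" example suggests. The names `TOKEN`, `PLAIN` and `count_words` are hypothetical.

```python
import re
from collections import Counter

# A token is a run of letters (any script), optionally joined by internal
# apostrophes. Digits, hyphens and punctuation all act as separators, so
# "up-to-date" yields three tokens and "1st" yields "st".
TOKEN = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")
# Tokens kept after filtering: unaccented a-z only, apostrophes inside.
PLAIN = re.compile(r"[a-z]+(?:'[a-z]+)*")


def count_words(text: str) -> Counter:
    counts = Counter()
    # Assumption: case is folded and curly apostrophes are normalised.
    text = text.lower().replace("\u2019", "'")
    for token in TOKEN.findall(text):
        # Step 7: drop a trailing apostrophe-s ("einstein's" -> "einstein").
        # A bare trailing apostrophe is never captured by TOKEN.
        if token.endswith("'s"):
            token = token[:-2]
        # Step 5: skip whole words containing accented letters, so "café"
        # is dropped rather than truncated to "caf".
        if PLAIN.fullmatch(token):
            counts[token] += 1
    return counts


sample = "Einstein's up-to-date café; don't think, thinks. 1st UTC"
print(count_words(sample))
```

On the sample string, "einstein", "up", "to", "date", "don't", "think", "thinks", "st" and "utc" are each counted once, and "café" is excluded.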
Date
First published 22 May 2017