The top 10 words on Wikipedia are:
33,535,571 × the
17,093,829 × of
13,738,373 × in
13,188,068 × and
10,759,896 × a
10,443,624 × to
5,826,723 × is
5,008,425 × was
4,357,494 × on
4,337,447 × for
There's a total of 538,692,767 words on Wikipedia.
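
To put these counts in proportion, each one can be divided by the total. The following is a minimal Python sketch that uses only the figures listed above; the names `TOP_10`, `TOTAL_WORDS` and `share` are illustrative and not part of the original analysis.

```python
# Counts copied from the top-10 list above.
TOP_10 = {
    "the": 33_535_571,
    "of": 17_093_829,
    "in": 13_738_373,
    "and": 13_188_068,
    "a": 10_759_896,
    "to": 10_443_624,
    "is": 5_826_723,
    "was": 5_008_425,
    "on": 4_357_494,
    "for": 4_337_447,
}
TOTAL_WORDS = 538_692_767


def share(count, total=TOTAL_WORDS):
    """Return count as a fraction of all words counted."""
    return count / total


# Fraction of all words accounted for by the ten most common ones.
top10_share = sum(TOP_10.values()) / TOTAL_WORDS

if __name__ == "__main__":
    for word, count in TOP_10.items():
        print(f"{word:>4}  {count:>12,}  {share(count):6.2%}")
    print(f"top 10 combined: {top10_share:.1%}")
```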


What are the top 10 words on Wikipedia? How many times does the word "word" appear? Which 13-letter word is used 4 times more often than any other? Which of the words who, when, where, what and why is seen most frequently?

Wikipedia is an intriguing source of data. It's often quirky: the most common 9-letter word on Wikipedia is Wikipedia. It's often messy: of the top 10 words beginning with x, one is XML, one is Xbox, and another four are Roman numerals. But for sheer quantity of open data, it's irresistible.
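
Questions such as "the most common 9-letter word" or "the top 10 words beginning with x" reduce to filtering a table of word counts by length or prefix and then sorting by count. Here is a small Python sketch of that idea. The function name `top_words` is hypothetical, and the counts in the example are toy values made up for illustration, not real Wikipedia figures.

```python
from collections import Counter


def top_words(counts, n=10, length=None, prefix=None):
    """Return up to n (word, count) pairs, most frequent first.

    length: keep only words with exactly this many characters.
    prefix: keep only words starting with this string.
    Ties are broken alphabetically so the result is deterministic.
    """
    matches = [
        (word, count)
        for word, count in counts.items()
        if (length is None or len(word) == length)
        and (prefix is None or word.startswith(prefix))
    ]
    matches.sort(key=lambda item: (-item[1], item[0]))
    return matches[:n]


# Toy counts for illustration only (NOT real Wikipedia data).
toy_counts = Counter(
    {"the": 9, "wikipedia": 5, "xml": 4, "following": 3, "xbox": 2, "xii": 1}
)

nine_letter = top_words(toy_counts, n=1, length=9)
starts_with_x = top_words(toy_counts, prefix="x")
```

With the toy data, `nine_letter` holds the single most frequent 9-letter word and `starts_with_x` holds every word beginning with x, ordered by count.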


I counted words on Wikipedia as follows:

1. I downloaded a snapshot of English-language Wikipedia from 1 April 2017.

2. I considered only articles, templates, media/file descriptions and primary meta-pages. I omitted edit history and discussion.

3. I analyzed only the core text. I did not consider headings or captions, since these would have skewed the results. For example, the words "further", "reading", "external" and "links" would have been massively overrepresented had I included headings such as "Further reading" and "External links".

4. I stripped HTML, Wiki markup and other non-English-language elements as far as possible.

5. I excluded words with accents. It's a shame to have lost words like café and résumé, but it simplified the task significantly.

6. I treated different forms of a word as different words (e.g. think, thinks, thinking, thought and thoughts are all counted separately).

7. I split hyphenated words into separate words (e.g. up-to-date into three separate words, up, to and date), but kept words with apostrophes as single words (e.g. don't), stripping any apostrophe or apostrophe-s from the end of a word (e.g. reducing Einstein's to Einstein).

8. I made no attempt to remove non-words. For example, the words st (from the ordinal 1st), utc (from the abbreviation UTC), www (from URLs) and e (from e. e. cummings and other initials) are all in there.

9. I analyzed a huge amount of data, not all of it clean, so there are many peculiarities in the results.
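
The word-level rules in steps 5 to 8 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the counts above: the names `tokenize`, `TOKEN` and `PLAIN_WORD` are hypothetical, lowercasing is an assumption (suggested by examples such as utc and st), only the straight apostrophe (') is handled, and the markup stripping of step 4 is taken as already done.

```python
import re
from collections import Counter

# A token is a run of letters, optionally with internal apostrophes and one
# trailing apostrophe. Hyphens, digits and punctuation act as separators,
# so "up-to-date" yields three tokens and "1st" yields "st".
TOKEN = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*'?")

# After cleaning, a word is kept only if it uses unaccented letters a-z
# (step 5), with apostrophes allowed inside the word (step 7).
PLAIN_WORD = re.compile(r"[a-z]+(?:'[a-z]+)*")


def tokenize(text):
    """Split plain text into words following steps 5 to 8 above."""
    words = []
    for raw in TOKEN.findall(text):
        word = raw.lower()  # assumption: counts are case-insensitive
        # Step 7: strip a trailing apostrophe-s or bare apostrophe.
        if word.endswith("'s"):
            word = word[:-2]
        elif word.endswith("'"):
            word = word[:-1]
        # Step 5: drop words containing accented letters.
        if PLAIN_WORD.fullmatch(word):
            words.append(word)
    # Step 6: no stemming. Step 8: non-words such as "st" are kept.
    return words


sample = "Einstein's up-to-date café; don't think, thinks. 1st UTC"
counts = Counter(tokenize(sample))
```

For the sample sentence, `tokenize` returns einstein, up, to, date, don't, think, thinks, st and utc, while café is dropped. On real data, `counts.most_common(10)` would give a top-10 list like the one at the start of this page.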


First published 22 May 2017

