Words on Wikipedia

The top 10 words on Wikipedia are:
33,535,571 × the
17,093,829 × of
13,738,373 × in
13,188,068 × and
10,759,896 × a
10,443,624 × to
5,826,723 × is
5,008,425 × was
4,357,494 × on
4,337,447 × for
There's a total of 538,692,767 words on Wikipedia.
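
For a sense of proportion, the counts above can be turned into shares of the total. The following is a minimal sketch that uses only the figures listed on this page; the names `TOP_10`, `TOTAL_WORDS` and `share` are illustrative.

```python
# Counts copied from the top-10 list above.
TOP_10 = {
    "the": 33_535_571,
    "of": 17_093_829,
    "in": 13_738_373,
    "and": 13_188_068,
    "a": 10_759_896,
    "to": 10_443_624,
    "is": 5_826_723,
    "was": 5_008_425,
    "on": 4_357_494,
    "for": 4_337_447,
}
TOTAL_WORDS = 538_692_767  # total word count quoted above


def share(count, total=TOTAL_WORDS):
    """Return count as a fraction of the total word count."""
    return count / total


# Print each word with its count and its share of all words.
for word, count in TOP_10.items():
    print(f"{word:>4} {count:>12,} {share(count):7.2%}")

top10_total = sum(TOP_10.values())
print(f"top 10 combined: {top10_total:,} ({share(top10_total):.2%})")
```

With these figures, the top 10 words add up to 118,289,450 occurrences, a little over a fifth of all the words counted.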

Notes

What are the top 10 words on Wikipedia? How many times does the word word appear? Which 13-letter word is used 4 times more often than any other? Which of the words who, when, where, what and why is seen most frequently?

Wikipedia is an intriguing source of data. It's often quirky: the most common 9-letter word on Wikipedia is Wikipedia. It's often messy: of the top 10 words beginning with x, one is XML, one is Xbox, and another four are Roman numerals. But for sheer quantity of open data, it's irresistible.

Sources

Wikipedia

I counted words on Wikipedia as follows:

1. I downloaded a snapshot of English-language Wikipedia from 1 April 2017.

2. I considered only articles, templates, media/file descriptions and primary meta-pages. I omitted edit history and discussion.

3. I analyzed only the core text. I did not consider headings or captions, since these would have skewed the results. For example, the words "further", "reading", "external" and "links" would have been massively overrepresented had I included headings such as "Further reading" and "External links".

4. I stripped HTML, Wiki markup and other non-English-language elements as far as possible.

5. I excluded words with accents. It's a shame to have lost words like café and résumé, but it simplified the task significantly.

6. I treated different forms of a word as different words (e.g. think, thinks, thinking, thought and thoughts are each counted separately).

7. I split hyphenated words into separate words (e.g. up-to-date into three separate words, up, to and date), but kept words with apostrophes as single words (e.g. don't), stripping any apostrophe or apostrophe-s from the end of a word (e.g. reducing Einstein's to Einstein).

8. I made no attempt to remove non-words. For example, the words st (from the ordinal 1st), utc (from the abbreviation UTC), www (from URLs) and e (from e. e. cummings and other initials) are all in there.

9. I analyzed a huge amount of data, not all of it clean, so there are many peculiarities in the results.
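
The word-splitting rules in steps 5 to 8 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used for the counts on this page: the function names `tokenize` and `count_words` are hypothetical, the regular expression is one possible choice, and the lowercasing is inferred from examples such as utc and st. Downloading the snapshot and stripping markup (steps 1 to 4) are not covered.

```python
import re
from collections import Counter

# Runs of letters with optional internal apostrophes. Hyphens, digits and
# punctuation are not matched, so they act as separators
# (up-to-date -> up, to, date; 1st -> st).
TOKEN = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def tokenize(text):
    """Yield words from text, roughly following steps 5 to 8 above."""
    text = text.replace("\u2019", "'")  # treat curly apostrophes as straight
    for match in TOKEN.finditer(text):
        word = match.group().lower()  # assumption: counts ignore case
        if not word.isascii():
            continue  # step 5: drop words with accents (cafe with an accent)
        if word.endswith("'s"):
            word = word[:-2]  # step 7: Einstein's -> einstein
        yield word  # step 8: no attempt to filter out non-words


def count_words(text):
    """Count every word form separately (step 6: no stemming)."""
    return Counter(tokenize(text))


sample = "Einstein's up-to-date thoughts: don't think, thinks, 1st UTC."
for word, count in count_words(sample).most_common():
    print(count, word)
```

A trailing apostrophe with no s (as in dogs') is dropped by the pattern itself, since an apostrophe only matches when letters follow it.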

Date

First published 22 May 2017
