Words on Wikipedia

The top 10 words on Wikipedia are:

33,535,571 ×	the
17,093,829 ×	of
13,738,373 ×	in
13,188,068 ×	and
10,759,896 ×	a
10,443,624 ×	to
5,826,723 ×	is
5,008,425 ×	was
4,357,494 ×	on
4,337,447 ×	for

There's a total of 538,692,767 words on Wikipedia
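To put the list in proportion, the share of all counted words taken up by the top 10 can be worked out from the figures above. The following is only a small illustrative sketch: the counts and the total are copied from this page, while the names `TOP_TEN`, `TOTAL_WORDS` and `share_of_total` are made up for the example.

```python
# Counts copied from the top-10 list on this page.
TOP_TEN = {
    "the": 33_535_571,
    "of": 17_093_829,
    "in": 13_738_373,
    "and": 13_188_068,
    "a": 10_759_896,
    "to": 10_443_624,
    "is": 5_826_723,
    "was": 5_008_425,
    "on": 4_357_494,
    "for": 4_337_447,
}

# Total word count quoted on this page.
TOTAL_WORDS = 538_692_767


def share_of_total(counts, total):
    """Return the fraction of `total` accounted for by the words in `counts`."""
    return sum(counts.values()) / total


share = share_of_total(TOP_TEN, TOTAL_WORDS)
print(f"Top 10 words: {sum(TOP_TEN.values()):,} of {TOTAL_WORDS:,} ({share:.1%})")
```

On these figures, the ten most common words make up a little over a fifth of everything counted.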

Notes

What are the top 10 words on Wikipedia? How many times does the word word appear? Which 13-letter word is used 4 times more often than any other? Which of the words who, when, where, what and why is seen most frequently?

Wikipedia is an intriguing source of data. It's often quirky: the most common 9-letter word on Wikipedia is Wikipedia. It's often messy: of the top 10 words beginning with x, one is XML, one is Xbox, and another four are Roman numerals. But for sheer quantity of open data, it's irresistible.

Scale

In the original visualization, each word drawn represents 3,000,000, 300,000, 30,000, 3,000, 300, 30, 3 or 1 words, depending on the scale shown.

Sources

Wikipedia

I counted words on Wikipedia as follows:

1. I downloaded a snapshot of English-language Wikipedia from 1 April 2017.

2. I considered only articles, templates, media/file descriptions and primary meta-pages. I omitted edit history and discussion.

3. I analyzed only the core text. I did not consider headings or captions, since these would have skewed the results. For example, the words "further", "reading", "external" and "links" would have been massively overrepresented had I included headings such as "Further reading" and "External links".

4. I stripped HTML, Wiki markup and other non-English-language elements as far as possible.

5. I excluded words with accents. It's a shame to have lost words like café and résumé, but it simplified the task significantly.

6. I treated different forms of a word as different words (e.g. think, thinks, thinking, thought and thoughts are all counted separately).

7. I split hyphenated words into separate words (e.g. up-to-date into three separate words, up, to and date), but kept words with apostrophes as single words (e.g. don't), stripping any apostrophe or apostrophe-s from the end of a word (e.g. reducing Einstein's to Einstein).

8. I made no attempt to remove non-words. For example, the words st (from the ordinal 1st), utc (from the abbreviation UTC), www (from URLs) and e (from e. e. cummings and other initials) are all in there.

9. I analyzed a huge amount of data, not all of it clean, so there are many peculiarities in the results.
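Steps 5 to 8 above can be sketched roughly as follows. This is not the code used for the counts on this page, just a minimal illustration of the rules as described. It assumes the text has already been stripped of markup (step 4). Lowercasing and the handling of curly apostrophes are assumptions of this sketch, and the names `TOKEN_RE`, `tokenize` and `count_words` are hypothetical.

```python
import re
from collections import Counter

# A token is a run of letters, optionally containing apostrophes between
# letters and optionally ending in an apostrophe. Hyphens, digits, spaces
# and punctuation all act as separators, so "up-to-date" gives three tokens
# and "1st" gives "st".
TOKEN_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*'?")


def tokenize(text):
    """Split text into words following steps 5 to 8 of the method above."""
    # Assumption: treat curly apostrophes like straight ones and ignore case.
    text = text.replace("\u2019", "'").lower()
    words = []
    for token in TOKEN_RE.findall(text):
        # Step 7: strip a trailing apostrophe-s or a trailing apostrophe.
        if token.endswith("'s"):
            token = token[:-2]
        elif token.endswith("'"):
            token = token[:-1]
        # Step 5: drop words containing accented (non-ASCII) letters.
        if token and token.isascii():
            words.append(token)
    # Step 6: no stemming, so "think" and "thinks" stay distinct.
    # Step 8: no filtering of non-words such as "st" or "www".
    return words


def count_words(text):
    """Count how often each word appears in the text."""
    return Counter(tokenize(text))


print(tokenize("Einstein's up-to-date café don't 1st"))
```

The regular expression matches whole runs of letters, including accented ones, so that a word like café is dropped as a unit rather than being cut down to "caf".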

Date

First published 22 May 2017

brought to you by Kootenay Village Ventures Inc.