first20hours / google-10000-english

This repo contains a list of the 10,000 most common English words in order of frequency, as determined by n-gram frequency analysis of Google's Trillion Word Corpus.


Why are there ~1500 duplicate words here?

farzher opened this issue

Shouldn't the list be deduplicated?

Yes, it looks like 20k.txt has 1470 duplicates, and the USA file has 10:

$ wc -l < 20k.txt 
   19999
$ sort 20k.txt | uniq | wc -l
   18529
$ wc -l < google-10000-english-usa.txt 
    9999
$ sort google-10000-english-usa.txt | uniq | wc -l
    9989
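
For reference, uniq -d prints only the repeated lines and uniq -c prefixes each line with its count, so the duplicates can be inspected directly (assuming GNU coreutils):

$ sort 20k.txt | uniq -d | wc -l              # how many distinct words repeat
$ sort 20k.txt | uniq -c | sort -rn | head    # most-repeated words first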

I don't know. Is it a case-sensitivity issue?
Do sort and uniq keep only one of "this" and "This"?
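
One way to test that directly is to compare a case-sensitive pass against a case-insensitive one (a quick check, assuming GNU sort/uniq):

$ sort 20k.txt | uniq | wc -l          # case-sensitive dedupe
$ sort -f 20k.txt | uniq -i | wc -l    # case-insensitive dedupe; a lower count here would point to casing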

It's not a case issue. It's exact duplicates. Check using any random dedupe tool.

[screenshot: dedupe tool output showing the duplicated entries]

Apparently "Word" is in there 9 times.
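
That's easy to confirm from the shell by counting exact whole-line matches (assuming the entry really is the capitalized "Word"):

$ grep -cx 'Word' 20k.txt    # -x matches the whole line, -c counts matching lines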

It seems that two different sources were combined into 20k.txt.
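
If so, the overlap with the 10k list should show it. comm can compare the two files once they're sorted (a rough check, not a confirmed explanation):

$ comm -12 <(sort -u 20k.txt) <(sort -u google-10000-english-usa.txt) | wc -l    # words common to both files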

I checked the frequency rankings of this list using 20k.txt, and the result is below.

[plot: frequency vs. rank for 20k.txt]

The original count_1w.txt, by contrast, shows a straight line.

[plot: frequency vs. rank for the original count_1w.txt]
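
To reproduce a plot like this, the ranks and counts can be joined with awk, assuming count_1w.txt is the usual one-word-and-count-per-line file (a sketch; words missing from count_1w.txt will print an empty count):

$ # rank = line number in 20k.txt, count looked up from count_1w.txt
$ awk 'NR==FNR { c[$1] = $2; next } { print FNR, c[$1] }' count_1w.txt 20k.txt > rank_vs_count.dat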


Great catch - not sure why the original source has duplicates. I appreciate the fix.
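
For anyone applying the same fix locally, an order-preserving dedupe is a one-liner; awk keeps the first occurrence of each line and drops later repeats without re-sorting the list:

$ awk '!seen[$0]++' 20k.txt > 20k-dedup.txt
$ wc -l < 20k-dedup.txt    # should now match the sort | uniq count above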