first20hours / google-10000-english

This repo contains a list of the 10,000 most common English words in order of frequency, as determined by n-gram frequency analysis of Google's Trillion Word Corpus.


Why are there ~1500 duplicate words here?

farzher opened this issue

Shouldn't the list be deduplicated?

Yes, it looks like 20k.txt has 1470 duplicates, and the USA file has 10:

$ wc -l < 20k.txt 
   19999
$ sort 20k.txt | uniq | wc -l
   18529
$ wc -l < google-10000-english-usa.txt 
    9999
$ sort google-10000-english-usa.txt | uniq | wc -l
    9989
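
For reference, uniq -d prints only the repeated lines and uniq -c prefixes each line with its count, so the duplicates can be inspected directly (assuming GNU coreutils):

$ sort 20k.txt | uniq -d | wc -l              # how many distinct words repeat
$ sort 20k.txt | uniq -c | sort -rn | head    # most-repeated words first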

I don't know. Is it a case-sensitivity issue?
Do sort and uniq keep only one of "this" and "This"?
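
One way to test that directly is to compare a case-sensitive pass against a case-insensitive one (a quick check, assuming GNU sort/uniq):

$ sort 20k.txt | uniq | wc -l          # case-sensitive dedupe
$ sort -f 20k.txt | uniq -i | wc -l    # case-insensitive dedupe; a lower count here would point to casing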

It's not a case issue. It's exact duplicates. Check using any random dedupe tool.

[screenshot: dedupe tool output showing the duplicated entries]

Apparently "Word" is in there 9 times.
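
That's easy to confirm from the shell by counting exact whole-line matches (assuming the entry really is the capitalized "Word"):

$ grep -cx 'Word' 20k.txt    # -x matches the whole line, -c counts matching lines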

It seems that two different sources were combined into 20k.txt.
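
If so, the overlap with the 10k list should show it. comm can compare the two files once they're sorted (a rough check, not a confirmed explanation):

$ comm -12 <(sort -u 20k.txt) <(sort -u google-10000-english-usa.txt) | wc -l    # words common to both files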

I checked the frequency rankings of this list using 20k.txt, and the result is below.

[plot: frequency vs. rank for 20k.txt]

The original count_1w.txt, by contrast, shows a straight line.

[plot: frequency vs. rank for the original count_1w.txt]
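
To reproduce a plot like this, the ranks and counts can be joined with awk, assuming count_1w.txt is the usual one-word-and-count-per-line file (a sketch; words missing from count_1w.txt will print an empty count):

$ # rank = line number in 20k.txt, count looked up from count_1w.txt
$ awk 'NR==FNR { c[$1] = $2; next } { print FNR, c[$1] }' count_1w.txt 20k.txt > rank_vs_count.dat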


Great catch - not sure why the original source has duplicates. I appreciate the fix.
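
For anyone applying the same fix locally, an order-preserving dedupe is a one-liner; awk keeps the first occurrence of each line and drops later repeats without re-sorting the list:

$ awk '!seen[$0]++' 20k.txt > 20k-dedup.txt
$ wc -l < 20k-dedup.txt    # should now match the sort | uniq count above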