Additional collections: NYT and Washington Post
lintool opened this issue · comments
We're proposing to use the collections for TREC Common Core 17 and 18. The tarballs we've downloaded from NIST are as follows:
jimmylin@tuna:/tuna1/collections/newswire$ md5sum WashingtonPost.v2.tar.gz
ce6e93f6ce9959b72c2de4f8d12089ab WashingtonPost.v2.tar.gz
jimmylin@tuna:/tuna1/collections/newswire$ md5sum NYTcorpus.tar.gz
09eed6502fc9c2e27ab247b675a783d5 NYTcorpus.tar.gz
Closing this as we're documenting it in #94.
AFAIK the TREC Common Core 17 files came from LDC, not NIST. According to https://trec-core.github.io/2017/, it was the dataset https://catalog.ldc.upenn.edu/LDC2008T19. To make a long story short: The checksums are different, as is the filename provided by LDC:
pschaer@linux2:/datasets/NYT$ md5sum nyt_corpus_LDC2008T19.tgz
67a1bcf200c448424bf0fba34cef17b0 nyt_corpus_LDC2008T19.tgz
Are we talking about the same dataset? I am 99% sure... But the different filename and checksum left me a little suspicious.
How's this - let's verify the contents once you unpack...
$ find . -type f | sort | xargs md5sum > ~/NYTcorpus.md5sum.txt
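For anyone who would rather not rely on xargs (which can trip over filenames containing spaces), here is a rough Python sketch that should produce a comparable per-file manifest. The function names (`md5_of_file`, `build_manifest`, `write_manifest`) are made up for illustration and are not part of any existing tool. Sorting is by plain string order, which may differ from a locale-aware `sort`.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Return sorted (relative_path, md5) pairs for regular files under root."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks so the result is closer to `find . -type f`.
            if os.path.islink(full) or not os.path.isfile(full):
                continue
            rel = "./" + os.path.relpath(full, root).replace(os.sep, "/")
            entries.append((rel, md5_of_file(full)))
    return sorted(entries)


def write_manifest(root, out_path):
    """Write one '<hash>  <path>' line per file, as md5sum does.

    Keep out_path outside root, otherwise the manifest may list itself.
    """
    with open(out_path, "w", encoding="utf-8") as out:
        for rel, digest in build_manifest(root):
            out.write(f"{digest}  {rel}\n")
```

Something like `write_manifest(".", os.path.expanduser("~/NYTcorpus.md5sum.txt"))`, run from the unpacked corpus directory, would then play the role of the one-liner above.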
See attached above.
Strange... In my version (fresh download from LDC) there is an additional folder 01 with a different timestamp:
pschaer@linux2:/datasets/NYT/nyt_corpus/data/1987$ ll
total 154444
drwxr-sr-x 2 pschaer datasets 4096 Dec 22 2009 01
-rwxr-xr-x 1 pschaer datasets 13128477 Aug 5 2008 01.tgz
...
-rwxr-xr-x 1 pschaer datasets 13222894 Aug 5 2008 12.tgz
pschaer@linux2:/datasets/NYT/nyt_corpus/data/1987$ cd 01/
pschaer@linux2:/datasets/NYT/nyt_corpus/data/1987/01$ ll
total 61940
-rwxr--r-- 1 pschaer datasets 63426560 Aug 5 2008 01.tar
The content of 01.tar is the same as that of 01.tgz.
Anyhow... The rest of the hashes are the same:
schaer@touchbot:~/Downloads$ diff NYTcorpus.md5sum.txt NYTcorpus.md5sum-2.txt
1d0
< 037430a068a5241135f8d3284091b3c5 ./data/1987/01/01.tar
schaer@touchbot:~/Downloads$
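If a plain diff gets noisy (for example when the two manifests are sorted differently), a path-keyed comparison may be easier to read. Below is a minimal sketch, assuming both files use the md5sum text format of one hash followed by a path per line. `parse_manifest` and `compare_manifests` are hypothetical helper names, not existing tooling.

```python
def parse_manifest(text):
    """Parse md5sum-style lines ('<hash>  <path>') into a {path: hash} dict."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first run of whitespace only, so paths may contain spaces.
        digest, path = line.split(None, 1)
        entries[path] = digest
    return entries


def compare_manifests(left, right):
    """Return (only_in_left, only_in_right, changed) as sorted lists of paths."""
    left_paths, right_paths = set(left), set(right)
    only_left = sorted(left_paths - right_paths)
    only_right = sorted(right_paths - left_paths)
    changed = sorted(p for p in left_paths & right_paths if left[p] != right[p])
    return only_left, only_right, changed


def compare_files(left_path, right_path):
    """Read two manifest files from disk and compare them."""
    with open(left_path, encoding="utf-8") as f:
        left = parse_manifest(f.read())
    with open(right_path, encoding="utf-8") as f:
        right = parse_manifest(f.read())
    return compare_manifests(left, right)
```

Calling `compare_files("NYTcorpus.md5sum.txt", "NYTcorpus.md5sum-2.txt")` on the two manifests from the diff above would be expected to report `./data/1987/01/01.tar` as present only in the first file, with nothing changed.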
BUT: Maybe we should think about including hashes to ensure the correctness of test collections. Think of WAPost v1 and v2, which could otherwise easily be confused.
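To make that suggestion a bit more concrete, here is one possible sketch of such a check. The table holds only the MD5 values quoted in this thread, keyed by archive filename. The names `KNOWN_MD5` and `check_archive` are invented for this example, and where such a table would actually live is left open.

```python
import hashlib
import os

# MD5 values as posted in this thread, keyed by archive filename.
KNOWN_MD5 = {
    "WashingtonPost.v2.tar.gz": "ce6e93f6ce9959b72c2de4f8d12089ab",
    "NYTcorpus.tar.gz": "09eed6502fc9c2e27ab247b675a783d5",
    "nyt_corpus_LDC2008T19.tgz": "67a1bcf200c448424bf0fba34cef17b0",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_archive(path, known=KNOWN_MD5):
    """Return 'ok', 'mismatch', or 'unknown' for the archive at path.

    'unknown' means no checksum is recorded under that filename.
    """
    expected = known.get(os.path.basename(path))
    if expected is None:
        return "unknown"
    return "ok" if md5_of_file(path) == expected else "mismatch"
```

Keying on the filename is a simplification. As the NYT case here shows, the same corpus can arrive under different names and with different archive-level checksums, so a per-file manifest of the unpacked contents is probably the more robust check.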