osirrc / jig

Jig for the Open-Source IR Replicability Challenge (OSIRRC)

Additional collections: NYT and Washington Post

lintool opened this issue

We're proposing to use the collections for TREC Common Core 17 and 18. The tarballs we've downloaded from NIST are as follows:

jimmylin@tuna:/tuna1/collections/newswire$ md5sum WashingtonPost.v2.tar.gz 
ce6e93f6ce9959b72c2de4f8d12089ab  WashingtonPost.v2.tar.gz
jimmylin@tuna:/tuna1/collections/newswire$ md5sum NYTcorpus.tar.gz 
09eed6502fc9c2e27ab247b675a783d5  NYTcorpus.tar.gz

Closing this, as we're documenting it in #94.

AFAIK the TREC Common Core 17 files came from LDC, not NIST. According to https://trec-core.github.io/2017/, it was the dataset https://catalog.ldc.upenn.edu/LDC2008T19. To make a long story short: The checksums are different, as is the filename provided by LDC:

pschaer@linux2:/datasets/NYT$ md5sum nyt_corpus_LDC2008T19.tgz
67a1bcf200c448424bf0fba34cef17b0  nyt_corpus_LDC2008T19.tgz

Are we talking about the same dataset? I am 99% sure... But the different filename and checksum left me a little suspicious.

How's this - let's verify the contents once you unpack...

$ find . -type f | sort | xargs md5sum > ~/NYTcorpus.md5sum.txt

NYTcorpus.md5sum.txt

See attached above.

Strange... In my version (fresh download from LDC) there is an additional folder 01 with a different timestamp:

pschaer@linux2:/datasets/NYT/nyt_corpus/data/1987$ ll
total 154444
drwxr-sr-x 2 pschaer datasets     4096 Dec 22  2009 01
-rwxr-xr-x 1 pschaer datasets 13128477 Aug  5  2008 01.tgz
...
-rwxr-xr-x 1 pschaer datasets 13222894 Aug  5  2008 12.tgz
pschaer@linux2:/datasets/NYT/nyt_corpus/data/1987$ cd 01/
pschaer@linux2:/datasets/NYT/nyt_corpus/data/1987/01$ ll
total 61940
-rwxr--r-- 1 pschaer datasets 63426560 Aug  5  2008 01.tar

The contents of 01.tar are the same as those of 01.tgz.

Anyhow... The rest of the hashes are the same:

schaer@touchbot:~/Downloads$ diff NYTcorpus.md5sum.txt NYTcorpus.md5sum-2.txt
1d0
< 037430a068a5241135f8d3284091b3c5  ./data/1987/01/01.tar
schaer@touchbot:~/Downloads$

BUT: Maybe we should think about including hashes to ensure the correctness of test collections. Think of WAPost v1 and v2, which could easily be confused otherwise.