Robust04 (Disk4/5) manifest
lintool opened this issue · comments
Attached is the output of $ find . -type f | sort | xargs md5sum
Please let me know if your copy is different in non-trivial ways (e.g., name casing).
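For anyone who wants to regenerate the same listing without the shell pipeline, here is a minimal sketch that mirrors `find . -type f | sort | xargs md5sum` in Python (the function name `build_manifest` is mine, not part of any tool discussed here):

```python
import hashlib
import os

def build_manifest(root):
    """Walk `root`, hash every regular file with MD5, and return
    sorted "hash  ./relative/path" lines, mirroring the output of
    `find . -type f | sort | xargs md5sum`."""
    lines = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            rel = os.path.join(".", os.path.relpath(path, root))
            with open(path, "rb") as f:
                digest = hashlib.md5(f.read()).hexdigest()
            # md5sum separates hash and path with two spaces.
            lines.append(f"{digest}  {rel}")
    # Sort by path, as `find | sort` does before hashing.
    return sorted(lines, key=lambda line: line.split("  ", 1)[1])
```

Diffing two such manifests should surface exactly the kind of non-trivial differences (e.g., name casing) asked about above.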
My copy has only 4 files: `fbis.gz fr.gz ft.gz latimes.gz`. As far as I know, Robust04 does not contain `cr`. From the TREC website:

> The document collection for the Robust track is the set of documents on both TREC Disks 4 and 5, minus the Congressional Record on Disk 4.
| Source | # Docs | Size (MB) |
|---|---|---|
| Financial Times | 210,158 | 564 |
| Federal Register 94 | 55,630 | 395 |
| FBIS, disk 5 | 130,471 | 470 |
| LA Times | 131,896 | 475 |
| Total collection | 528,155 | 1,904 |
Source: https://trec.nist.gov/data/robust/04.guidelines.html
Yes, the disks had CR on them, but CR is not part of the evaluation. What I've uploaded is the manifest of the complete disks... I'm assuming systems will suppress CR themselves...
I have been thinking about this, and I believe it will simplify our work if we can assume that whatever files are contained in the Robust04 folder are the only ones actually needed. For example, if the collection name provided is Robust04, I would expect to have a folder /input/collections/Robust04 which contains only the .gz files needed (any number of files) and nothing related to cr.
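The convention above ("only the needed .gz files, nothing related to cr") could be checked automatically. A minimal sketch, assuming a flat or nested folder layout (the function name `collection_files` and the exact rules are my assumptions, not an agreed spec):

```python
import os

def collection_files(folder):
    """Split files under `folder` into those matching the expected
    layout (.gz files outside any cr/ subtree) and everything else."""
    keep, reject = [], []
    for dirpath, _, filenames in os.walk(folder):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), folder)
            rel = rel.replace(os.sep, "/")
            parts = rel.split("/")
            # Reject anything under a cr/ directory or not gzip-compressed.
            if "cr" in parts[:-1] or not name.endswith(".gz"):
                reject.append(rel)
            else:
                keep.append(rel)
    return sorted(keep), sorted(reject)
```

Running this before indexing would make the "folder contains exactly what is needed" assumption explicit rather than implicit.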
In the following examples, JASSv2 indexes on a file-by-file basis, while Anserini indexes on a folder basis. Naturally, Anserini will have a bigger index, but only because it is indexing more than needed (not really fair, I guess...).
But the potential issue is that this would make it harder to convey the contents of the directory. We can't share the files directly, but we can assume that everyone can get hold of the data from NIST...
This is fine as long as we know what the structure is... How about we add it in the Readme?
Can you take the manifest attached to this issue, find somewhere reasonable in the repo to put it, and send a PR?
I am very confused by the provided list of files. I am wondering if we can use a newer version for this workshop.
Here are a couple of odd examples:
- what are `1z` or `0z`?
  ```
  ./disk4/fr94/10/fr941007.1z
  ./disk4/fr94/10/fr941007.2z
  ./disk4/fr94/10/fr941011.0z
  ```
- do we need to index the `.c` files? I believe this is auxiliary data, so probably not, but do we really need to have it there then?
  ```
  ./disk4/fr94/aux/frcheck.c
  ```
- is this a readme or an actual file that needs to be indexed?
  ```
  ./disk4/cr/hfiles/readmeh.z
  ```
Hrm. This is what I have in my copy (copied from the original disks 4+5)... can someone else who also has access to the original disks, e.g., @andrewtrotman, verify?
I ran `uncompress` and it seems to work fine...

```
$ uncompress -c fr941003.0z | head
<DOC>
<DOCNO> FR941003-0-00001 </DOCNO>
<PARENT> FR941003-0-00001 </PARENT>
<TEXT>
<!-- PJG FTAG 4700 -->
<!-- PJG STAG 4700 -->
<!-- PJG ITAG l=90 g=1 f=1 -->
...
```
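Once the files are decompressed, pulling document identifiers out of the TREC SGML is straightforward. A minimal sketch, assuming input that looks like the `uncompress -c` sample above (the `<DOCNO>` tag format is taken from that sample; the function name `docnos` is mine):

```python
import re

# Matches e.g. "<DOCNO> FR941003-0-00001 </DOCNO>" from the sample output.
DOCNO_RE = re.compile(r"<DOCNO>\s*(\S+)\s*</DOCNO>")

def docnos(text):
    """Return all document identifiers found in a TREC SGML string."""
    return DOCNO_RE.findall(text)
```

Counting the identifiers per file against the table of document counts above would be one way to sanity-check a copy of the collection.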
Yes, the CD-ROMs had compressed files (.Z).
I can check later. I guess some people just got the collection in a different distribution format...
@arjenpdevries can you check if your copy has the weird file names?
At least it is not called roubst :-)
My copy has exactly the same list of files (or more), validated using:
```
ln -s TREC_VOL5 disk5
ln -s TREC_VOL_4 disk4
cut -d ' ' -f3 disk45.md5.txt | xargs ls > /dev/null
```
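The `cut | xargs ls` trick only checks that the listed files exist; re-hashing them would also catch corrupted copies. A sketch, assuming a `disk45.md5.txt`-style manifest with md5sum's "hash, two spaces, path" lines (the function name `check_manifest` is mine):

```python
import hashlib
import os

def check_manifest(manifest_path, root="."):
    """Re-hash each file named in an md5sum-style manifest and
    return (path, problem) pairs for missing or mismatched files."""
    problems = []
    with open(manifest_path) as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            expected, path = line.split("  ", 1)
            full = os.path.join(root, path)
            if not os.path.isfile(full):
                problems.append((path, "missing"))
                continue
            with open(full, "rb") as g:
                actual = hashlib.md5(g.read()).hexdigest()
            if actual != expected:
                problems.append((path, "hash mismatch"))
    return problems
```

An empty return value means the copy matches the manifest byte-for-byte.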
Note that the CD-ROMs had weird, inconsistent labels (trying to prove I'm an old dog).
@amallia does this address your concerns? Just plow through using deflate and you should be fine...?
PS:
```
[arjen@apc TREC]$ zcat ./disk4/cr/hfiles/readmeh.z
A Note to the User
The material on this disk is copyrighted and is subject to the terms and
conditions of the TREC-96 Information-Retrieval Text Research Collection User
Agreement, which must be signed in order to obtain a copy of the CD-ROM on
which this data is to be found.
The changes between the original material as it came from the publisher and the
version on this disk is detailed in the following file: readmeh.
[...]
The datasets have all been compressed using the UNIX compress utility and are
stored in chunks of about 1 megabyte each (uncompressed size).
[..]
Special thanks should go to Dean Wilder at the Library of Congress for
providing the data.
```
I do not think there is an easy rule that says "news file" or "readme / other" based on the filename.
This one was my main concern, but I guess I can index everything... at least for now.
Closing this as #94 is adding directory tree and hashes for all collections.