osirrc / jig

Jig for the Open-Source IR Replicability Challenge (OSIRRC)

Roubst04 (Disk4/5) manifest

lintool opened this issue

Attached is the output of $ find . -type f | sort | xargs md5sum

Please let me know if your copy is different in non-trivial ways (e.g., name casing).

disk45.md5.txt
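
For anyone who wants to compare their copy against the attached manifest, here is a minimal Python sketch. It assumes the manifest is plain `md5sum` output (hash, whitespace, then a path starting with `./`) and that it is run against the directory the `find` was run from. The function names and the `ignore_case` switch are made up for illustration; they are not part of jig.

```python
import hashlib
from pathlib import Path


def parse_manifest(text):
    """Parse md5sum output lines ("<hash>  <path>") into {path: hash}."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        digest, path = line.split(maxsplit=1)
        # md5sum prefixes the name with '*' in binary mode; drop it if present.
        entries[path.lstrip("*")] = digest.lower()
    return entries


def md5_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are not read into memory at once."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def compare(root, manifest, ignore_case=False):
    """Return (missing, extra, mismatched) path lists relative to the manifest."""
    root = Path(root)
    norm = (lambda p: p.lower()) if ignore_case else (lambda p: p)
    local = {}
    for p in root.rglob("*"):
        if p.is_file():
            local[norm("./" + p.relative_to(root).as_posix())] = p
    expected = {norm(k): v for k, v in manifest.items()}
    missing = sorted(set(expected) - set(local))
    extra = sorted(set(local) - set(expected))
    mismatched = sorted(
        k for k in set(expected) & set(local) if md5_of(local[k]) != expected[k]
    )
    return missing, extra, mismatched
```

Something like `compare(".", parse_manifest(open("disk45.md5.txt").read()), ignore_case=True)` would then report differences while treating name casing as trivial.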

My copy has only 4 files: fbis.gz fr.gz ft.gz latimes.gz

As far as I know Robust04 does not contain cr. From TREC website:

The document collection for the Robust track is the set of documents on both TREC Disks 4 and 5 minus the Congressional Record on disk 4.

    Source                   # Docs    Size (MB)
    Financial Times         210,158       564
    Federal Register 94      55,630       395
    FBIS, disk 5            130,471       470
    LA Times                131,896       475

    Total Collection:       528,155      1904

Source: https://trec.nist.gov/data/robust/04.guidelines.html

Yes, the disks had CR on them, but CR is not part of the evaluation. What I've uploaded is the manifest of the complete disks... I'm assuming systems will suppress CR themselves...
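
To make "suppress CR themselves" concrete, here is a rough sketch of how a system could filter the manifest paths before indexing. It only assumes that the Congressional Record lives under a directory named `cr` (as in `./disk4/cr/...` in the manifest); the function name and the `skip_dirs` parameter are hypothetical.

```python
from pathlib import PurePosixPath


def keep_for_robust04(paths, skip_dirs=("cr",)):
    """Drop every path that has one of skip_dirs as a directory component."""
    skip = {d.lower() for d in skip_dirs}
    kept = []
    for p in paths:
        # Look at directory components only, not the file name itself.
        dirs = PurePosixPath(p).parts[:-1]
        if not any(part.lower() in skip for part in dirs):
            kept.append(p)
    return kept
```

Passing `skip_dirs=("cr", "aux")` would drop auxiliary directories as well, if a system prefers not to see them.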

I have been thinking about this, and I believe it would simplify our work if we could assume that the files in the Robust04 folder are exactly the ones that are needed.

For example, if the collection name provided is Robust04, I would expect a folder /input/collections/Robust04 that contains only the .gz files needed (any number of files) and nothing related to cr.

In the following examples, Jassv2 indexes file by file, while Anserini indexes a whole folder. Naturally Anserini will end up with a bigger index, but only because it indexes more than is needed (not really a fair comparison, I guess...).

https://github.com/osirrc2019/jassv2-docker/blob/15d106970d88d2807621f5fec7b9d0acfcca9da2/index_robust04#L7

https://github.com/osirrc2019/anserini-docker/blob/e7ede77ffa73f5f0092e67576ec074b7f27432b7/index#L19

But the potential issue is that this would make it harder to convey the contents of the directory. We can't share the files directly, but we can assume that everyone can get hold of the data from NIST...

This is fine as long as we know what the structure is... How about we add it in the Readme?

Can you take the manifest attached to this issue, find somewhere reasonable in the repo to put it, and send a PR?

I am very confused by the provided list of files. I am wondering if we can use a newer version for this workshop.

Here a couple of odd examples:

  • what is 1z or 0z?
./disk4/fr94/10/fr941007.1z
./disk4/fr94/10/fr941007.2z
./disk4/fr94/10/fr941011.0z
  • do we need to index C files? I believe this is auxiliary data, so probably not, but do we really need to have it there then?
./disk4/fr94/aux/frcheck.c
  • is this a readme or an actual file that needs to be indexed?
./disk4/cr/hfiles/readmeh.z

Hrm. This is what I have in my copy (copied from the original disks 4+5)... can someone else, e.g., @andrewtrotman, who also has access to the original disks, verify?

I ran uncompress and it seems to work fine...

$ uncompress -c fr941003.0z | head
<DOC>
<DOCNO> FR941003-0-00001 </DOCNO>
<PARENT> FR941003-0-00001 </PARENT>
<TEXT>
 
<!-- PJG FTAG 4700 -->

<!-- PJG STAG 4700 -->

<!-- PJG ITAG l=90 g=1 f=1 -->
...

Yes the cdroms had compressed files (.Z)
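
Since the extensions (.0z, .1z, .z) do not say much on their own, one option is to look at the first two bytes instead. A small sketch, assuming the usual magic numbers (0x1f 0x9d for UNIX compress, 0x1f 0x8b for gzip); the function name is made up. Note that, as far as I know, Python's gzip module does not read compress (.Z) data, so actual decompression would still go through an external tool such as uncompress or zcat.

```python
COMPRESS_MAGIC = b"\x1f\x9d"  # UNIX compress (LZW), the format of the original CD-ROM files
GZIP_MAGIC = b"\x1f\x8b"      # gzip, e.g. a repackaged fbis.gz


def sniff_compression(path):
    """Return 'compress', 'gzip', or 'none' based on the first two bytes of the file."""
    with open(path, "rb") as f:
        head = f.read(2)
    if head == COMPRESS_MAGIC:
        return "compress"
    if head == GZIP_MAGIC:
        return "gzip"
    return "none"
```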

I can check later. I guess some ppl just got the collection somehow in different distribution format...

@arjenpdevries can you check if your copy has the weird file names?

At least it is not called roubst :-)

My copy has exactly the same list of files (or more), validated using:

ln -s TREC_VOL5 disk5
ln -s TREC_VOL_4 disk4
cut -d ' ' -f3 disk45.md5.txt | xargs ls > /dev/null 

Note that the cdroms had weird inconsistent labels (trying to prove I'm an old dog).

@amallia does this address your concerns? just plow through using deflate and you should be fine...?

PS:

[arjen@apc TREC]$ zcat ./disk4/cr/hfiles/readmeh.z
A Note to the User

The material on this disk is copyrighted and is subject to the terms and 
conditions of the TREC-96 Information-Retrieval Text Research Collection User 
Agreement, which must be signed in order to obtain a copy of the CD-ROM on 
which this data is to be found.

The changes between the original material as it came from the publisher and the 
version on this disk is detailed in the following file: readmeh.

[...]

The datasets have all been compressed using the UNIX compress utility and are 
stored in chunks of about 1 megabyte each (uncompressed size).

[..]

Special thanks should go to Dean Wilder at the Library of Congress for 
providing the data.

I do not think there is an easy rule that says "newsfile" or "readme / other" based on the filename.

> I do not think there is an easy rule that says "newsfile" or "readme / other" based on the filename.

This one was my main concern, but I guess I can index everything...at least for now.
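
If indexing everything turns out to be a problem later, a content-based check might be an alternative to a filename rule. This is only a sketch: it assumes that document files start with a <DOC> tag once decompressed (as in the fr941003.0z output above) and that `uncompress -c` is available on the machine. Both function names are hypothetical.

```python
import subprocess


def looks_like_trec_docs(head):
    """True if the decompressed bytes start with a <DOC> tag (ignoring leading whitespace)."""
    return head.lstrip().upper().startswith(b"<DOC>")


def decompressed_head(path, n=512):
    """Read the first n decompressed bytes via `uncompress -c` (assumed to be installed)."""
    proc = subprocess.Popen(["uncompress", "-c", str(path)], stdout=subprocess.PIPE)
    try:
        return proc.stdout.read(n)
    finally:
        # Stop the decompressor early; only the head of the file is needed.
        proc.stdout.close()
        proc.kill()
        proc.wait()
```

With this, `looks_like_trec_docs(decompressed_head(path))` would separate files like fr941003.0z from readmeh.z without looking at their names.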

Closing this as #94 is adding directory tree and hashes for all collections.