Roubst04 (Disk4/5) manifest

lintool opened this issue 5 years ago

Jimmy Lin commented 5 years ago

Attached is the output of $ find . -type f | sort | xargs md5sum

Please let me know if your copy is different in non-trivial ways (e.g., name casing).

disk45.md5.txt
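A minimal sketch of how two such manifests could be compared while ignoring name casing. It assumes the standard md5sum line layout (a 32-character digest, a two-character separator, then the path). The names `Entry`, `parse_md5_line`, `to_lower`, and `differing_paths` are hypothetical and not part of any tool mentioned in this thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <map>
#include <string>
#include <vector>

// One manifest entry: the MD5 hex digest and the path that follows it.
struct Entry {
    std::string digest;
    std::string path;
};

// Parse one line of md5sum output ("<digest>  <path>").
// Returns false when the line is too short or the separator is missing.
bool parse_md5_line(const std::string& line, Entry& entry) {
    if (line.size() < 35 || line[32] != ' ') return false;
    entry.digest = line.substr(0, 32);
    entry.path = line.substr(34);  // skip the digest and the two-character separator
    assert(entry.digest.size() == 32);
    return true;
}

// Lower-case a path so that FR941007.1Z and fr941007.1z compare equal.
std::string to_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Return the reference paths that are missing from `mine` or whose digest
// differs, matching file names case-insensitively.
std::vector<std::string> differing_paths(const std::vector<std::string>& reference,
                                         const std::vector<std::string>& mine) {
    std::map<std::string, std::string> local;  // lower-cased path -> digest
    Entry e;
    for (const auto& line : mine)
        if (parse_md5_line(line, e)) local[to_lower(e.path)] = e.digest;

    std::vector<std::string> diff;
    for (const auto& line : reference) {
        if (!parse_md5_line(line, e)) continue;
        const auto it = local.find(to_lower(e.path));
        if (it == local.end() || it->second != e.digest) diff.push_back(e.path);
    }
    return diff;
}
```

An empty result from `differing_paths` would mean that the two copies differ at most in name casing.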

Antonio Mallia · Answer 1 · Mon Apr 08 2019 13:04:03 GMT+0800 (China Standard Time)

My copy has only 4 files: fbis.gz fr.gz ft.gz latimes.gz

As far as I know, Robust04 does not contain CR. From the TREC website:

The document collection for the Robust track is the set of documents on both TREC Disks 4 and 5 minus the Congressional Record on disk 4.

    Source                   # Docs     Size (MB)
    Financial Times          210,158    564
    Federal Register 94       55,630    395
    FBIS, disk 5             130,471    470
    LA Times                 131,896    475

    Total Collection:        528,155    1904

Source: https://trec.nist.gov/data/robust/04.guidelines.html
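The figures in the table add up, which gives a simple sanity check for an index: a Robust04 index that includes nothing extra should report 528,155 documents. A small sketch of that check follows; the struct and function names are hypothetical, and only the numbers come from the table above.

```cpp
#include <cassert>
#include <vector>

struct Source {
    const char* name;
    long docs;
    long size_mb;
};

// Per-source figures copied from the guidelines table quoted above.
const std::vector<Source> kRobust04Sources = {
    {"Financial Times", 210158, 564},
    {"Federal Register 94", 55630, 395},
    {"FBIS, disk 5", 130471, 470},
    {"LA Times", 131896, 475},
};

long total_docs(const std::vector<Source>& sources) {
    long total = 0;
    for (const auto& s : sources) total += s.docs;
    return total;
}

long total_size_mb(const std::vector<Source>& sources) {
    long total = 0;
    for (const auto& s : sources) total += s.size_mb;
    return total;
}

// True when an index's document count equals the collection total.
bool has_expected_doc_count(long indexed_docs) {
    assert(indexed_docs >= 0);
    return indexed_docs == total_docs(kRobust04Sources);
}
```

A count above the total would suggest that extra material, such as CR, was indexed as well.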

Jimmy Lin · Answer 2 · Mon Apr 08 2019 19:31:03 GMT+0800 (China Standard Time)

Yes, the disks had CR on them, but CR is not part of the evaluation. What I've uploaded is the manifest of the complete disks... I'm assuming systems will suppress CR themselves...

Antonio Mallia · Answer 3 · Thu Apr 11 2019 01:22:30 GMT+0800 (China Standard Time)

I have been thinking about this and I believe it will simplify our work if we can assume that whatever files are contained in the Robust04 folder are the only ones that are actually needed.

For example, if the collection name provided is Robust04, I would expect to have a folder /input/collections/Robust04 which contains only the .gz files needed (any number of files) and does not contain anything related to CR.

In the following examples, Jassv2 indexes file by file, while Anserini indexes a whole folder. Naturally Anserini will have a bigger index, but only because it is indexing more than needed (not really fair, I guess...).

https://github.com/osirrc2019/jassv2-docker/blob/15d106970d88d2807621f5fec7b9d0acfcca9da2/index_robust04#L7

https://github.com/osirrc2019/anserini-docker/blob/e7ede77ffa73f5f0092e67576ec074b7f27432b7/index#L19

Jimmy Lin · Answer 4 · Thu Apr 11 2019 01:32:42 GMT+0800 (China Standard Time)

But the potential issue is that this would make it harder to convey the contents of the directory. We can't share the files directly, but we can assume that everyone can get hold of the data from NIST...

Antonio Mallia · Answer 5 · Thu Apr 11 2019 01:33:56 GMT+0800 (China Standard Time)

This is fine as long as we know what the structure is... How about we add it in the Readme?

Jimmy Lin · Answer 6 · Thu Apr 11 2019 01:35:12 GMT+0800 (China Standard Time)

Can you take the manifest attached to this issue, find somewhere reasonable in the repo to put it, and send a PR?

Antonio Mallia · Answer 7 · Thu Apr 11 2019 01:45:54 GMT+0800 (China Standard Time)

I am very confused by the provided list of files. I am wondering if we can use a newer version for this workshop.

Here are a couple of odd examples:

what is 1z or 0z?

./disk4/fr94/10/fr941007.1z
./disk4/fr94/10/fr941007.2z
./disk4/fr94/10/fr941011.0z

do we need to index C files? I believe this is auxiliary data, so probably not, but do we really need to have it there then?

./disk4/fr94/aux/frcheck.c

is this a readme or an actual file that needs to be indexed?

./disk4/cr/hfiles/readmeh.z

Jimmy Lin · Answer 8 · Thu Apr 11 2019 01:54:23 GMT+0800 (China Standard Time)

Hrm. This is what I have in my copy (copied from original disks 4+5)... can someone else, e.g., @andrewtrotman, who also has access to the original disks, verify?

I run uncompress and it seems to work fine...

$ uncompress -c fr941003.0z | head
<DOC>
<DOCNO> FR941003-0-00001 </DOCNO>
<PARENT> FR941003-0-00001 </PARENT>
<TEXT>
 
<!-- PJG FTAG 4700 -->

<!-- PJG STAG 4700 -->

<!-- PJG ITAG l=90 g=1 f=1 -->
...

Arjen P. de Vries · Answer 9 · Thu Apr 11 2019 03:23:47 GMT+0800 (China Standard Time)

Yes the cdroms had compressed files (.Z)

I can check later. I guess some ppl just got the collection somehow in a different distribution format...

Jimmy Lin · Answer 10 · Thu Apr 11 2019 03:25:53 GMT+0800 (China Standard Time)

@arjenpdevries can you check if your copy has the weird file names?

Arjen P. de Vries · Answer 11 · Thu Apr 11 2019 03:49:19 GMT+0800 (China Standard Time)

At least it is not called roubst :-)

My copy has exactly the same list of files (or more), validated using:

ln -s TREC_VOL5 disk5
ln -s TREC_VOL_4 disk4
cut -d ' ' -f3 disk45.md5.txt | xargs ls > /dev/null

Note that the cdroms had weird inconsistent labels (trying to prove I'm an old dog).

Jimmy Lin · Answer 12 · Thu Apr 11 2019 03:53:02 GMT+0800 (China Standard Time)

@amallia does this address your concerns? just plow through using uncompress and you should be fine...?

Arjen P. de Vries · Answer 13 · Thu Apr 11 2019 04:13:06 GMT+0800 (China Standard Time)

PS:

[arjen@apc TREC]$ zcat ./disk4/cr/hfiles/readmeh.z
A Note to the User

The material on this disk is copyrighted and is subject to the terms and 
conditions of the TREC-96 Information-Retrieval Text Research Collection User 
Agreement, which must be signed in order to obtain a copy of the CD-ROM on 
which this data is to be found.

The changes between the original material as it came from the publisher and the 
version on this disk is detailed in the following file: readmeh.

[...]

The datasets have all been compressed using the UNIX compress utility and are 
stored in chunks of about 1 megabyte each (uncompressed size).

[..]

Special thanks should go to Dean Wilder at the Library of Congress for 
providing the data.

Arjen P. de Vries · Answer 14 · Thu Apr 11 2019 04:18:36 GMT+0800 (China Standard Time)

I do not think there is an easy rule that says "newsfile" or "readme / other" based on the filename.

Antonio Mallia · Answer 15 · Thu Apr 11 2019 05:26:45 GMT+0800 (China Standard Time)

I do not think there is an easy rule that says "newsfile" or "readme / other" based on the filename.

This one was my main concern, but I guess I can index everything...at least for now.
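Since the file name does not tell the two kinds apart, one option is to look at the decompressed content instead. The sketch below rests on an assumption taken only from the samples in this thread: the fr941003.0z output starts with `<DOC>`, while readmeh.z starts with plain prose. Whether every document file in the collection starts this way has not been verified here, and `looks_like_trec_docs` is a hypothetical name.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Heuristic: treat decompressed text as a TREC document file when its first
// non-whitespace characters are "<DOC>".
bool looks_like_trec_docs(const std::string& text) {
    std::size_t i = 0;
    while (i < text.size() && std::isspace(static_cast<unsigned char>(text[i]))) ++i;
    return text.compare(i, 5, "<DOC>") == 0;
}

// Usage example built from the two samples shown in this thread.
void example_usage() {
    assert(looks_like_trec_docs("<DOC>\n<DOCNO> FR941003-0-00001 </DOCNO>\n"));
    assert(!looks_like_trec_docs("A Note to the User\n"));
}
```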

andrewtrotman · Answer 16 · Thu Apr 11 2019 05:50:26 GMT+0800 (China Standard Time)

Well, yes and no. The original filenames on the CD-ROMs are in uppercase (on my CD-ROMs). Since uncompress requires an uppercase .Z extension, I can’t use uncompress on the files with the same names as in the manifest Jimmy sent. So if I copy fr941003.0z to fr941003.0z.Z and then run uncompress -c fr941003.0z.Z | head, I get the same output as Jimmy.

Following on this thread, the directories include CR, which we must exclude for Robust04. They also include readme and DTD files and a load of other gunk, which we must exclude.

For ATIRE I use the following file list to index the collection without any of the other gunk:

$COLLECTION/disk4/fr94/01
$COLLECTION/disk4/fr94/02
$COLLECTION/disk4/fr94/03
$COLLECTION/disk4/fr94/04
$COLLECTION/disk4/fr94/05
$COLLECTION/disk4/fr94/06
$COLLECTION/disk4/fr94/07
$COLLECTION/disk4/fr94/08
$COLLECTION/disk4/fr94/09
$COLLECTION/disk4/fr94/10
$COLLECTION/disk4/fr94/11
$COLLECTION/disk4/fr94/12
$COLLECTION/disk4/ft/ft911
$COLLECTION/disk4/ft/ft921
$COLLECTION/disk4/ft/ft922
$COLLECTION/disk4/ft/ft923
$COLLECTION/disk4/ft/ft924
$COLLECTION/disk4/ft/ft931
$COLLECTION/disk4/ft/ft932
$COLLECTION/disk4/ft/ft933
$COLLECTION/disk4/ft/ft934
$COLLECTION/disk4/ft/ft941
$COLLECTION/disk4/ft/ft942
$COLLECTION/disk4/ft/ft943
$COLLECTION/disk4/ft/ft944
$COLLECTION/disk5/fbis/fb*
$COLLECTION/disk5/latimes/la*

In a separate post I’ll send the C++ source code that I use to uncompress the files before processing. I do it all in a single pipeline in the ATIRE indexing process: I read the .z (or .0z or .1z, etc.) file, uncompress it, break it into documents, and index each one.

Andrew.
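The file list above can also be expressed as a path filter applied to manifest entries. This is only a sketch: it assumes lower-case paths relative to the collection root (as in the manifest), and it treats everything below the listed fr94 and ft directories as indexable. The function names are hypothetical and are not ATIRE code.

```cpp
#include <cassert>
#include <cctype>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Split a path on '/', dropping empty components and "." components.
std::vector<std::string> split_path(const std::string& path) {
    std::vector<std::string> parts;
    std::stringstream ss(path);
    std::string part;
    while (std::getline(ss, part, '/'))
        if (!part.empty() && part != ".") parts.push_back(part);
    return parts;
}

// True for a two-digit month directory name, "01" through "12".
bool is_month_dir(const std::string& s) {
    if (s.size() != 2 || !std::isdigit(static_cast<unsigned char>(s[0])) ||
        !std::isdigit(static_cast<unsigned char>(s[1])))
        return false;
    const int month = (s[0] - '0') * 10 + (s[1] - '0');
    return month >= 1 && month <= 12;
}

bool starts_with(const std::string& s, const std::string& prefix) {
    return s.compare(0, prefix.size(), prefix) == 0;
}

// True when the path falls under one of the entries in the file list above.
bool is_robust04_file(const std::string& path) {
    static const std::set<std::string> ft_dirs = {
        "ft911", "ft921", "ft922", "ft923", "ft924", "ft931", "ft932",
        "ft933", "ft934", "ft941", "ft942", "ft943", "ft944"};
    const std::vector<std::string> p = split_path(path);
    if (p.size() < 3) return false;
    if (p[0] == "disk4" && p[1] == "fr94") return p.size() > 3 && is_month_dir(p[2]);
    if (p[0] == "disk4" && p[1] == "ft") return p.size() > 3 && ft_dirs.count(p[2]) > 0;
    if (p[0] == "disk5" && p[1] == "fbis") return p.size() == 3 && starts_with(p[2], "fb");
    if (p[0] == "disk5" && p[1] == "latimes") return p.size() == 3 && starts_with(p[2], "la");
    return false;
}

// Usage example with paths quoted earlier in this thread.
void example_usage() {
    assert(is_robust04_file("./disk4/fr94/10/fr941007.1z"));
    assert(!is_robust04_file("./disk4/fr94/aux/frcheck.c"));
    assert(!is_robust04_file("./disk4/cr/hfiles/readmeh.z"));
}
```

Copies that kept the upper-case names from the CD-ROMs would need the path lower-cased before the check.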

andrewtrotman · Answer 17 · Thu Apr 11 2019 05:51:15 GMT+0800 (China Standard Time)

Here’s the C++ code I use to turn .Z files into text:

/*
  unlzw version 1.4, 22 August 2015

  Copyright (C) 2014, 2015 Mark Adler

  This software is provided 'as-is', without any express or implied
  warranty. In no event will the authors be held liable for any damages
  arising from the use of this software.

  Permission is granted to anyone to use this software for any purpose,
  including commercial applications, and to alter it and redistribute it
  freely, subject to the following restrictions:

  1. The origin of this software must not be misrepresented; you must not
     claim that you wrote the original software. If you use this software
     in a product, an acknowledgment in the product documentation would be
     appreciated but is not required.
  2. Altered source versions must be plainly marked as such, and must not be
     misrepresented as being the original software.
  3. This notice may not be removed or altered from any source distribution.

  Mark Adler
  madler@alumni.caltech.edu
 */

/* Version history:
   1.0  28 Sep 2014  First version
   1.1   1 Oct 2014  Cast before shift of bit buffer for portability
                     Use fastest 32-bit type for bit buffer, uint_fast32_t
                     Use uint_least16_t in case a 16-bit type is not available
   1.2   3 Oct 2014  Clean up comments, consolidate return values
   1.3  20 Aug 2015  Assure no out-of-bounds access on invalid input
   1.4  22 Aug 2015  Return uncompressed data so far on error conditions
                     Be more permissive on where the input is allowed to end
 */

#include <stdlib.h>
#include <stdint.h>

/* Type for accumulating bits. 23 bits of the register are used to accumulate
   up to 16-bit symbols. */
typedef uint_fast32_t bits_t;

/* Double size_t variable n, saturating at the maximum size_t value. */
#define DOUBLE(n) \
    do { \
        size_t was = n; \
        n <<= 1; \
        if (n < was) \
            n = (size_t)0 - 1; \
    } while (0)

/* Decompress compressed data generated by the Unix compress utility (LZW
   compression, files with suffix .Z). Decompress in[0..inlen-1] to an
   allocated buffer (*out)[0..*outlen-1]. The length of the uncompressed data
   in the allocated buffer is returned in *outlen. unlzw() returns zero on
   success, negative if the compressed data is invalid, or 1 if out of memory.
   The negative return values are -1 for an invalid header, -2 if the first
   code is not a literal or if an invalid code is detected, and -3 if the
   stream ended in the middle of a code. -1 means that the data was not
   produced by Unix compress, -2 generally means random or corrupted data, and
   -3 generally means prematurely terminated data. If the decompression
   results in a proper zero-length output, then unlzw() returns zero, *outlen
   is zero, and *out is NULL. On error, any decompressed data up to that point
   is returned using *out and *outlen. */
static int unlzw(unsigned const char *in, size_t inlen,
                 unsigned char **out, size_t *outlen)
{
    unsigned bits;              /* current number of bits per code (9..16) */
    unsigned mask;              /* mask for current bits codes = (1<<bits)-1 */
    bits_t buf;                 /* bit buffer -- holds up to 23 bits */
    unsigned left;              /* bits left in buf (0..7 after code pulled) */
    size_t next;                /* index of next input byte in in[] */
    size_t mark;                /* index where last change in bits began */
    unsigned code;              /* code, table traversal index */
    unsigned max;               /* maximum bits per code for this stream */
    unsigned flags;             /* compress flags, then block compress flag */
    unsigned end;               /* last valid entry in prefix/suffix tables */
    unsigned prev;              /* previous code */
    unsigned final;             /* last character written for previous code */
    unsigned stack;             /* next position for reversed string */
    unsigned char *put;         /* allocated output buffer */
    size_t size;                /* size of put[] allocation */
    size_t have;                /* number of bytes of data in put[] */
    int ret = 0;                /* return code */
    /* memory for unlzw() -- the first 256 entries of prefix[] and suffix[] are
       never used, so could have offset the index but it's faster to waste a
       little memory */
    uint_least16_t prefix[65536];       /* index to LZW prefix string */
    unsigned char suffix[65536];        /* one-character LZW suffix */
    unsigned char match[65280 + 2];     /* buffer for reversed match */

    /* initialize output for error returns */
    *out = NULL;
    *outlen = 0;

    /* process the header */
    if (inlen < 3 || in[0] != 0x1f || in[1] != 0x9d)
        return -1;                          /* invalid header */
    flags = in[2];
    if (flags & 0x60)
        return -1;                          /* invalid header */
    max = flags & 0x1f;
    if (max < 9 || max > 16)
        return -1;                          /* invalid header */
    if (max == 9)                           /* 9 doesn't really mean 9 */
        max = 10;
    flags &= 0x80;                          /* true if block compress */

    /* clear table, start at nine bits per symbol */
    bits = 9;
    mask = 0x1ff;
    end = flags ? 256 : 255;

    /* set up: get the first 9-bit code, which is the first decompressed byte,
       but don't create a table entry until the next code */
    if (inlen == 3)
        return 0;                           /* zero-length input is ok */
    buf = in[3];
    if (inlen == 4)
        return -3;                          /* a partial code is not ok */
    buf += in[4] << 8;
    final = prev = buf & mask;              /* code */
    buf >>= bits;
    left = 16 - bits;
    if (prev > 255)
        return -2;                          /* first code must be a literal */

    /* we have output -- allocate and set up an output buffer four times the
       size of the input (Unix compress usually compresses less than 4:1, so
       this will avoid a reallocation most of the time) */
    size = inlen;
    DOUBLE(size);
    DOUBLE(size);
    put = (unsigned char *)malloc(size);
    if (put == NULL)
        return 1;
    put[0] = final;                         /* first decompressed byte */
    have = 1;

    /* decode codes */
    mark = 3;                               /* start of compressed data */
    next = 5;                               /* consumed five bytes so far */
    stack = 0;                              /* empty stack */
    while (next < inlen) {
        /* if the table will be full after this, increment the code size */
        if (end >= mask && bits < max) {
            /* flush unused input bits and bytes to next 8*bits bit boundary
               (this is a vestigial aspect of the compressed data format
               derived from an implementation that made use of a special VAX
               machine instruction!) */
            {
                unsigned rem = (next - mark) % bits;

                if (rem) {
                    rem = bits - rem;
                    if (rem >= inlen - next)
                        break;
                    next += rem;
                }
            }
            buf = 0;
            left = 0;

            /* mark this new location for computing the next flush */
            mark = next;

            /* increment the number of bits per symbol */
            bits++;
            mask <<= 1;
            mask++;
        }

        /* get a code of bits bits */
        buf += (bits_t)(in[next++]) << left;
        left += 8;
        if (left < bits) {
            if (next == inlen) {
                ret = -3;                   /* partial code (not ok) */
                break;
            }
            buf += (bits_t)(in[next++]) << left;
            left += 8;
        }
        code = buf & mask;
        buf >>= bits;
        left -= bits;

        /* process clear code (256) */
        if (code == 256 && flags) {
            /* flush unused input bits and bytes to next 8*bits bit boundary */
            {
                unsigned rem = (next - mark) % bits;

                if (rem) {
                    rem = bits - rem;
                    if (rem > inlen - next)
                        break;
                    next += rem;
                }
            }
            buf = 0;
            left = 0;

            /* mark this new location for computing the next flush */
            mark = next;

            /* go back to nine bits per symbol */
            bits = 9;                       /* initialize bits and mask */
            mask = 0x1ff;
            end = 255;                      /* empty table */
            continue;                       /* get next code */
        }

        /* process LZW code */
        {
            unsigned temp = code;           /* save the current code */

            /* special code to reuse last match */
            if (code > end) {
                /* Be picky on the allowed code here, and make sure that the
                   code we drop through (prev) will be a valid index so that
                   random input does not cause an exception. */
                if (code != end + 1 || prev > end) {
                    ret = -2;               /* invalid LZW code */
                    break;
                }
                match[stack++] = final;
                code = prev;
            }

            /* walk through linked list to generate output in reverse order */
            while (code >= 256) {
                match[stack++] = suffix[code];
                code = prefix[code];
            }
            match[stack++] = code;
            final = code;

            /* link new table entry */
            if (end < mask) {
                end++;
                prefix[end] = prev;
                suffix[end] = final;
            }

            /* set previous code for next iteration */
            prev = temp;
        }

        /* make room for the stack in the output */
        if (stack > size - have) {
            if (have + stack + 1 < have) {
                ret = 1;
                break;
            }
            do {
                DOUBLE(size);
            } while (stack > size - have);
            {
                unsigned char *mem = (unsigned char *)realloc(put, size);

                if (mem == NULL) {
                    ret = 1;
                    break;
                }
                put = mem;
            }
        }

        /* write output in forward order */
        do {
            put[have++] = match[--stack];
        } while (stack);

        /* stack is now empty (zero) for the next code */
    }

    /* return the decompressed data, first reducing the allocated memory */
    {
        unsigned char *mem = (unsigned char *)realloc(put, have);

        if (mem != NULL)
            put = mem;
    }
    *out = put;
    *outlen = have;
    return ret;
}

int unlzw(unsigned char **out, size_t *outlen, unsigned char *str, int str_len)
{
    const char *errmsg[] = {
        "Prematurely terminated compress stream",   /* -3 */
        "Corrupted compress stream",                /* -2 */
        "Not a Unix compress (.Z) stream",          /* -1 */
        "Unexpected return code",                   /* < -3 or > 1 */
        "Out of memory"                             /* 1 */
    };

    return unlzw(str, str_len, out, outlen);
}
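Because the manifest names end in .z, .0z, .1z and so on, the extension alone does not say whether a file is a compress stream. A small sketch of a header check that could run before calling unlzw() follows. It repeats the header tests visible in the code above (magic bytes 0x1f 0x9d, reserved flag bits clear, maximum code width of 9 to 16 bits). The function name `has_compress_header` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// True when the first three bytes look like a Unix compress (.Z) header,
// using the same conditions that unlzw() checks before decoding.
bool has_compress_header(const unsigned char* data, std::size_t len) {
    if (data == nullptr || len < 3) return false;
    if (data[0] != 0x1f || data[1] != 0x9d) return false;  // magic bytes
    const unsigned flags = data[2];
    if (flags & 0x60) return false;                        // reserved bits must be zero
    const unsigned max_bits = flags & 0x1f;                // maximum bits per code
    return max_bits >= 9 && max_bits <= 16;
}

// Usage example: a compress header versus a gzip header.
void example_usage() {
    const unsigned char compress_hdr[] = {0x1f, 0x9d, 0x90};
    const unsigned char gzip_hdr[] = {0x1f, 0x8b, 0x08};
    assert(has_compress_header(compress_hdr, sizeof compress_hdr));
    assert(!has_compress_header(gzip_hdr, sizeof gzip_hdr));
}
```

A file that fails this check would make unlzw() return -1, so the check mainly saves reading the whole file into memory first.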

Ryan Clancy · Answer 18 · Sat Jun 08 2019 00:09:13 GMT+0800 (China Standard Time)

Closing this as #94 is adding directory tree and hashes for all collections.