osirrc / jig

Jig for the Open-Source IR Replicability Challenge (OSIRRC)

Collection format

amallia opened this issue · comments

According to the command below, the only information passed to the indexer is the collection name and the path where the collection can be found.
I am wondering whether it makes sense to also pass the format of the collection.

python run.py prepare \
    --repo rclancy/anserini-test --tag latest \
    --collections [name]=[path] [name]=[path] ...

Also, what are the collections used? I can see from https://github.com/osirrc2019/jig/blob/master/init.sh that they are going to be core17, core18, and robust04. Will they actually be named this way?

What kind of formatting information were you thinking of? The collections are distributed in a standard format. Are you referring to whether we pass you the archive file or the extracted archive?

No, I was thinking that if we use standard formats like warc, trectext, or trecweb, we don't really need to rely on the collection name to tell the indexer which parser to use. This makes everything a bit more abstract and solid.
If the format is not provided, we have to implement a mapping between a collection name and its format, e.g. core18->trectext. This looks very fragile: if the name gets changed from core18 to wapo, everybody has to change their Docker images.
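
To make the point concrete, here is a minimal sketch of choosing a parser by format rather than by collection name. The parser names and the `select_parser` helper are hypothetical placeholders, not part of the jig or of any existing image.

```python
# Hypothetical registry keyed by collection *format*; the parser names are
# placeholders for whatever an image actually uses.
PARSERS_BY_FORMAT = {
    "trectext": "TrecTextParser",
    "trecweb": "TrecWebParser",
    "warc": "WarcParser",
}


def select_parser(fmt):
    """Return the parser registered for a format, ignoring the collection name."""
    try:
        return PARSERS_BY_FORMAT[fmt]
    except KeyError:
        raise ValueError(f"unsupported collection format: {fmt!r}") from None


# Renaming a collection (core18 -> wapo) does not affect the lookup,
# because only the format is consulted.
for name, fmt in [("core18", "trectext"), ("wapo", "trectext")]:
    print(name, "->", select_parser(fmt))
```

With a name-based mapping, the second entry would fail until every image added a `wapo` key.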

Sounds good to me... I'll add it this week.

What is the best way to support multiple collections for this? Each may have a different format. We could do something along these lines:

We modify the --collections parameter for the prepare command from --collections [name]=[path] ... to --collections [name]=[path]=[format] ....

As for how we pass this to the image being run, there are a couple of different ways to do it:

  • we pass --collections [name]=[format] ... instead of --collections [name]
  • we pass an extra param --format with a mapping from name to format

The goal would be to have the least friction for a developer creating an image - thoughts? Alternatives?
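
A rough sketch of the first option, assuming the jig receives `[name]=[path]=[format]` and forwards `[name]=[format]` to the image. The function names and the example paths are hypothetical, and this is not the code that was eventually merged.

```python
def parse_collection_spec(spec):
    """Split a 'name=path=format' string into (name, path, format).

    The name is taken before the first '=' and the format after the last '=',
    so a path that itself contains '=' is still handled.
    """
    name, _, rest = spec.partition("=")
    path, _, fmt = rest.rpartition("=")
    if not (name and path and fmt):
        raise ValueError(f"expected [name]=[path]=[format], got {spec!r}")
    return name, path, fmt


def image_collection_args(specs):
    """Build the 'name=format' strings that would be passed on to the image."""
    return [f"{name}={fmt}" for name, _path, fmt in map(parse_collection_spec, specs)]


# Hypothetical paths, for illustration only.
specs = ["robust04=/data/robust04=trectext", "core18=/data/wapo=trectext"]
print(image_collection_args(specs))
```

The image then only sees the collection name and its format, while the host path stays on the jig side.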

I think the former (name=format) is better, as it is less redundant.

Sorry for brevity.

Fixed in #41