osirrc / jig

Jig for the Open-Source IR Replicability Challenge (OSIRRC)

Add snapshot image after initialization

johanneskiesel opened this issue

Currently, prepare commits the intermediate image after indexing, but not after initialization:
https://github.com/osirrc2019/jig/blob/4e3765cd59b0869c354b2d7c6f9da826624e470e/run.py#L47

Also committing after initialization can save time, network traffic, and disk space: because of the layered file system, the downloaded files are then stored only once rather than once per image.

The tag could be something like "{}-initialized".format(args.tag)
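As a rough sketch of that naming scheme (the helper names `snapshot_tag` and `image_reference` are made up for illustration and are not part of the jig), the tag for each phase could be derived like this:

```python
# Hypothetical helpers; only the "{}-initialized" pattern comes from the proposal above.
PHASES = ("initialized", "indexed")


def snapshot_tag(tag, phase):
    """Return the tag for the image committed after the given phase."""
    if phase not in PHASES:
        raise ValueError("unknown phase: {}".format(phase))
    return "{}-{}".format(tag, phase)


def image_reference(repo, tag, phase):
    """Return a full 'repo:tag' image reference for the given phase."""
    return "{}:{}".format(repo, snapshot_tag(tag, phase))


if __name__ == "__main__":
    # prints anserini-test:latest-initialized
    print(image_reference("anserini-test", "latest", "initialized"))
```

The same helper would cover the image committed after indexing, so both snapshots follow one convention.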

I'm 👎 on this but open to discussion.

If we did this, we would have two images. For example:

  • anserini-test:latest-initialized after init is called
  • anserini-test:latest-indexed after index is called

where the first image would be the base image for the second.

I think this would lead to some odd lifecycle management, where we'd need to update the base image of the second to be the updated (after re-init) first image, if that's even possible. Another approach may be to start a container from the second image and re-run the init script, but this again can get complicated (init scripts would then have to be idempotent and clean up existing files before downloading new ones).
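To make the idempotency requirement concrete, here is a minimal sketch of an init step that can be re-run safely. It assumes the init script downloads into a single directory; `idempotent_init` and `fake_fetch` are hypothetical names, and the fetch is a stand-in for a real download:

```python
import shutil
import tempfile
from pathlib import Path


def idempotent_init(target_dir, fetch):
    """Wipe target_dir, then repopulate it, so re-running leaves no stale files."""
    target = Path(target_dir)
    if target.exists():
        shutil.rmtree(target)  # clean up files left by a previous init
    target.mkdir(parents=True)
    fetch(target)
    return sorted(p.name for p in target.iterdir())


def fake_fetch(target):
    """Stand-in for a download step (assumption: real scripts fetch over the network)."""
    (target / "collection.txt").write_text("placeholder data\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data_dir = Path(tmp) / "data"
        first = idempotent_init(data_dir, fake_fetch)
        (data_dir / "stale.tmp").write_text("left over")  # simulate leftovers
        second = idempotent_init(data_dir, fake_fetch)
        print(first == second)
```

Every init script would need this kind of cleanup, which is part of the hidden complexity.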

I'm 👎 on this too for now as it would add a lot of hidden complexity.

Maybe there is some confusion here, then: why would you want to re-init an image? I thought init is just about setup. So my question is: why would I want to run setup every time I index a collection, when I can just start from a snapshot taken after setup was completed?

But in case you do need to re-init an image (I can imagine this if you encountered an error or so), why can't you just create both latest-initialized and latest-indexed again? I see you would need an additional "--purge" parameter (or so) to allow people to force an init even if an initialized image already exists.
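A minimal sketch of that decision logic, assuming the set of existing images is already known. The parser, `plan_prepare`, and the `--repo` option are hypothetical; only `--purge` and the two tag suffixes come from this thread:

```python
import argparse


def build_parser():
    """Hypothetical CLI; not the jig's actual argument parser."""
    parser = argparse.ArgumentParser(prog="prepare-sketch")
    parser.add_argument("--repo", required=True)
    parser.add_argument("--tag", default="latest")
    parser.add_argument("--purge", action="store_true",
                        help="force init even if an initialized image exists")
    return parser


def plan_prepare(existing_images, repo, tag, purge=False):
    """Return the (phase, image) steps to run, reusing the init snapshot if allowed."""
    initialized = "{}:{}-initialized".format(repo, tag)
    indexed = "{}:{}-indexed".format(repo, tag)
    steps = []
    if purge or initialized not in existing_images:
        steps.append(("init", initialized))  # (re)create the initialized snapshot
    steps.append(("index", indexed))  # the indexed image is always rebuilt on top
    return steps


if __name__ == "__main__":
    args = build_parser().parse_args(["--repo", "anserini-test", "--purge"])
    existing = {"anserini-test:latest-initialized"}
    for phase, image in plan_prepare(existing, args.repo, args.tag, args.purge):
        print(phase, "->", image)
```

Without `--purge`, an existing initialized image is reused and only indexing runs; with it, both images are recreated.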

I think the tradeoff is more complex lifecycle management... I think we're assuming that init/index will be done once and that's it.

I suppose with all the bells and whistles we can bind each subcommand to a hook and allow committing at each phase in a flexible manner? I'm inclined to punt on this for now though...
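For what it's worth, the hook idea could look roughly like the sketch below. It is not how the jig is structured today; `PhaseHooks` and the phase names are assumptions, and the "commit" is simulated by appending to a list rather than calling Docker:

```python
from collections import defaultdict


class PhaseHooks:
    """Hypothetical registry binding post-phase hooks to subcommands."""

    def __init__(self):
        self._hooks = defaultdict(list)

    def register(self, phase, hook):
        """Register a callable to run after the given phase completes."""
        self._hooks[phase].append(hook)

    def run(self, phase, action):
        """Run the phase's action, then its hooks in registration order."""
        result = action()
        for hook in self._hooks[phase]:
            hook(phase)
        return result


if __name__ == "__main__":
    committed = []  # stands in for images that would be committed
    hooks = PhaseHooks()
    hooks.register("init", lambda phase: committed.append("latest-initialized"))
    hooks.register("index", lambda phase: committed.append("latest-indexed"))

    hooks.run("init", lambda: None)
    hooks.run("index", lambda: None)
    hooks.run("search", lambda: None)  # no hook registered, so nothing is committed
    print(committed)
```

Committing after a phase then becomes opt-in per subcommand instead of being hard-coded into prepare.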

I see, and I want to say that it is not my intention to press this issue (that may have gotten lost in moving from the original mail to this issue). I'm well aware that this can be added later on without a problem (it requires no change to the specification), so you can just wait and see whether indexing is done just once or more often.

No worries! Thanks for your contributions!