datalad / datalad

Keep code, data, containers under control with git and git-annex

Home Page:http://datalad.org

run-records created on windows are not reexecutable on another platform

mih opened this issue · comments

(and vice versa, likely)

The issue is an analog of datalad/datalad-container#224

❱ datalad rerun
[INFO   ] run commit c81d336; (something) 
[INFO   ] Making sure inputs are available (this may take some time) 
run(error): /tmp/dv (dataset) [Input did not match existing file: input\sub-01\meg\sub-01_task-somato_meg.fif]
unlock(ok): f (file)

Input/output specifications are required in platform path conventions, but the platform is not recorded. This makes it generally impossible to determine the storage conventions. It may be possible to guess them using an approach similar to the one implemented in datalad/datalad-container#243. However, given the more general problem here, the fraction of use cases covered would be smaller.

I would suggest that all path specifications should be stored in POSIX notation (this would require them to be relative too, which may already be the case).

A guesstimation algorithm could check if any immediate subdirectory of a backslash-path matches an existing directory in the dataset, and then convert. However, this would need to be made in a way that is also (or only) accessible to rerun.
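A rough sketch of such a guess, in Python. The function name `guess_posix_path` and the `dataset_root` argument are made up for illustration and are not part of datalad. It only converts relative backslash paths whose first component exists as a directory in the dataset, and leaves everything else untouched.

```python
from pathlib import Path, PureWindowsPath


def guess_posix_path(spec: str, dataset_root: Path) -> str:
    """Heuristically convert a Windows-style relative path to POSIX notation.

    Hypothetical helper: converts only if the first path component
    matches an existing directory directly under ``dataset_root``.
    """
    if "\\" not in spec:
        # nothing that looks like a Windows separator
        return spec
    win = PureWindowsPath(spec)
    if win.anchor or not win.parts:
        # drive letters, UNC or rooted paths cannot be dataset-relative
        return spec
    if (dataset_root / win.parts[0]).is_dir():
        return win.as_posix()
    # no evidence that this is a Windows path; keep it verbatim
    return spec
```

The check is deliberately conservative: on POSIX a backslash is a legal filename character, so a path is only rewritten when the dataset content supports the guess.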

Pretty big mess.

I think a first line of action should be to stop the further creation of such records. The simplest way to achieve this is a patch of the _create_record() helper in run.

Right now this function just calls json.dumps(), which lets platform-native paths end up in the record.

It should perform an explicit conversion to POSIX paths for inputs and outputs, prior to JSON serialization.

If this change is to be made by a runtime patch, the helper can be wrapped in a function that performs this alteration, and then calls the original with the result.

Ping sent to chatroom. Will resolve via force-to-POSIX. This is required for a robust implementation of datalad/datalad-next#143

I think it is a great idea to store in POSIX what we can. Unfortunately command itself can contain windowsy relative paths, e.g. code\myscript even if people do specify input/outputs as arguments. But I do not think we could/should do anything about that (although could mutate first argument of the command which does not have = in it... but that is too much of a heuristic)

I thought about that scenario too. I believe that the "correct" solution is to document how a user needs to approach this in order to yield portable records. And that is, from my POV:

  • any script/executable provided by the dataset (or any subdataset) must be considered an input, hence should be declared as such
  • the command specification should never duplicate input/output declarations, and always reference them
  • given that any input/output declarations are paths or globs they should be normalizable for documentation in the prov record

If above rules are followed and normalization is implemented, we should have portable records, also for the case you pointed out.

A user can decide to ignore such guidelines, and live with the consequences, of course.

  • any script/executable provided by the dataset (or any subdataset) must be considered an input, hence should be declared as such

that's cute... but pragmatically might be painful to use/implement. We might want eventually to add a way to give inputs/outputs values, like --input-cmd PATH and then allow for use of {inputs[cmd]} or alike to at least partially reduce the pains. But note that such pains would be only Windows specific since others would be natively using POSIX paths.
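To illustrate what named inputs could look like: Python's str.format() already resolves `{inputs[cmd]}` against a dict, so a sketch needs very little. The `--input-cmd` option does not exist; `expand_command` and the dict of named inputs are hypothetical.

```python
from pathlib import PureWindowsPath


def expand_command(template: str, inputs: dict) -> str:
    """Fill ``{inputs[NAME]}`` placeholders with POSIX-normalized paths.

    Hypothetical helper: ``inputs`` maps a name (as a made-up
    ``--input-NAME PATH`` option might provide it) to a path.
    """
    normalized = {
        name: PureWindowsPath(path).as_posix()
        for name, path in inputs.items()
    }
    # str.format() looks up inputs[cmd] as the dict key "cmd"
    return template.format(inputs=normalized)


expanded = expand_command(
    "python {inputs[cmd]} {inputs[data]}",
    {"cmd": r"code\myscript.py", "data": r"input\sub-01"},
)
```

With this, the command template stored in the record contains no paths at all, only references to the (normalized) input declarations.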

A complementary feature could be to allow annotating paths in the invocation, e.g. via something like {path:code\myscript}, with our code handling those while handling the other {values}, which we do already anyway.
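A sketch of how such annotations might be processed, assuming the `{path:...}` syntax proposed above (it is not implemented anywhere). One function normalizes annotated paths for the record, the other renders them in a given platform's convention at execution time; both names are made up.

```python
import re
from pathlib import PurePosixPath, PureWindowsPath

# matches the hypothetical {path:...} annotation
PATH_ANNOTATION = re.compile(r"\{path:([^{}]+)\}")


def normalize_path_annotations(cmd: str) -> str:
    """Rewrite annotated paths to POSIX notation, for storing in a record."""
    return PATH_ANNOTATION.sub(
        lambda m: "{path:" + PureWindowsPath(m.group(1)).as_posix() + "}",
        cmd,
    )


def render_command(cmd: str, flavour=PurePosixPath) -> str:
    """Replace annotations with paths in the convention of ``flavour``.

    Pass ``PureWindowsPath`` to render for execution on Windows.
    """
    return PATH_ANNOTATION.sub(
        lambda m: str(flavour(*PurePosixPath(m.group(1)).parts)),
        cmd,
    )
```

For example, `normalize_path_annotations(r"python {path:code\myscript}")` would be stored as `python {path:code/myscript}`, and `render_command()` with `PureWindowsPath` turns that back into `python code\myscript` on rerun.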

Do I get it right here: there are different problems ?

  • input/output options: path specification not saved, not portable (1)
  • command itself can have path (2) (see comment above command itself can contain windowsy relative paths)
  • code itself may call other code using a specific path specification (3)

It seems 1 is a datalad issue,
2 may be solved by having a new type of option (input/output/code) once 1 is solved
3 is not a datalad issue, users should think about portability of their code.

Note for 3:
Capitalisation is also an issue: Linux file systems are case-sensitive, while the default macOS file system is not, so code that works perfectly on macOS can fail on Linux.
I think this goes beyond datalad and should not be solved by it.

Note for 2:
that would indeed be nearly always the case right? what to do instead now ?
(see figure 4 here: https://datalad-handbook--950.org.readthedocs.build/en/950/usecases/RStudio_user.html, that would not rerun on windows I presume)

Note: I am an R user; scripts are called via source(), which does the conversion in the background (of the path, not the capitalisation :) ).