datalad / datalad

Keep code, data, containers under control with git and git-annex

Home Page: http://datalad.org

`datalad update --follow parentds` does not follow parent dataset

mlell opened this issue · comments

What is the problem?

Hi,

I have a hierarchy of two datasets, where

  • B is a subdataset of A
  • B is behind its newest version, but
  • A has saved this older version of B

I made a small modification (a doc update) to B "en passant": I pulled the newest version of B, made the change, and pushed it. Now I want to revert B back to the version that is recorded by the superdataset.

In the folder of B, I called: datalad update --how=reset --follow=parentds

  • Expected behaviour: An older version of B is checked out and the superdataset is clean.
  • Actual behaviour: The newest version of B is still checked out and the superdataset is dirty (an "M" next to the subdataset folder in git status -s). I have to manually look up the recorded submodule commit in the superdataset and git reset --hard in the subdataset to get back to the commit expected by the superdataset.
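For reference, the manual workaround described above can be sketched at the git level like this (a sketch only; it assumes the superdataset is the parent directory and the subdataset is registered there under the path B):

```shell
# Run inside B: ask the superdataset (one directory up) which commit of B
# it has recorded in its HEAD tree, then hard-reset B to that commit.
# The paths ".." and "B" are assumptions about the layout.
recorded=$(git -C .. rev-parse HEAD:B)
git reset --hard "$recorded"
```

After this, git status -s in the superdataset should no longer show the "M" marker for B.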

What steps will reproduce the problem?

  • Create repo A, commit
  • Create repo B, commit
  • Add B as subdataset to A, save A
  • In B upstream, make a change
  • In B inside A, pull the change from upstream, make another change, commit, push to B upstream
  • In B inside A, do datalad update --how=reset --follow=parentds

DataLad information

datalad 0.18.3
git-annex version: 10.20230329-g30d7f9ad7
Linux Rocky 9

Additional context

No response

Have you had any success using DataLad before?

Datalad is useful in many cases for me, but my newcomer/biologist colleagues often get stuck where they would need advanced git/git-annex knowledge to understand error messages or to get things working again.

Thank you for the very clear description @mlell - I'm sorry that it sat so long unanswered.

I can't speak to the code or reasons behind this behavior, but the description of --follow in the datalad update docs says:

Note that the current dataset is always updated according to 'sibling'. This option has no effect unless a merge is requested and --recursive is specified

The datalad update in your case should be called from A:

datalad update -d .  --recursive --how reset --follow parentds B

Note that this translates to: "update dataset at current location (in this case, A) recursively, limiting operations to B".

If you want to run the command within B, then (depending on your relative paths) you can provide things like -d /path/to/A, -d .., or -d ^, where ^ means "top-most parent dataset", e.g.:

datalad update -d ^ --recursive --how reset --follow parentds

or, even better, again restricting to B (path argument relative to parent dataset root):

datalad update -d ^ --recursive --how reset --follow parentds B

adjusting to your dataset structure if needed.

p.s. I wrote a short reproducer for the A/B datasets you describe:

show code
mkdir /tmp/follow
cd /tmp/follow

# Create repo A, commit
datalad create A
cd A
echo "1 2 3" > file.dat
datalad save -m "First commit"
cd ..

# Create repo B, commit
datalad create B
cd B
echo "4 5 6" > file.dat
datalad save -m "First commit"
cd ..

# Extra step: allow pushing to checked out branch (will serve as upstream)
# (datalad create-sibling does that too)
cd B
git config --local receive.denyCurrentBranch updateInstead
cd ..

# Add B as subdataset to A, save A
cd A
datalad clone -d . /tmp/follow/B
cd ..

# In B upstream, make a change
cd B
datalad run -m "Second commit" -i file.dat -o file.dat "echo 7 8 9 >> file.dat"
cd ..

# In B inside A, pull the change from upstream, make another change, commit, push to B upstream
cd A/B
datalad update -d . -s origin --how ff-only  # note: not recursive, so B gets updated, while A becomes "dirty"
datalad run -m "Third commit (en passant)" -i file.dat -o file.dat "echo 7 8 9 > file.dat"
datalad push -d . --to origin
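
A possible continuation of the reproducer (not part of the original script, and assuming the layout it creates) would be to run the final step from the report and then the suggested invocation from A:

```shell
# In B inside A, reproduce the reported call (leaves A dirty):
datalad update -d . --how reset --follow parentds

# Alternatively, the suggested invocation from A, restricting to B:
cd /tmp/follow/A
datalad update -d . --recursive --how reset --follow parentds B
git status -s   # B should no longer show as modified
```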

Dear @mlsw, thank you for taking a look into this. I did not know about -d ^, which seems useful. However, this also means that for your last suggestion I need to adapt the path at the end depending on what the topmost dataset is. So I might need to give
datalad update -d ^ -r ... B if, say, A is the topmost dataset and B is directly in the folder A/B, but if A itself is checked out as a subdataset of X, I might need datalad update -d ^ -r ... A/B? That would be quite a cumbersome way to give the instruction, since no information other than the direct superdataset is required for this operation.

It seems that the current design of datalad avoids checking the superdataset by itself unless that is requested via -d. There are probably reasons behind this, but from my point of view I have always found -d very confusing: there is always that duplication with the positional path argument at the end, and it leads to subtly different behaviours that are confusing for newcomers, for example:

  • datalad clone -d . URL FOLDER adds URL as datalad-url to the .gitmodules file, while datalad clone URL FOLDER && datalad save does not, even though URL is still available as the remote URL of the subdataset
  • datalad status -d DATASET PATH: I am not able to grasp the subtleties of which information I get from which combination of DATASET and PATH
  • datalad drop -d DATASET PATH: interactions between DATASET and PATH, and then more subtleties depending on whether PATH is a dataset of its own and whether it is terminated with a slash or not.
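
To illustrate the first point, a .gitmodules entry as written by datalad clone -d . carries extra datalad-* fields alongside the usual submodule keys; roughly like this (all values here are hypothetical placeholders):

```
[submodule "B"]
	path = B
	url = https://example.com/B
	datalad-id = 00000000-0000-0000-0000-000000000000
	datalad-url = https://example.com/B
```

With plain datalad clone URL FOLDER followed by datalad save, only the standard path/url keys end up in .gitmodules.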

I know this is quite an extensive topic, and the reasons for having -d are probably deeper than what I see here from the user perspective. Until I am able to dig deeper into this behaviour, I just want to note somewhere that it would be much easier for me to explain datalad usage to our newcomers if -d did not exist. It is also cumbersome to specify the exact dataset locations: I need to give the exact location of the dataset even though very few of the folders I could give make sense. It would be far easier if there were just a --save switch that called datalad save . after the action (because that is what I use -d for most often), and/or an equivalent of the -R switch (maybe -S for "recurse (s)uperdatasets"?) that lets me specify how many levels of superdatasets to modify.

It looks like there is no simple solution for this specific problem (EDIT: actually, you provided a solution that does work), and the remarks about -d are probably not thought out enough for an issue report of their own. So feel free to close this issue if appropriate.