jupyter / nbdime

Tools for diffing and merging of Jupyter notebooks.

Home Page:http://nbdime.readthedocs.io

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

git-nbdiffdriver fails to parse branch1:file1..branch2:file2 syntax

dniku opened this issue · comments

I came across this issue while maintaining the https://github.com/yandexdataschool/Practical_RL repository, so I will provide examples based on it.

With the following setting in ~/.gitconfig

[diff "jupyternotebook"]
    command = git-nbdiffdriver diff

the following command fails:

$ git diff master:week07_seq2seq/practice_tf.ipynb..coursera:week6_outro/seq2seq/practice_tf.ipynb
usage: git-nbdiffdriver [-h] [--version] [--config] [--log-level {DEBUG,INFO,WARN,ERROR,CRITICAL}] {diff,webdiff,config} ...
git-nbdiffdriver: error: unrecognized arguments: index da535d8..a2c89e8 100644

fatal: external diff died, stopping at week07_seq2seq/practice_tf.ipynb

Apparently, either git-nbdiffdriver does not parse its arguments properly, or git does not pass them correctly.

nbdiff-web also has a similar issue:

$ nbdiff-web master:week07_seq2seq/practice_tf.ipynb..coursera:week6_outro/seq2seq/practice_tf.ipynb
Traceback (most recent call last):
  File "/home/ser/.local/bin/nbdiff-web", line 10, in <module>
    sys.exit(main())
  File "/home/ser/.local/lib/python3.7/site-packages/nbdime/webapp/nbdiffweb.py", line 96, in main
    return main_diff(opts)
  File "/home/ser/.local/lib/python3.7/site-packages/nbdime/webapp/nbdiffweb.py", line 79, in main_diff
    base, remote, path = resolve_diff_args(opts)
  File "/home/ser/.local/lib/python3.7/site-packages/nbdime/args.py", line 406, in resolve_diff_args
    if not is_gitref(base):
  File "/home/ser/.local/lib/python3.7/site-packages/nbdime/gitfiles.py", line 77, in is_gitref
    is_valid_gitref(candidate)
  File "/home/ser/.local/lib/python3.7/site-packages/nbdime/gitfiles.py", line 86, in is_valid_gitref
    repo.commit(ref)
  File "/usr/lib/python3.7/site-packages/git/repo/base.py", line 466, in commit
    return self.rev_parse(text_type(rev) + "^0")
  File "/usr/lib/python3.7/site-packages/git/repo/fun.py", line 322, in rev_parse
    obj = obj[rev[start:]]
  File "/usr/lib/python3.7/site-packages/git/objects/tree.py", line 298, in __getitem__
    return self.join(item)
  File "/usr/lib/python3.7/site-packages/git/objects/tree.py", line 225, in join
    item = tree[token]
  File "/usr/lib/python3.7/site-packages/git/objects/tree.py", line 298, in __getitem__
    return self.join(item)
  File "/usr/lib/python3.7/site-packages/git/objects/tree.py", line 244, in join
    raise KeyError(msg % file)
KeyError: "Blob or Tree named 'practice_tf.ipynb..coursera:week6_outro' not found"

Needless to say, without the ~/.gitconfig setting git works as expected.

The git docs says that the diff driver is supposed to be called with the following arguments:

path old-file old-hex old-mode new-file new-hex new-mode

I wonder if this is a newer feature, or an older one. Which version of git do you use?

$ git --version
git version 2.22.0

This syntax is documented in git-diff docs:

For a more complete list of ways to spell <commit>, see "SPECIFYING REVISIONS" section in gitrevisions[7].

Please note that the .. syntax indicates range. git-diff docs say this should be synonymous to passing the two endpoints ("This is synonymous to the previous form."). With git-nbdiffdriver, both result in the same error:

$ git diff master:week07_seq2seq/practice_tf.ipynb..coursera:week6_outro/seq2seq/practice_tf.ipynb
usage: git-nbdiffdriver [-h] [--version] [--config] [--log-level {DEBUG,INFO,WARN,ERROR,CRITICAL}] {diff,webdiff,config} ...
git-nbdiffdriver: error: unrecognized arguments: index 6681040..b9a8a11 100644

fatal: external diff died, stopping at week07_seq2seq/practice_tf.ipynb

$ git diff master:week07_seq2seq/practice_tf.ipynb coursera:week6_outro/seq2seq/practice_tf.ipynb
usage: git-nbdiffdriver [-h] [--version] [--config] [--log-level {DEBUG,INFO,WARN,ERROR,CRITICAL}] {diff,webdiff,config} ...
git-nbdiffdriver: error: unrecognized arguments: index 6681040..b9a8a11 100644

fatal: external diff died, stopping at week07_seq2seq/practice_tf.ipynb

However, with nbdiff-web the two produce different exceptions:

$ nbdiff-web master:week07_seq2seq/practice_tf.ipynb..coursera:week6_outro/seq2seq/practice_tf.ipynb
Traceback (most recent call last):
  File "/home/ser/.local/bin/nbdiff-web", line 10, in <module>
    sys.exit(main())
  File "/home/ser/.local/lib/python3.7/site-packages/nbdime/webapp/nbdiffweb.py", line 96, in main
    return main_diff(opts)
  File "/home/ser/.local/lib/python3.7/site-packages/nbdime/webapp/nbdiffweb.py", line 79, in main_diff
    base, remote, path = resolve_diff_args(opts)
  File "/home/ser/.local/lib/python3.7/site-packages/nbdime/args.py", line 406, in resolve_diff_args
    if not is_gitref(base):
  File "/home/ser/.local/lib/python3.7/site-packages/nbdime/gitfiles.py", line 77, in is_gitref
    is_valid_gitref(candidate)
  File "/home/ser/.local/lib/python3.7/site-packages/nbdime/gitfiles.py", line 86, in is_valid_gitref
    repo.commit(ref)
  File "/usr/lib/python3.7/site-packages/git/repo/base.py", line 466, in commit
    return self.rev_parse(text_type(rev) + "^0")
  File "/usr/lib/python3.7/site-packages/git/repo/fun.py", line 322, in rev_parse
    obj = obj[rev[start:]]
  File "/usr/lib/python3.7/site-packages/git/objects/tree.py", line 298, in __getitem__
    return self.join(item)
  File "/usr/lib/python3.7/site-packages/git/objects/tree.py", line 225, in join
    item = tree[token]
  File "/usr/lib/python3.7/site-packages/git/objects/tree.py", line 298, in __getitem__
    return self.join(item)
  File "/usr/lib/python3.7/site-packages/git/objects/tree.py", line 244, in join
    raise KeyError(msg % file)
KeyError: "Blob or Tree named 'practice_tf.ipynb..coursera:week6_outro' not found"

$ nbdiff-web master:week07_seq2seq/practice_tf.ipynb coursera:week6_outro/seq2seq/practice_tf.ipynb
Traceback (most recent call last):
  File "/home/ser/.local/bin/nbdiff-web", line 10, in <module>
    sys.exit(main())
  File "/home/ser/.local/lib/python3.7/site-packages/nbdime/webapp/nbdiffweb.py", line 96, in main
    return main_diff(opts)
  File "/home/ser/.local/lib/python3.7/site-packages/nbdime/webapp/nbdiffweb.py", line 79, in main_diff
    base, remote, path = resolve_diff_args(opts)
  File "/home/ser/.local/lib/python3.7/site-packages/nbdime/args.py", line 415, in resolve_diff_args
    if is_gitref(base) and not is_gitref(remote):
  File "/home/ser/.local/lib/python3.7/site-packages/nbdime/gitfiles.py", line 77, in is_gitref
    is_valid_gitref(candidate)
  File "/home/ser/.local/lib/python3.7/site-packages/nbdime/gitfiles.py", line 86, in is_valid_gitref
    repo.commit(ref)
  File "/usr/lib/python3.7/site-packages/git/repo/base.py", line 466, in commit
    return self.rev_parse(text_type(rev) + "^0")
  File "/usr/lib/python3.7/site-packages/git/repo/fun.py", line 322, in rev_parse
    obj = obj[rev[start:]]
  File "/usr/lib/python3.7/site-packages/git/objects/tree.py", line 298, in __getitem__
    return self.join(item)
  File "/usr/lib/python3.7/site-packages/git/objects/tree.py", line 225, in join
    item = tree[token]
  File "/usr/lib/python3.7/site-packages/git/objects/tree.py", line 298, in __getitem__
    return self.join(item)
  File "/usr/lib/python3.7/site-packages/git/objects/tree.py", line 244, in join
    raise KeyError(msg % file)
KeyError: "Blob or Tree named 'practice_tf.ipynb^0' not found"

The first of these errors might suggest that tokenization is broken.

This syntax is documented in git-diff docs:

For a more complete list of ways to spell , see "SPECIFYING REVISIONS" section in gitrevisions[7].

Thanks for the links. However, that is the docs for the CLI of git itself. In their section on external diff drivers (this page under the heading "Defining an external diff driver") it says:

To define an external diff driver jcdiff, add a section to your $GIT_DIR/config file (or $HOME/.gitconfig file) like this:

[diff "jcdiff"]
    command = j-c-diff

When Git needs to show you a diff for the path with diff attribute set to jcdiff, it calls the command you specified with the above configuration, i.e. j-c-diff, with 7 parameters, just like GIT_EXTERNAL_DIFF program is called. See git[1] for details.

As such, git breaks their own docs by passing .. notation to external diff tools. Ideally this should be reported to git. If they have updated their API, I would strongly prefer them to publish a new spec for that before attempting to implement support.

However, with nbdiff-web the two produce different exceptions:

Neither the .. nor <ref>:<path> syntax is currently supported by the nbdiff[-web] CLI. We use GitPython under the hood, so if that package has support for any of these, then PRs that adds support for this in nbdime will be very welcome :)

Before reporting the problem to git@vger.kernel.org, I would like to ensure that this is indeed a bug. I have created a small repo to reproduce the problem: https://github.com/dniku/git-external-diff-argv Could you verify that what this repo describes is actually a bug?

Whether it is a bug or simply outdated documentation might not be very clear. I would of course also recommend that you search the mailing list archives first for similar reports (it might even be fixed).

Other than that the report seems sane from my view, but I would maybe link directly to the relevant header: https://www.git-scm.com/docs/git/2.22.0#Documentation/git.txt-codeGITEXTERNALDIFFcode

Submitted here.

Would you like to to file a separate issue for lack of support for <ref>:<path> in nbdiff[-web] to keep track of that problem?

Would you like to to file a separate issue for lack of support for : in nbdiff[-web] to keep track of that problem?

We can do that issue here, since the other issue has been pushed upstream.

My (somewhat arbitrary) guess is that perhaps it was intended to use .rev_parse() instead of .commit() in https://github.com/jupyter/nbdime/blob/f96ac65/nbdime/gitfiles.py#L86.

In [1]: from git import Repo                                                                                                                                                                                                                                                                                                                                                                                                         

In [2]: repo = Repo('.')                                                                                                                                                                                                                                                                                                                                                                                                             

In [3]: repo.rev_parse('origin/branch1:file1.txt')                                                                                                                                                                                                                                                                                                                                                                                   
Out[3]: <git.Blob "802b1c4ed7b06162b2ce09b7db72a576695b96e5">

In [4]: repo.commit('origin/branch1:file1.txt')                                                                                                                                                                                                                                                                                                                                                                                      
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-4-eb9fabe894f3> in <module>
----> 1 repo.commit('origin/branch1:file1.txt')

/usr/lib/python3.7/site-packages/git/repo/base.py in commit(self, rev)
    464             return self.head.commit
    465         else:
--> 466             return self.rev_parse(text_type(rev) + "^0")
    467 
    468     def iter_trees(self, *args, **kwargs):

/usr/lib/python3.7/site-packages/git/repo/fun.py in rev_parse(repo, rev)
    320                     obj = obj.tree
    321                 # END get tree type
--> 322                 obj = obj[rev[start:]]
    323                 parsed_to = lr
    324             else:

/usr/lib/python3.7/site-packages/git/objects/tree.py in __getitem__(self, item)
    296         if isinstance(item, string_types):
    297             # compatibility
--> 298             return self.join(item)
    299         # END index is basestring
    300 

/usr/lib/python3.7/site-packages/git/objects/tree.py in join(self, file)
    242                                                                join_path(self.path, info[2]))
    243             # END for each obj
--> 244             raise KeyError(msg % file)
    245         # END handle long paths
    246 

KeyError: "Blob or Tree named 'file1.txt^0' not found"

rev_parse() can successfully resolve the <ref>:<path> notation to a Blob. Can that be used by nbdime?

rev_parse() can successfully resolve the <ref>:<path> notation to a Blob. Can that be used by nbdime?

Sounds like that is what we want. The logic for where we currently use commit would need to be updated to support all the return types from rev_parse though: https://gitpython.readthedocs.io/en/stable/reference.html#git.repo.base.Repo.rev_parse

For the record, while upstream is deciding what to do with this, I am using the following workaround:

#!/usr/bin/env python

import sys
import subprocess

if len(sys.argv) == 10:
    old_filename, old_file, old_hex, old_mode, new_file, new_hex, new_mode, new_filename, wtf = sys.argv[1:]
    subprocess.check_call([
        'nbdiff',
        old_file,
        new_file,
    ])
else:
    assert False

and

env GIT_EXTERNAL_DIFF=./wrap_nbdiff.py git diff ...

It has a weird issue: before showing the actual diff, the following error is printed multiple times:

fatal: cannot run ./wrap_nbdiff.py: No such file or directory
fatal: external diff died, stopping at before

but it works, at least.

(nbdiff can be also replaced with nbdiff-web, which does not trigger this weird error and also looks better, but that's not exactly an equivalent for git diff)

Jeff King writes:

The last two are:

  • when the destination path differs from the source path, we append
    it. This is generally a sign of a rename/copy, though it triggers
    in the blob case because Git has been manually given a pair of files
    with non-matching names.

  • the final one is metainfo that Git typically prints between the
    "diff" header and the diff itself. This is added only when we add
    the filename, and would generally carry information about the rename
    (but of course since there isn't one, it has only the index line).

These were both added by 427dcb4bca ([PATCH] Diff overhaul, adding half
of copy detection., 2005-05-21). I think the intent was that existing
diff commands would/could ignore the extra arguments.

Certainly the world would be a better place if those were described in
the external-diff documentation (specifically in relation to renames,
which is their intended use).

I guess this can be used as reference for git-nbdiffdriver CLI args.

Seems reasonable, although I would also hope to get an explanation for the arguments passed when using .., and how that overlaps with everything else.

Never mind the last comment :). Do you want to have a go at fixing this?

Huh. Strangely, it seems that not only the rename_to arg is present in the latest version, but it has been there since the inception (and you were the author of that first commit).

I confirm that the addition of a single line

diff_parser.add_argument('rename_metadata', type=Path, nargs='?', default=None, action=SkipAction)

in the add_git_diff_driver_args() function resolves the

usage: git-nbdiffdriver [-h] [--version] [--config] [--log-level {DEBUG,INFO,WARN,ERROR,CRITICAL}] {diff,webdiff,config} ...
git-nbdiffdriver: error: unrecognized arguments: index 6681040..b9a8a11 100644

error for me.

#498 should fix one of the issues discussed here. I'm not yet sure that I want to take a stab at integrating support for diffing blobs into nbdiff.

I'm not yet sure that I want to take a stab at integrating support for diffing blobs

That's ok, but be aware that my TODO pile is quite high at the moment 😞

In the meantime, it seems that the following alias in ~/.gitconfig is a workaround for the nbdiff-web issue:

[alias]
    jdiff = "!f() { GIT_EXTERNAL_DIFF='bash -c \"nbdiff-web $2 $5\"' git diff \"$@\"; }; f"