gitpython-developers / GitPython

GitPython is a python library used to interact with Git repositories.

Home Page: http://gitpython.readthedocs.org

Diffable.diff is misleadingly annotated regarding special `other` values

EliahKagan opened this issue

The other parameter of Diffable.diff is annotated this way:

other: Union[Type["Index"], "Tree", "Commit", None, str, object] = Index,

Diffable.diff accepts a few special values:

  • None for a working tree diff.
  • Diffable.Index, a type object that is meant to be used opaquely rather than instantiated, to compare to the index.
  • NULL_TREE, a direct instance of object (not to be confused with Object), for a diff against an empty tree.

The type of NULL_TREE should be expressed better than object

The main problem here, which exists in the documentary effect of the annotations even if a type checker is never used, is the presence of object in the union. Since object is the root of the type hierarchy in Python, a union of anything with object is equivalent to just object. But the intent of the annotation is not to express that it is correct to pass arbitrary objects as other. Instead, the idea is that it is okay to pass NULL_TREE.
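A minimal sketch of why the object alternative defeats the annotation. It uses a hypothetical stand-in function, diff_stub, rather than the real Diffable.diff. Every Python value is an instance of object, so no argument is left for a reader or a type checker to reject:

```python
from typing import Union


def diff_stub(other: Union[str, None, object] = None) -> str:
    """Hypothetical stand-in for a method annotated like Diffable.diff."""
    return type(other).__name__


# Every value is an instance of object, so each of these calls satisfies
# the annotation, including ones that make no sense as `other`.
nonsense_values = [42, 3.14, ["a", "list"], {"a": "dict"}, object()]
accepted = [isinstance(value, object) for value in nonsense_values]
names = [diff_stub(value) for value in nonsense_values]
```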

This is also the cause of the incompatible override type error on the IndexFile.diff override in git.index.base, which does not include the object alternative:

other: Union[Type["git_diff.Diffable.Index"], "Tree", "Commit", str, None] = git_diff.Diffable.Index,

There are several places in GitPython's source code where an overridden method has an incompatible signature, violating substitutability. The reason this place is of interest is that it is hard to tell, just by examining the code or the type errors, what this means conceptually. It turns out it means the override does not support NULL_TREE:

>>> import git
>>> repo = git.Repo()
>>> repo.index.diff(git.NULL_TREE)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\ek\source\repos\GitPython\git\index\base.py", line 1521, in diff
    raise ValueError("other must be None, Diffable.Index, a Tree or Commit, was %r" % other)
ValueError: other must be None, Diffable.Index, a Tree or Commit, was <object object at 0x00000265862B8A30>

But object does not express (to humans or tools) that what is really meant is the literal value NULL_TREE.
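The substitutability problem can be reproduced in miniature. The sketch below is not GitPython's code: Base, Narrowed, NULL_TREE_STUB and diff_against_empty_tree are hypothetical names chosen for illustration. The override accepts fewer values than the method it overrides, so code written against the base class can fail when handed the subclass:

```python
from typing import Union

NULL_TREE_STUB = object()  # hypothetical stand-in for git.NULL_TREE


class Base:
    def diff(self, other: Union[str, None, object] = None) -> str:
        if other is NULL_TREE_STUB:
            return "empty tree"
        return "other"


class Narrowed(Base):
    # The parameter type is narrower than in Base, which a type checker
    # reports as an incompatible override (suppressed here).
    def diff(self, other: Union[str, None] = None) -> str:  # type: ignore[override]
        if other is None or isinstance(other, str):
            return "other"
        raise ValueError("other must be None or a str, was %r" % (other,))


def diff_against_empty_tree(diffable: Base) -> str:
    """Valid for any Base according to the annotations."""
    return diffable.diff(NULL_TREE_STUB)
```

Calling diff_against_empty_tree(Base()) returns "empty tree", while diff_against_empty_tree(Narrowed()) raises ValueError, mirroring the traceback above.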

We cannot currently write Literal[NULL_TREE], but if NULL_TREE is made into an enumeration constant, then this can be done, though cumbersomely it seems it can only be written--in annotations--as Literal[ET.NULL_TREE] where ET is the declared enumeration type. Or, if ET has NULL_TREE as its only value, then writing ET should be sufficient, though this may be less intuitive due to not involving Literal in any way while expressing a literal.
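A minimal sketch of the enumeration approach, assuming ET as the placeholder name for the enumeration type and a hypothetical function describe in place of the real diff method. It is an illustration of the idea, not the code in the PR:

```python
import enum
from typing import Literal, Union


class ET(enum.Enum):
    """Hypothetical enumeration type; ET is only a placeholder name."""

    NULL_TREE = enum.auto()


# Module-level alias, so callers can keep writing NULL_TREE.
NULL_TREE = ET.NULL_TREE


def describe(other: Union[str, None, Literal[ET.NULL_TREE]]) -> str:
    """Hypothetical function whose annotation names NULL_TREE specifically."""
    if other is None:
        return "working tree"
    if other is ET.NULL_TREE:
        return "empty tree"
    # Only the str alternative remains here.
    return "revision " + other
```

With this annotation, passing an arbitrary object is no longer covered by the declared type, so a type checker has something to reject.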

To be clear, this will not cause the mypy error about an incompatible override to go away. But it will make the error make sense, as well as making the suppression comment on it make sense in terms of what it is expressing. And it will allow type checkers to catch errors about incorrect values of other.

Diffable.Index should ideally also be improved

Diffable.Index is used as an opaque constant, as is NULL_TREE. But Diffable.Index is a class. This is confusing because it should not be instantiated. Its static type can be expressed pretty specifically, as Type[Diffable.Index] (as is done; I'm omitting the quotes here for simplicity). However, this does not allow if-else logic to be type-checked exhaustively, because there is no guarantee that the Diffable.Index class is the only value of the static type Type[Diffable.Index]:

class Derived(Diffable.Index):
    pass

Hopefully no one is doing that--just as hopefully no one is instantiating it--but mypy and other tools, including editors, cannot tell that Diffable.Index is the only value of Type[Diffable.Index]. Humans who notice that it is intended to be used as an opaque constant will recognize this, however. Also, it should be possible to get type checkers to recognize it as the only value, and type-check if-else logic exhaustively, by decorating the definition of Index with @final. This works for some type checkers, at least pyright and pylance. But it does not work for mypy.
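A minimal sketch of the @final idea, using a cut-down, hypothetical stand-in for Diffable and a hypothetical function describe. Note that @final is only a signal to type checkers: it is not enforced at runtime, so subclassing the class would still succeed when the code runs.

```python
from typing import Type, Union, final


class Diffable:
    """Cut-down, hypothetical stand-in for git.diff.Diffable."""

    @final  # tells type checkers that Index must not be subclassed
    class Index:
        """Opaque marker, not meant to be instantiated."""


def describe(other: Union[Type[Diffable.Index], None]) -> str:
    if other is None:
        return "working tree"
    # With @final, a checker that supports this kind of narrowing can treat
    # Diffable.Index itself as the only remaining possibility here.
    return "index"
```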

The bigger issue is that it is easy to be confused by it and, for example, instantiate it.

Diffable.Index can also be made an enumeration constant, and all reasonable usage will continue to work at runtime. However, if someone wants to statically annotate their own method to accept or return it, then they would need to change the annotations, since expressing it with Type[Diffable.Index] would no longer be able to work. (Two separate special objects cannot both be used, because existing subclasses of Diffable outside of GitPython would only be covering one. But if it is redefined, existing code can continue to cover it at runtime, since it should only ever have been, or be, used as an opaque constant.)

It is also possible for unreasonable usage to break at runtime, such as attempting to check if x is Diffable.Index by writing issubclass(x, Diffable.Index). I am less worried about this.
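A minimal sketch of the difference between the two kinds of check, assuming a hypothetical enumeration named DiffConstants and a cut-down stand-in for Diffable. Identity comparison keeps working once Index is an enumeration member, while issubclass raises TypeError because the member is not a class:

```python
import enum


class DiffConstants(enum.Enum):
    """Hypothetical enumeration holding the opaque constant."""

    INDEX = enum.auto()


class Diffable:
    """Cut-down, hypothetical stand-in for git.diff.Diffable."""

    Index = DiffConstants.INDEX  # no longer a class, still usable opaquely


def is_index_by_identity(x: object) -> bool:
    # Reasonable usage: treats Index as an opaque constant.
    return x is Diffable.Index


def is_index_by_issubclass(x: object) -> bool:
    # Worked while Index was a class; raises TypeError once Index is an
    # enumeration member, since issubclass requires classes.
    return issubclass(x, Diffable.Index)  # type: ignore[arg-type]
```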

It seems to me that this change is worth making for Diffable.Index as well, but this is a judgment call, and the new interface should perhaps be given special attention in review. When I searched for existing code outside GitPython, its forks, vendored copies of it, etc., that used these annotations in a way where they would cause a new static type error, I didn't find anything, but I only used GitHub code search, and I may not have found everything even on GitHub. My argument for making this change is to achieve greater correctness without breaking anything at runtime, and not that I am at all confident nobody will have to change their code to keep it passing static checks.

Connection to #1859

I've included fixes for both of these things in #1859. See 65863a2 especially, whose docstring has some more information about the specific design choices there.

If you decide you don't want the change to Diffable.Index there, that can be undone without sacrificing the rest. Although I am in favor of this aspect of the change, it should not be accepted on the basis of the sunk-cost fallacy. Furthermore, the sunk cost would be quite low, because it was already helpful in enabling me to figure out what needed to be done in other related parts of the code.

The purpose of this issue is twofold:

  • To allow the bug to be tracked if #1859 is changed in such a way as not to fully solve it.
  • To describe it separately from that PR so it's easier to understand. I considered describing it in comments there, but thought opening an issue would be better.

Thanks a lot for documenting these issues with the diff API. Creating an API like this was clearly a mistake, as the heavily overloaded input arguments don't help with writing readable code at all, all while 'leveraging' the class hierarchy.

Thus I am not surprised that this is notoriously difficult to annotate, but am happy that this seems to be possible.

> My argument for making this change is to achieve greater correctness without breaking anything at runtime, and not that I am at all confident nobody will have to change their code to keep it passing static checks.

I'd argue that a failing mypy check doesn't constitute a breaking change; it's a grey zone I am OK to walk into if it one day allows type checkers to make using GitPython easy (due to them actually being accurate). This is one of the few areas where GitPython can still change as well. Knowing that such a change might break somebody's CI, the only constraint I'd apply to myself is that such 'breaking' changes shouldn't be made lightly.

> Knowing that such a change might break somebody's CI, the only constraint I'd apply to myself is that such 'breaking' changes shouldn't be made lightly.

Speaking of this, in #1859 I should probably have raised the question of what to call the type of the enumeration in git.diff in which NULL_TREE and INDEX (the latter being what Diffable.Index now aliases) are defined. I called it DiffConstants. I mentioned one aspect of this in 65863a2:

> the name DiffConstants (rather than, e.g., Constants) should avoid confusion even if it ends up in another scope unexpectedly.

There is also the question of singular and plural. This enumeration is named as a namespace of the constants it defines, rather than in the style one would ordinarily name a type: it is awkward to say "NULL_TREE is a DiffConstants". Since enumerations always define constants, typically they shouldn't have Constants or Constant in their type names, but here it seemed to help distinguish it from everything else in git.diff. In some languages there are strong conventions for naming enumerations that help resolve this sort of thing; for example, in C#, enumerations should have plural names when they provide flags (and have the [Flags] attribute) and should have singular names otherwise. I don't know of such conventions for Python.

Possibilities like DiffSpecial did not seem better. Anything without Diff in its name would be confusing when accessed directly in git. The top-level git module currently uses wildcard imports for almost everything in GitPython, but that can and should change even separately from this, so one approach could be to change that, at least for its imports from git.diff, so as not to list that enumeration's type: NULL_TREE and INDEX should still be imported in git.__init__ and listed in git.__all__, because NULL_TREE was there before, but DiffConstants could defensibly not be imported in git.__init__.

(In contrast, code using GitPython would have trouble annotating its own functions--if it needs to use this type--if the type were nonpublic, which its exclusion from the subpackage git.diff.__all__ would signify unless otherwise documented, and would appear to signify even if otherwise documented.)

If this is to be changed, it would be best changed before the next release, after which the old name may have to be kept as an alias of the new name in order to avoid breakage. (Removing the old name after a release would probably be an actual breaking change because it would not just break type checking but would also cause runtime errors to occur in imports and elsewhere, in reasonable code written after the first name was added and before it was removed.)

To be clear, I'm not saying this needs to change, only that there may be a benefit to considering the name and deciding if it does.

Thanks for highlighting these possible concerns.

Since naming is hard, and those names didn't strike me as problematic, I think it's good to go with them, as the new state of affairs is certainly better than the previous one.