gitpython-developers / GitPython

GitPython is a python library used to interact with Git repositories.

Home Page:http://gitpython.readthedocs.org

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Git.version_info type annotation is overly specific in length

EliahKagan opened this issue · comments

The version_info property is declared to be a length-4 tuple:

GitPython/git/cmd.py

Lines 833 to 841 in afa5754

@property
def version_info(self) -> Tuple[int, int, int, int]:
"""
:return: tuple(int, int, int, int) tuple with integers representing the major, minor
and additional version numbers as parsed from git version.
This value is generated on demand and is cached.
"""
return self._version_info

But it often has fewer values, which is intentional:

GitPython/git/cmd.py

Lines 814 to 826 in afa5754

def _set_cache_(self, attr: str) -> None:
if attr == "_version_info":
# We only use the first 4 numbers, as everything else could be strings in fact (on Windows).
process_version = self._call_process("version") # Should be as default *args and **kwargs used.
version_numbers = process_version.split(" ")[2]
self._version_info = cast(
Tuple[int, int, int, int],
tuple(int(n) for n in version_numbers.split(".")[:4] if n.isdigit()),
)
else:
super()._set_cache_(attr)
# END handle version info

So the type annotation should be changed, but I am unsure what it should be changed to, whether any additional documentation should be added, and whether having fewer than some number of numeric fields should be treated as an error and cause an exception to be raised.

If the type annotation should reflect that only a few specific lengths are reasonable, then it could be expressed as a Union of specific-length Tuple types. Otherwise it could be expressed as a variable-length tuple by writing the type as Tuple[int, ...].

I recommend this be solved in such a way that the cast can be removed. Currently, the cast is working around ambiguity in how the return type is intended to be understood.

This is separate from #1829, though I noticed it while looking into that and they could possibly be fixed (or improved or otherwise closed) together.

Indeed, having more than major.minor.patch as part of the version is unlikely and I don't know where that would have been encountered. I see git version 2.39.3 (Apple Git-145) as version on my system, but that wouldn't parse here either as four digits.

Thus this can probably be reduced to 3 digits to be more realistic.

Also a question is if that would be a breaking change.

Some downstream versions have a fourth numeric part with no intervening non-numeric parts. This fourth part identifies the downstream version number, to reflect that the downstream version may have its own patches and also that there may be more than one downstream version per upstream (major, minor, patch) version, such as to backport security fixes from later versions. In particular, the system git on the CentOS 7 system I used to produce #1829 gives:

❯ /usr/bin/git --version
git version 1.8.3.1

That produces a version_info of (1, 8, 3, 1). I think it may cause problems to drop the fourth digit after having included it, since it makes sense to me that some applications may check for and use that, so I recommend not truncating it to three.

If it is always three or four parts, then the type could be given as Union[Tuple[int, int. int], Tuple[int, int, int, int]]. Then there is still the question of whether having fewer than three parts should raise an exception, or what it should do.

An argument for using Tuple[int, ...] instead is that it shouldn't be necessary to rely on the length or pick out the individual numbers in most cases, since tuples compare lexicographically. However, because there is probably some such code, it may be better to change it to a union of fixed-length tuple types, so that type-checking can remain as precise as possible.

For the annotation change itself, that may be breaking in the weak sense that type checkers would start flagging new type errors. But these would be type errors that were already present. So I think that would be justified as a bugfix.

Another option is to go in the opposite direction from Tuple[int, ...] and take this opportunity to make it a named tuple (or union of named tuple types), so that in addition to iterating though it or accessing the numbers positionally, they could be accessed by attributes major, minor, and patch. The fourth could probably be called build.

(There is also the question of whether the build number 145 in the example you showed should be parsed out, but this could probably be decided later and the feature added if desired. It may also break some code that has relied on the current parsing.)

Thanks for the clarification.

I see what I misunderstood - this issue wasn't about truncating the amount of digits, but to adjust the type annotation to better represent the reality of sometimes only having three digits anyway.

Another option is to go in the opposite direction from Tuple[int, ...] and take this opportunity to make it a named tuple (or union of named tuple types), so that in addition to iterating though it or accessing the numbers positionally, they could be accessed by attributes major, minor, and patch. The fourth could probably be called build.

It would certainly be nice to have named fields for each positional field as well - type-checkers probably will do a better job at explaining the data contained within to the user.

(There is also the question of whether the build number 145 in the example you showed should be parsed out, but this could probably be decided later and the feature added if desired. It may also break some code that has relied on the current parsing.)

I'd prefer to treat this as out-of-scope for this issue, without the immediate need to take on such a change, despite it's potential usefulness.

It would certainly be nice to have named fields for each positional field as well

I am thinking maybe that this test should be narrowed to assert whatever we assume is always true of a real git command's git version output:

GitPython/test/test_git.py

Lines 173 to 174 in afa5754

def test_it_executes_git_and_returns_result(self):
self.assertRegex(self.git.execute(["git", "version"]), r"^git version [\d\.]{2}.*$")

Then the appropriate named tuple type or types can be made--or, alternatively, it can be discovered that this cannot reasonably be done--and the best type annotation can be figured out.

Right now it uses the regular expression:

^git version [\d\.]{2}.*$

It makes sense that this is so general, since the goal seems to be just to verify that git version was really run. But the relationship between what we assume the dynamic version method gives and what information we assume is present in the value of the version_info property seems conceptually strong.

Yet, I do worry about breaking things if people are using weird git builds or their own wrapper scripts. At some point it is unreasonable to have it as a goal to accommodate that, even if it worked before, but I am unsure what that point is. Refreshing doesn't check the version at all, but only that the git version command ran and reported success:

GitPython/git/cmd.py

Lines 379 to 384 in afa5754

has_git = False
try:
cls().version()
has_git = True
except (GitCommandNotFound, PermissionError):
pass

I'm not sure if it's really okay to change that to check something more specific. If not, then I'm not sure how specific the return type of version_info can reasonably be, and more so I'm not sure if it is okay to add validation to the version_info property.

One possibility is to make a variable-length named tuple type that raises the usual IndexError if an attribute like patch doesn't exist because the tuple is not long enough. (Raising IndexError here would maintain backward compatibility at runtime, and also is what would automatically happen if property getters for attributes like patch are implemented straightforwardly in terms of subscripting.)

I had originally thought this bug should be fixed together with #1829, but I'm realizing that, even with these decisions to be made here, this is probably faster and easier to fix and test and could reasonably be done on its own.

I definitely don't have an answer, but would hope that the new way of doing things won't be more restrictive than the old.
With that constraint, anything is possible. But I see how my comment is too general to be helpful :/.

This actually narrows things down quite a bit, because there are only a few ways I know how to do it that avoid being more restrictive than before.

The existing restriction

One already-present restriction is that the output of git version--not the version string part, but the whole output--is expected to have at least three space-separated fields, of which the third is taken to be the version string:

GitPython/git/cmd.py

Lines 817 to 818 in afa5754

process_version = self._call_process("version") # Should be as default *args and **kwargs used.
version_numbers = process_version.split(" ")[2]

If there are fewer than three space-separated fields in the output of git version, then IndexError is raised by the subscripting operation in process_version.split(" ")[2].

Should this be a union?

We do not enforce that the output begins with git version or that any of the fields of the version string can be parsed as numbers. Of particular interest is the case where there are no numbers, because it means we can get an empty tuple back. If this were disallowed, then I think there would be a moderately strong case for using the return annotation:

Union[Tuple[int], Tuple[int, int], Tuple[int, int, int], Tuple[int, int, int, int]]

There is no syntactic reason to shy away from allowing tuples of zero ints, i.e. empty tuples, as well. Just as Tuple[int, int] is really Tuple[(int, int)], we can write Tuple[()] as the static type of empty tuples. So this is okay:

Union[Tuple[()], Tuple[int], Tuple[int, int], Tuple[int, int, int], Tuple[int, int, int, int]]

But the restriction that there are at most four fields seems less important for using the library than knowledge that there are at least some number of fields. As such, expressing this as a union of tuple of types seems like it is not justified if Tuple[()] is included as an alternative. Another way of getting at this is that:

  • Calling code that heeds static type annotations should not have runtime errors that arise due to the specific claims made in the annotations being false.
  • It would be unexpected, and probably undesirable, for a static type checker to treat Git().version_info[0] as a static type error, even though people shouldn't usually write that (and, as noted below, GitPython itself doesn't).

In addition, writing code that returns such a union is cumbersome to do in a way that is fully checkable (i.e., without a cast or other suppressions), because mypy currently does not infer compatibility with such a union from the presence of a :4 slice, nor even from an assertion that the len is <= 4. (It can be done by asserting a condition of five == comparisons joined by or.)

So I think this points toward expressing the type as being a tuple whose length is not statically attested but whose elements are all known to be of type int. The ellipsis syntax achieves this:

Tuple[int, ...]

Why not make it like sys.version_info?

Git.version_info is named analogously to sys.version_info, which is a named tuple that supports the kinds of order comparisons to numeric-typed tuples that are actually useful, but that is not guaranteed to be (and typically is not) fully numeric:

>>> import sys
>>> sys.version_info
sys.version_info(major=3, minor=12, micro=2, releaselevel='final', serial=0)
>>> isinstance(sys.version_info, tuple)
True
>>> list(sys.version_info)
[3, 12, 2, 'final', 0]
>>> sys.version_info >= (3, 12)  # The useful kind of check.
True

However, it might not be straightforward for Git.version_info to work like that, because:

  • While new users/developers who are familiar with sys.version_info may find it intuitive, it would be a breaking change given the long-standing guarantee that each element of version_info on a Git object has type int.
  • Either less is guaranteed about which fields of a git version string are numeric, or the guarantees would need to be researched, considered, and expressed (and perhaps also checked against existing uses of GitPython).

It can be a variable-length named tuple

It is still possible for Git.version_info to be more specific than Tuple[int, ...] if it is effectively a named tuple specialization of Tuple[int, ...] with attributes for common fields (major, minor, patch, and probably build, as discussed above) that raise IndexError when those fields are absent (so build would usually raise). This cannot be implemented with typing.NamedTuple or collections.namedtuple but the implementation would not be especially complicated. The class docstring would also facilitate documenting what is expected.

If done as a named tuple, it should have tests. In addition, it would carry the risk that users/developers would wrongly assume that all or most of the named attributes usually work, even though only the first three are typically present.

(If the first three are truly known always to be present, such that they will be present in all future versions of git too, including prereleases of new major versions, then the requirement to avoid being more restrictive should possibly be revisited.)

A simple, first solution

I think it's useful to have a simple solution that does away with the cast, at the same time as a fix for #1829, so it's easier to notice indications of a regression in mypy output when doing that. So I am inclined to make the return type Tuple[int, ...] initially, and then a more refined solution such as a variable-length named tuple type (or the introduction of restrictions?) could be done at some point in the future.

Does GitPython itself assume restrictions?

GitPython seems not to include code that fails with a hard error if version_info is too short, even if it were zero-length. It uses:

  • Slicing with only an upper bound, which in Python does not enforce that the result be of that length (it can be shorter).
  • Lexicographical comparison, which is well-defined and reasonable even between tuples of different lengths, so long as corresponding elements always support order comparisons (so with tuples each of whose elements is an int, it always works).

It looks like the slicing isn't necessary in the uses of version_info in GitPython. Its presence may reflect an assumption about the length, but I don't think that assumption causes problems. For example, in the strange case that no field of the version string were numeric, the empty tuple would compare as lower version than any non-empty tuple. I think the unnecessary slicing should probably be removed, though I think I'll put that off until sometime later than my first PR fixing version_info-related bugs.

That GitPython seems to work okay in its own use of version_info even if it is shorter than might be expected supports the aspiration you have stated (which I am taking as a requirement) to avoid becoming more restrictive.

Thanks so much for this incredible summary of the whole ordeal, I feel like I can picture the situation much more clearly now.

Please go ahead like you prefer and we take it from there. In case it helps, it's also possible to internally use a new version of this field, version_info2 for instance, and deprecate the old one for perfect backward compatible changes without being held back by decisions of the past.

Please go ahead like you prefer and we take it from there.

I have actually already done so, in that the approach in "A simple, first solution" is included in #1838, along with fixes for the other known version_info-related bugs. (I'm willing to go further, in one direction or another, but I'll wait for feedback on #1838 first.)