gitpython-developers / GitPython

The git.cmd.Git.refresh class method includes this code:

Lines 478 to 481 in d28c20b

    
           # We get here if this was the init refresh and the refresh mode was not 
        
           # error. Go ahead and set the GIT_PYTHON_GIT_EXECUTABLE such that we 
        
           # discern the difference between a first import and a second import. 
        
           cls.GIT_PYTHON_GIT_EXECUTABLE = cls.git_exec_name

It is unclear what "the difference between a first import and a second import" is referring to. (I should have noticed this issue when working on #1725, in which I actually edited that comment to improve its formatting, capitalization, and punctuation, but I apparently did not notice it at that time.)

That code can run multiple times, accessing the same class-wide state each time, because Git.refresh can be called multiple times (and, being public, may reasonably be called by code outside of GitPython). But this is not related to how many imports, or import attempts, occur. The comment would make sense if Python's import system worked differently, but seems at odds with the actual design. As I understand it:

If an import succeeds, the resulting module object is cached in sys.modules. Subsequent imports use the same module object and do not rerun the module's code.

For example, suppose hello.py contains:
```
print("Hello, world!")
```
Then running import hello from a REPL prints the greeting the first time it is run, but running it subsequent times produces no repeated side effect.
If an import fails, the new module object, which may itself be incompletely populated, is removed from sys.modules, before control returns via the raised exception propagating up the call stack to the code that attempted the import. Even if that module object turns out still to exist, it will not be reused in subsequent attempts to import the code. A subsequent import attempt will rerun the module's code from scratch, creating a new module object which serves as a new namespace. The state in the old module object is not used; global variables of the module, including classes, and thus also including their attributes, are all new.

For example, suppose throw.py contains:
```
import sys

me = sys.modules[__name__]
try:
    me.x += 1
except AttributeError:
    me.x = 1

try:
    a = sys.a
except AttributeError:
    a = sys.a = []
a.append(me)

raise ValueError(x)
```
Then running import throw from a REPL raises ValueError with a value of 1 each time it is run, and sys.a is populated with the abandoned module object from each attempt, all of which are separate objects. (Importing sys in the REPL allows it to be inspected, since import sys succeeds the first time it occurs, whether that is in the REPL or in throw.py, and thus all import sys get the same sys module object with the same a attribute.)
If the contents of sys.modules are explicitly modified (rather than implicitly, due to the effect of import and from statements), one of the above two situations still applies, depending on the nature of the mutation. The exception to this would be if a reference to the old module object is obtained and patched back into sys.modules. That would be a very unusual thing to do, except perhaps for purposes of debugging or experimentation, and I'm reasonably sure the intent of that comment--and the code it documents--is not to cover such cases. Furthermore, that would still not rerun top-level code of the module.
Python's import system is extensible and can be made to work differently from the usual rules. Of course, this cannot generally be anticipated by library code, and I likewise doubt this is a factor in the design or documentation.

I suspect that the code is correct and the comment incorrect, but since I'm not sure what was originally intended, I am not entirely sure how to fix it. It could be changed to say something like "discern the difference between the initial refresh at import time and subsequent calls," but this is very general and I worry it would paper over whatever really ought to be said. On the other hand, perhaps this really is all that was meant.

I haven't looked into the history of the Python import system but I don't think it operated in a fundamentally different way at the time this explanation was written. This code, and the original accompanying comment, was added in #640 (in 2b3e769). The reasoning given in that pull request, for the actual code change, continues to makes sense, and does not seem to have relied on any incorrect--nor any now-outdated--assumptions about the import system. As far as I can tell, it is only the comment that should be improved.

I agree, the comment suffered from 'rot' over time and should be adjusted. I definitely don't recall to have done anything smart about imports, and if so, it's apparently gone.

To check if I was missing anything historical that should be specifically updated, I've just tried the throw.py experiment described above, but on an Ubuntu 18.04.6 LTS system-provided python2.7 (2.7.17-1~18.04ubuntu1.13+esm4). It behaves the same.

However, in Python 2 there was a reload builtin. This functionality remains available in Python 3 via importlib.reload, but it was in practice more often used in Python 2 when it was provided by a builtin. Its behavior is to reload a module, rerunning all its top level statements (thus including class statements and their top-level statements), but using the same module object as the namespace to write to.

Reloading the git module reruns its contents but not those of its git.cmd submodule, and the effect may be what "the difference between a first import and a second import" was describing. In Python 3 (since GitPython doesn't support Python 2 anymore):

$ GIT_PYTHON_GIT_EXECUTABLE=nonexistent-command GIT_PYTHON_REFRESH=warn python
Python 3.12.1 (main, Dec 10 2023, 15:07:36) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import importlib

>>> import git
WARNING: Bad git executable.
The git executable must be specified in one of the following ways:
    - be included in your $PATH
    - be set via $GIT_PYTHON_GIT_EXECUTABLE
    - explicitly set via git.refresh()

All git commands will error until this is rectified.

This initial warning can be silenced or aggravated in the future by setting the
$GIT_PYTHON_REFRESH environment variable. Use one of the following values:
    - quiet|q|silence|s|none|n|0: for no warning or exception
    - warn|w|warning|1: for a printed warning
    - error|e|raise|r|2: for a raised exception

Example:
    export GIT_PYTHON_REFRESH=quiet

>>> importlib.reload(git)
Traceback (most recent call last):
  File "/home/ek/repos-wsl/GitPython/git/__init__.py", line 140, in <module>
    refresh()
  File "/home/ek/repos-wsl/GitPython/git/__init__.py", line 127, in refresh
    if not Git.refresh(path=path):
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ek/repos-wsl/GitPython/git/cmd.py", line 485, in refresh
    raise GitCommandNotFound("git", err)
git.exc.GitCommandNotFound: Cmd('git') not found due to: 'Bad git executable.
The git executable must be specified in one of the following ways:
    - be included in your $PATH
    - be set via $GIT_PYTHON_GIT_EXECUTABLE
    - explicitly set via git.refresh()
'
  cmdline: git

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.12/importlib/__init__.py", line 131, in reload
    _bootstrap._exec(spec, module)
  File "<frozen importlib._bootstrap>", line 866, in _exec
  File "<frozen importlib._bootstrap_external>", line 994, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/home/ek/repos-wsl/GitPython/git/__init__.py", line 142, in <module>
    raise ImportError("Failed to initialize: {0}".format(_exc)) from _exc
ImportError: Failed to initialize: Cmd('git') not found due to: 'Bad git executable.
The git executable must be specified in one of the following ways:
    - be included in your $PATH
    - be set via $GIT_PYTHON_GIT_EXECUTABLE
    - explicitly set via git.refresh()
'
  cmdline: git

If this (but with the more commonly used reload builtin) is what "the difference between a first import and a second import" referred to, then I think it is okay to go ahead with changing the wording to something like "the difference between the initial refresh at import time and subsequent calls." I've proposed a minor variation on that in #1810.

	# We get here if this was the init refresh and the refresh mode was not
	# error. Go ahead and set the GIT_PYTHON_GIT_EXECUTABLE such that we
	# discern the difference between a first import and a second import.
	cls.GIT_PYTHON_GIT_EXECUTABLE = cls.git_exec_name

Explanation makes unclear reference to "first" and "second" imports