Explanation makes unclear reference to "first" and "second" imports
EliahKagan opened this issue · comments
The git.cmd.Git.refresh
class method includes this code:
Lines 478 to 481 in d28c20b
It is unclear what "the difference between a first import and a second import" is referring to. (I should have noticed this issue when working on #1725, in which I actually edited that comment to improve its formatting, capitalization, and punctuation, but I apparently did not notice it at that time.)
That code can run multiple times, accessing the same class-wide state each time, because Git.refresh
can be called multiple times (and, being public, may reasonably be called by code outside of GitPython). But this is not related to how many imports, or import attempts, occur. The comment would make sense if Python's import system worked differently, but seems at odds with the actual design. As I understand it:
-
If an import succeeds, the resulting module object is cached in
sys.modules
. Subsequent imports use the same module object and do not rerun the module's code.For example, suppose
hello.py
contains:print("Hello, world!")
Then running
import hello
from a REPL prints the greeting the first time it is run, but running it subsequent times produces no repeated side effect. -
If an import fails, the new module object, which may itself be incompletely populated, is removed from
sys.modules
, before control returns via the raised exception propagating up the call stack to the code that attempted the import. Even if that module object turns out still to exist, it will not be reused in subsequent attempts to import the code. A subsequent import attempt will rerun the module's code from scratch, creating a new module object which serves as a new namespace. The state in the old module object is not used; global variables of the module, including classes, and thus also including their attributes, are all new.For example, suppose
throw.py
contains:import sys me = sys.modules[__name__] try: me.x += 1 except AttributeError: me.x = 1 try: a = sys.a except AttributeError: a = sys.a = [] a.append(me) raise ValueError(x)
Then running
import throw
from a REPL raisesValueError
with a value of1
each time it is run, andsys.a
is populated with the abandoned module object from each attempt, all of which are separate objects. (Importingsys
in the REPL allows it to be inspected, sinceimport sys
succeeds the first time it occurs, whether that is in the REPL or inthrow.py
, and thus allimport sys
get the samesys
module object with the samea
attribute.) -
If the contents of
sys.modules
are explicitly modified (rather than implicitly, due to the effect ofimport
andfrom
statements), one of the above two situations still applies, depending on the nature of the mutation. The exception to this would be if a reference to the old module object is obtained and patched back intosys.modules
. That would be a very unusual thing to do, except perhaps for purposes of debugging or experimentation, and I'm reasonably sure the intent of that comment--and the code it documents--is not to cover such cases. Furthermore, that would still not rerun top-level code of the module. -
Python's import system is extensible and can be made to work differently from the usual rules. Of course, this cannot generally be anticipated by library code, and I likewise doubt this is a factor in the design or documentation.
I suspect that the code is correct and the comment incorrect, but since I'm not sure what was originally intended, I am not entirely sure how to fix it. It could be changed to say something like "discern the difference between the initial refresh at import time and subsequent calls," but this is very general and I worry it would paper over whatever really ought to be said. On the other hand, perhaps this really is all that was meant.
I haven't looked into the history of the Python import system but I don't think it operated in a fundamentally different way at the time this explanation was written. This code, and the original accompanying comment, was added in #640 (in 2b3e769). The reasoning given in that pull request, for the actual code change, continues to makes sense, and does not seem to have relied on any incorrect--nor any now-outdated--assumptions about the import system. As far as I can tell, it is only the comment that should be improved.
I agree, the comment suffered from 'rot' over time and should be adjusted. I definitely don't recall to have done anything smart about imports, and if so, it's apparently gone.
To check if I was missing anything historical that should be specifically updated, I've just tried the throw.py
experiment described above, but on an Ubuntu 18.04.6 LTS system-provided python2.7
(2.7.17-1~18.04ubuntu1.13+esm4). It behaves the same.
However, in Python 2 there was a reload
builtin. This functionality remains available in Python 3 via importlib.reload
, but it was in practice more often used in Python 2 when it was provided by a builtin. Its behavior is to reload a module, rerunning all its top level statements (thus including class
statements and their top-level statements), but using the same module object as the namespace to write to.
Reloading the git
module reruns its contents but not those of its git.cmd
submodule, and the effect may be what "the difference between a first import and a second import" was describing. In Python 3 (since GitPython doesn't support Python 2 anymore):
$ GIT_PYTHON_GIT_EXECUTABLE=nonexistent-command GIT_PYTHON_REFRESH=warn python
Python 3.12.1 (main, Dec 10 2023, 15:07:36) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import importlib
>>> import git
WARNING: Bad git executable.
The git executable must be specified in one of the following ways:
- be included in your $PATH
- be set via $GIT_PYTHON_GIT_EXECUTABLE
- explicitly set via git.refresh()
All git commands will error until this is rectified.
This initial warning can be silenced or aggravated in the future by setting the
$GIT_PYTHON_REFRESH environment variable. Use one of the following values:
- quiet|q|silence|s|none|n|0: for no warning or exception
- warn|w|warning|1: for a printed warning
- error|e|raise|r|2: for a raised exception
Example:
export GIT_PYTHON_REFRESH=quiet
>>> importlib.reload(git)
Traceback (most recent call last):
File "/home/ek/repos-wsl/GitPython/git/__init__.py", line 140, in <module>
refresh()
File "/home/ek/repos-wsl/GitPython/git/__init__.py", line 127, in refresh
if not Git.refresh(path=path):
^^^^^^^^^^^^^^^^^^^^^^
File "/home/ek/repos-wsl/GitPython/git/cmd.py", line 485, in refresh
raise GitCommandNotFound("git", err)
git.exc.GitCommandNotFound: Cmd('git') not found due to: 'Bad git executable.
The git executable must be specified in one of the following ways:
- be included in your $PATH
- be set via $GIT_PYTHON_GIT_EXECUTABLE
- explicitly set via git.refresh()
'
cmdline: git
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.12/importlib/__init__.py", line 131, in reload
_bootstrap._exec(spec, module)
File "<frozen importlib._bootstrap>", line 866, in _exec
File "<frozen importlib._bootstrap_external>", line 994, in exec_module
File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
File "/home/ek/repos-wsl/GitPython/git/__init__.py", line 142, in <module>
raise ImportError("Failed to initialize: {0}".format(_exc)) from _exc
ImportError: Failed to initialize: Cmd('git') not found due to: 'Bad git executable.
The git executable must be specified in one of the following ways:
- be included in your $PATH
- be set via $GIT_PYTHON_GIT_EXECUTABLE
- explicitly set via git.refresh()
'
cmdline: git
If this (but with the more commonly used reload
builtin) is what "the difference between a first import and a second import" referred to, then I think it is okay to go ahead with changing the wording to something like "the difference between the initial refresh at import time and subsequent calls." I've proposed a minor variation on that in #1810.