"only_modifications_with_file_types" parameter causes "SHA ... could not be resolved" error when traversing commits
MichaelClasby0 opened this issue · comments
ValueError: SHA b'55ad57a235c009d0414aed1781072adda0c89137' could not be resolved, git returned: b'55ad57a235c009d0414aed1781072adda0c89137 missing'
Bug Report:
Description:
The code below uses the pydriller library to traverse commits in the TensorFlow repository and print the hashes of commits that modify files with the ".cpp" extension. Running it raises an error saying that a specific SHA could not be resolved. This is the smallest reproducible example I could make. Removing the only_modifications_with_file_types argument avoids the error. I see the same error for a number of different public repositories and commits. It's worth noting that the SHA it tries to read does not appear to be a valid commit (maybe it once was?).
For example:
tensorflow/tensorflow@55ad57a -> 404
This is particularly troublesome because I cannot simply wrap the failing commit in a try/except: the error is raised inside the generator, before the commit is ever yielded to my loop.
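To illustrate why catching the error per commit does not help, here is a minimal sketch in plain Python with no pydriller involved. The `commits` generator and its fake SHAs are hypothetical stand-ins. Once an exception escapes a generator, the generator is finished, so the items after the failing one are never produced.

```python
def commits():
    """Hypothetical stand-in for a commit generator; the second item fails."""
    for sha in ["aaa", "bbb", "ccc"]:
        if sha == "bbb":
            # Simulates the failure happening inside the generator itself.
            raise ValueError(f"SHA {sha} could not be resolved")
        yield sha


seen = []
gen = commits()
while True:
    try:
        seen.append(next(gen))
    except ValueError as exc:
        # Catching the error here does not revive the generator:
        # the next call to next() raises StopIteration.
        print("caught:", exc)
        continue
    except StopIteration:
        break

# Only the item before the failure was ever yielded; "ccc" is lost.
print(seen)
```

So wrapping the loop body, or even each `next()` call, only tells you that the traversal stopped early. It cannot skip the bad item and carry on.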
Steps to Reproduce:
- Run the provided code using the pydriller library version 2.4.1 and Python version 3.10.6.
import platform
import pydriller
from pydriller import Repository
print("pydriller version:", pydriller.__version__)
print("python version:", platform.python_version())
TENSORFLOW_REPO = "https://github.com/tensorflow/tensorflow.git"
for commit in Repository(TENSORFLOW_REPO, only_modifications_with_file_types=[".cpp"]).traverse_commits():
    print(commit.hash)
Expected Output:
The expected output should be a list of commit hashes for modifications made to ".cpp" files in the TensorFlow repository.
Actual Output:
The actual output is as follows:
pydriller version: 2.4.1
python version: 3.10.6
Traceback (most recent call last):
  File "/home/*****/*****/*****/*****/*****/pydriller_test.py", line 16, in <module>
  File "/home/*****/.local/lib/python3.10/site-packages/pydriller/repository.py", line 236, in traverse_commits
    for commit in job:
  File "/home/*****/.local/lib/python3.10/site-packages/pydriller/repository.py", line 242, in _iter_commits
    if self._conf.is_commit_filtered(commit):
  File "/home/*****/.local/lib/python3.10/site-packages/pydriller/utils/conf.py", line 270, in is_commit_filtered
    if not self._has_modification_with_file_type(commit):
  File "/home/*****/.local/lib/python3.10/site-packages/pydriller/utils/conf.py", line 285, in _has_modification_with_file_type
    for mod in commit.modified_files:
  File "/home/*****/.local/lib/python3.10/site-packages/pydriller/domain/commit.py", line 716, in modified_files
    return self._parse_diff(diff_index)
  File "/home/*****/.local/lib/python3.10/site-packages/pydriller/domain/commit.py", line 728, in _parse_diff
    "content": self._get_undecoded_content(diff.b_blob),
  File "/home/*****/.local/lib/python3.10/site-packages/pydriller/domain/commit.py", line 752, in _get_undecoded_content
    return blob.data_stream.read() if blob is not None else None
  File "/home/*****/.local/lib/python3.10/site-packages/git/objects/base.py", line 142, in data_stream
    return self.repo.odb.stream(self.binsha)
  File "/home/*****/.local/lib/python3.10/site-packages/git/db.py", line 45, in stream
    hexsha, typename, size, stream = self._git.stream_object_data(bin_to_hex(binsha))
  File "/home/*****/.local/lib/python3.10/site-packages/git/cmd.py", line 1400, in stream_object_data
    hexsha, typename, size = self.__get_object_header(cmd, ref)
  File "/home/*****/.local/lib/python3.10/site-packages/git/cmd.py", line 1370, in __get_object_header
    return self._parse_object_header(cmd.stdout.readline())
  File "/home/*****/.local/lib/python3.10/site-packages/git/cmd.py", line 1331, in _parse_object_header
    raise ValueError("SHA %s could not be resolved, git returned: %r" % (tokens[0], header_line.strip()))
ValueError: SHA b'55ad57a235c009d0414aed1781072adda0c89137' could not be resolved, git returned: b'55ad57a235c009d0414aed1781072adda0c89137 missing'
I thought it would be worth checking the latest commit of pydriller, and it seems to work fine there.
Any plans for a release bump?
In the meantime I will install directly from that commit:
pip install git+https://github.com/ishepard/pydriller.git@241edcb413a681a2ceea27f9c711fc4f214dbafd
Hi! Thanks for posting!
Interesting, apparently my last change (kind of) fixed the issue. The problem is reading the source code of a file in that specific commit (the file probably belongs to a submodule that doesn't exist anymore, hence the error).
Now I fetch the source code ONLY IF it is requested, so if you do access it, you'll still get the error.
But since the snippet you shared doesn't, it gets through.
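The "only if requested" behaviour can be pictured with a small lazy-loading sketch. This is not PyDriller's actual code: `LazyFile`, `missing_blob`, and the file names are hypothetical, and it assumes only that content is fetched on first access rather than when the object is built.

```python
class LazyFile:
    """Hypothetical modified-file object whose content is loaded on demand."""

    def __init__(self, filename, loader):
        self.filename = filename
        self._loader = loader      # callable that reads the blob
        self._content = None
        self._loaded = False

    @property
    def content(self):
        # The loader runs only the first time the content is requested.
        if not self._loaded:
            self._content = self._loader()
            self._loaded = True
        return self._content


def missing_blob():
    # Simulates a blob that git can no longer resolve.
    raise ValueError("SHA could not be resolved")


files = [
    LazyFile("kernel.cpp", missing_blob),
    LazyFile("README.md", lambda: b"docs"),
]

# Filtering by extension touches only the filename, so no blob is read.
cpp_files = [f.filename for f in files if f.filename.endswith(".cpp")]
print(cpp_files)

# Asking for the content of the broken file still fails.
try:
    files[0].content
    raised = False
except ValueError:
    raised = True
print("content access raised:", raised)
```

Under this assumption a file-type filter never triggers the failing read, which matches the observation that the snippet in this issue now runs through, while code that reads the file's source would still hit the error.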
Anyway, since it solves an issue I will publish a new version today 👍
Released 2.5, let me know :)