ishepard / pydriller

Python Framework to analyse Git repositories

Home Page: http://pydriller.readthedocs.io/en/latest/


"only_modifications_with_file_types" parameter causes "SHA ... could not be resolved" error when traversing commits

MichaelClasby0 opened this issue

ValueError: SHA b'55ad57a235c009d0414aed1781072adda0c89137' could not be resolved, git returned: b'55ad57a235c009d0414aed1781072adda0c89137 missing'

Bug Report:

Description:
The code below uses pydriller to traverse commits in the TensorFlow repository and print the hashes of commits that modify files with the ".cpp" extension. Running it raises an error saying that a specific SHA (commit hash) could not be resolved. This is the smallest reproducible example I could make. Removing the only_modifications_with_file_types argument avoids the error. I see the same error with a number of different public repositories and commits. It's worth noting that the commit it attempts to read is not actually valid (maybe it once was?), e.g.
tensorflow/tensorflow@55ad57a -> 404

This is particularly troublesome because I cannot simply wrap the failing commit in a try/except: the error is raised inside the generator.
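To illustrate why catching the error around the loop does not help, here is a minimal sketch in plain Python. It does not use pydriller: `flaky_commits` is a hypothetical stand-in for a commit generator that fails partway through, and `drain` is an illustrative helper. Once an exception escapes a generator, the generator is finished, so the remaining items cannot be recovered.

```python
def flaky_commits():
    """Hypothetical stand-in for a commit generator that fails mid-iteration."""
    yield "commit-1"
    raise ValueError("SHA could not be resolved")
    yield "commit-2"  # never reached: the generator dies at the raise above


def drain(gen):
    """Pull items one by one, catching errors raised inside the generator."""
    seen, errors = [], []
    while True:
        try:
            seen.append(next(gen))
        except StopIteration:
            # Either the generator ended normally or it was killed by an error.
            break
        except ValueError as exc:
            errors.append(str(exc))
            # Looping again does not resume the generator; the next call to
            # next() raises StopIteration because the generator is exhausted.
    return seen, errors


seen, errors = drain(flaky_commits())
print(seen)    # items yielded before the failure
print(errors)  # the single error that ended iteration
```

The error can be caught, but "commit-2" is never produced, which is why the failure has to be avoided inside the library rather than handled by the caller.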

Steps to Reproduce:

  1. Run the following code with pydriller version 2.4.1 and Python version 3.10.6.
import platform
import pydriller
from pydriller import Repository

print("pydriller version:", pydriller.__version__)
print("python version:", platform.python_version())

TENSORFLOW_REPO = "https://github.com/tensorflow/tensorflow.git"

for commit in Repository(TENSORFLOW_REPO, only_modifications_with_file_types=[".cpp"]).traverse_commits():
    print(commit.hash)

Expected Output:
The expected output should be a list of commit hashes for modifications made to ".cpp" files in the TensorFlow repository.

Actual Output:
The actual output is as follows:

pydriller version: 2.4.1
python version: 3.10.6
Traceback (most recent call last):
  File "/home/*****/*****/*****/*****/*****/pydriller_test.py", line 16, in <module>
  File "/home/*****/.local/lib/python3.10/site-packages/pydriller/repository.py", line 236, in traverse_commits
    for commit in job:
  File "/home/*****/.local/lib/python3.10/site-packages/pydriller/repository.py", line 242, in _iter_commits
    if self._conf.is_commit_filtered(commit):
  File "/home/*****/.local/lib/python3.10/site-packages/pydriller/utils/conf.py", line 270, in is_commit_filtered
    if not self._has_modification_with_file_type(commit):
  File "/home/*****/.local/lib/python3.10/site-packages/pydriller/utils/conf.py", line 285, in _has_modification_with_file_type
    for mod in commit.modified_files:
  File "/home/*****/.local/lib/python3.10/site-packages/pydriller/domain/commit.py", line 716, in modified_files
    return self._parse_diff(diff_index)
  File "/home/*****/.local/lib/python3.10/site-packages/pydriller/domain/commit.py", line 728, in _parse_diff
    "content": self._get_undecoded_content(diff.b_blob),
  File "/home/*****/.local/lib/python3.10/site-packages/pydriller/domain/commit.py", line 752, in _get_undecoded_content
    return blob.data_stream.read() if blob is not None else None
  File "/home/*****/.local/lib/python3.10/site-packages/git/objects/base.py", line 142, in data_stream
    return self.repo.odb.stream(self.binsha)
  File "/home/*****/.local/lib/python3.10/site-packages/git/db.py", line 45, in stream
    hexsha, typename, size, stream = self._git.stream_object_data(bin_to_hex(binsha))
  File "/home/*****/.local/lib/python3.10/site-packages/git/cmd.py", line 1400, in stream_object_data
    hexsha, typename, size = self.__get_object_header(cmd, ref)
  File "/home/*****/.local/lib/python3.10/site-packages/git/cmd.py", line 1370, in __get_object_header
    return self._parse_object_header(cmd.stdout.readline())
  File "/home/*****/.local/lib/python3.10/site-packages/git/cmd.py", line 1331, in _parse_object_header
    raise ValueError("SHA %s could not be resolved, git returned: %r" % (tokens[0], header_line.strip()))
ValueError: SHA b'55ad57a235c009d0414aed1781072adda0c89137' could not be resolved, git returned: b'55ad57a235c009d0414aed1781072adda0c89137 missing'

I thought it would be worth checking the latest commit, and it seems to work fine there.
Any plans for a new release?

In the meantime I will install directly from that commit:
pip install git+https://github.com/ishepard/pydriller.git@241edcb413a681a2ceea27f9c711fc4f214dbafd

Hi! Thanks for posting!

Interesting. Apparently my last change (kind of) fixed the issue. The problem is reading the source code of a file at that specific commit (the file probably belongs to a submodule that doesn't exist anymore, hence the error).
Now I fetch the source code ONLY IF it is requested, so if you do ask for it, you'll still get the error.
But since the snippet you shared doesn't ask for it, it gets through.
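The "only if requested" behaviour can be sketched roughly as follows. This is not PyDriller's actual implementation: `LazyFile`, its `loader` argument, and `missing_blob` are hypothetical names used only to show the idea that a filter on the file name never touches the blob, while reading the content still fails.

```python
class LazyFile:
    """Hypothetical modified-file object that loads its content on demand."""

    def __init__(self, filename, loader):
        self.filename = filename
        self._loader = loader  # callable that reads the blob; may raise
        self._content = None
        self._loaded = False

    @property
    def content(self):
        # The blob is read only on first access, then cached.
        if not self._loaded:
            self._content = self._loader()
            self._loaded = True
        return self._content


def missing_blob():
    """Simulates a blob that git can no longer resolve."""
    raise ValueError("SHA could not be resolved")


f = LazyFile("kernel.cpp", missing_blob)

# Filtering by file type only looks at the name, so the loader is never called.
has_cpp = f.filename.endswith(".cpp")

# Asking for the content triggers the loader, and the error surfaces here.
try:
    f.content
    failed = False
except ValueError:
    failed = True

print(has_cpp, failed)
```

Under this scheme a file-type filter succeeds on the problematic commit, and the error appears only for callers that actually read the file's source.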

Anyway, since it solves an issue I will publish a new version today 👍

Released 2.5, let me know :)