ishepard / pydriller

Python Framework to analyse Git repositories

Home Page: http://pydriller.readthedocs.io/en/latest/

Can't iterate over files in angularjs git-repo

Knniff opened this issue

Describe the bug
While trying to iterate over all files in the angularjs git repository (the code used is shown below), Python crashes with this stack trace:

Traceback (most recent call last):
  File "/home/test/project/main.py", line 12, in <module>
    for file in commit.modified_files:
  File "/home/test/project/.venv/lib/python3.10/site-packages/pydriller/domain/commit.py", line 716, in modified_files
    return self._parse_diff(diff_index)
  File "/home/test/project/.venv/lib/python3.10/site-packages/pydriller/domain/commit.py", line 728, in _parse_diff
    "content": self._get_undecoded_content(diff.b_blob),
  File "/home/test/project/.venv/lib/python3.10/site-packages/pydriller/domain/commit.py", line 752, in _get_undecoded_content
    return blob.data_stream.read() if blob is not None else None
  File "/home/test/project/.venv/lib/python3.10/site-packages/git/objects/base.py", line 142, in data_stream
    return self.repo.odb.stream(self.binsha)
  File "/home/test/project/.venv/lib/python3.10/site-packages/git/db.py", line 45, in stream
    hexsha, typename, size, stream = self._git.stream_object_data(bin_to_hex(binsha))
  File "/home/test/project/.venv/lib/python3.10/site-packages/git/cmd.py", line 1400, in stream_object_data
    hexsha, typename, size = self.__get_object_header(cmd, ref)
  File "/home/test/project/.venv/lib/python3.10/site-packages/git/cmd.py", line 1370, in __get_object_header
    return self._parse_object_header(cmd.stdout.readline())
  File "/home/test/project/.venv/lib/python3.10/site-packages/git/cmd.py", line 1331, in _parse_object_header
    raise ValueError("SHA %s could not be resolved, git returned: %r" % (tokens[0], header_line.strip()))
ValueError: SHA b'4e1ebfdefda333354bbda71e172daa5db4808616' could not be resolved, git returned: b'4e1ebfdefda333354bbda71e172daa5db4808616 missing'

This is probably not a bug in PyDriller directly, but because of my limited knowledge of the underlying libraries I couldn't reproduce the error any other way. I tried to reproduce it with the following GitPython code, but got no error:

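from git import Repo

repo = Repo('./angular')
commits = repo.iter_commits()

for commit in commits:
    # Only lists the blobs directly under each commit's root tree;
    # it never diffs the commit or reads any blob content.
    for file in commit.tree.blobs:
        print(file.name)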

To Reproduce
Clone https://github.com/angular/angular and try to iterate over all files with:

import pydriller

for commit in pydriller.Repository("./angular").traverse_commits():
    for file in commit.modified_files:
        print(file.filename, file.change_type)

OS Version:
Linux: Ubuntu 22.04 with Python 3.10.6 and PyDriller 2.4.1

Hi @Knniff! If it can't be reproduced with GitPython, it means it's something related to PyDriller. Thanks for flagging, I'll look into it :)

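The problem can be reproduced in GitPython as well. The commit adds a sub-project (a submodule entry, file mode 160000), so the SHA in the diff refers to a commit in another repository rather than to a blob in this one: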

diff --git a/tools/js2dart b/tools/js2dart
new file mode 160000
index 0000000000000000000000000000000000000000..4e1ebfdefda333354bbda71e172daa5db4808616
--- /dev/null
+++ b/tools/js2dart
@@ -0,0 +1 @@
+Subproject commit 4e1ebfdefda333354bbda71e172daa5db4808616
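As a rough illustration of what distinguishes this diff entry: mode 160000 is the file mode git uses for submodule ("gitlink") entries, so such entries can be told apart by their mode before anything tries to read them as blobs. The sketch below assumes diff objects that expose integer `a_mode` and `b_mode` attributes (GitPython's `Diff` objects are believed to, but verify this against the installed version). The helper names `is_gitlink` and `split_diffs` are made up for this example and are not part of PyDriller or GitPython.

```python
GITLINK_MODE = 0o160000    # file mode git uses for submodule (gitlink) entries
FILE_TYPE_MASK = 0o170000  # the file-type bits of a mode (same bits as stat.S_IFMT)


def is_gitlink(mode):
    """Return True if a git file mode denotes a submodule entry.

    `mode` may be None, e.g. for the missing side of an added or deleted path.
    """
    return mode is not None and (mode & FILE_TYPE_MASK) == GITLINK_MODE


def split_diffs(diffs):
    """Separate ordinary file diffs from submodule pointer changes.

    Hypothetical helper: `diffs` is any iterable of objects with integer
    (or None) `a_mode` and `b_mode` attributes.
    """
    files, gitlinks = [], []
    for diff in diffs:
        if is_gitlink(diff.a_mode) or is_gitlink(diff.b_mode):
            gitlinks.append(diff)
        else:
            files.append(diff)
    return files, gitlinks
```

With a check like this, only the entries in `files` would have their blob content read; the entries in `gitlinks` point at commits that live in the submodule's own repository.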

What I generally do in these cases is run git submodule update --init. Unfortunately, it seems that angular stopped using submodules, so nothing happens.
The only thing you can do at this point is to add a try/catch.

However, you made me notice something: PyDriller probably shouldn't raise an exception in this case.

Would a try/catch mean that the repository gets processed further after the error happens?

Yep that will do 👍
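A minimal sketch of the workaround discussed above, written as Python's try/except. It catches the `ValueError` shown in the stack trace when `commit.modified_files` is accessed, reports the commit, and moves on to the next one. The generator name `iter_modified_files` is hypothetical and not part of PyDriller.

```python
def iter_modified_files(commits):
    """Yield (commit, file) pairs, skipping commits whose diff cannot be read.

    `commits` is any iterable of objects with `hash` and `modified_files`
    attributes, such as the commits PyDriller yields while traversing.
    """
    for commit in commits:
        try:
            # The traceback shows the ValueError being raised on this access,
            # so the access itself has to sit inside the try block.
            files = commit.modified_files
        except ValueError as exc:
            print(f"Skipping commit {commit.hash}: {exc}")
            continue
        for file in files:
            yield commit, file
```

It could be called as iter_modified_files(pydriller.Repository("./angular").traverse_commits()), printing file.filename and file.change_type for each pair. Note that this skips every file of an affected commit, not just the submodule entry.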