ishepard / pydriller

Python Framework to analyse Git repositories

Home Page: http://pydriller.readthedocs.io/en/latest/

Memory leak over large repos

andreagurioli1995 opened this issue · comments

Describe the bug
I found that the process, when it comes to retrieving the modified files of a commit, starts requiring a massive amount of memory. To be specific, after profiling with memray, the offending section is "return diff.data_stream.read().decode("utf-8", "ignore")", i.e. the decoding of the diffs.

To Reproduce
Here is an easy snippet to reproduce the issue.

from pydriller import Repository
url = "https://github.com/mozilla/addons-server"
for commit in Repository(path_to_repo=url, only_modifications_with_file_types=[".py"], num_workers=1).traverse_commits():
    a = commit.modified_files

OS Version:
Linux

Hey @andreagurioli1995!
Interesting.. would you mind sharing more data on this?
How big is the repo? How much memory is being used?

I remember I had a problem in the past where I was keeping an instance of the commit active, so I was putting everything in memory. It shouldn't be like this anymore, so I am not sure where to start from 😄

The repo used is this one: https://github.com/mozilla/addons-server (55,161 commits as of 09/01/2022 19:05 GMT+1), with the settings shown above! The version of pydriller used is 2.1.
During the memory profiling with https://github.com/bloomberg/memray I measured an overall usage of 62.4 GB. I have attached a screenshot of the memory profile, with the RAM usage at the salient points marked in red.
[Screenshot: memray memory profile, 2022-11-09]
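For reference, a minimal sketch of how such a profile can be collected with memray's Tracker API (the output file name and the exact loop below are illustrative, not the original setup):

import memray
from pydriller import Repository

url = "https://github.com/mozilla/addons-server"

# Everything allocated inside this block is recorded into the capture file,
# which can then be rendered with: python -m memray flamegraph pydriller_profile.bin
with memray.Tracker("pydriller_profile.bin"):
    for commit in Repository(path_to_repo=url, num_workers=1).traverse_commits():
        _ = commit.modified_files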

I also had the same problem. My setup: pydriller==2.1, Python 3.9.9, Linux 5.10.0-60.18.0.50.h322_1.hce2.x86_64.
I have tried manipulating the commit object in a separate process and then freeing it, but memory usage still keeps growing. My code:

import json
from multiprocessing import Process

from pydriller import Repository

# copy2dict, get_meta, result_file_path, repo_path, branch, old_commit and new_commit
# are defined elsewhere in my script.


def get_commit_data(commit, meta: dict, file_num):
    # Copy the commit's attributes into a plain dict and write it out as JSON,
    # so the Commit object itself does not need to be kept alive.
    result: dict = {}
    copy2dict(commit, result)
    target_file = f"{result_file_path}/result_{file_num}.json"
    f = open(target_file, mode='w')
    meta['rawData'] = result
    data = [meta]
    f.write(json.dumps(data))
    result = None
    meta = None
    data = None
    f.flush()
    f.close()
    print(f"written commit:{commit.hash} to file:{target_file}")
    commit = None


def main():
    meta = get_meta()
    file_num = 0
    for commit in Repository(path_to_repo=repo_path, only_in_branch=branch, from_commit=old_commit,
                             to_commit=new_commit).traverse_commits():
        # Handle each commit in a separate process, hoping its memory is released when the process exits.
        p = Process(target=get_commit_data, args=(commit, meta.copy(), file_num))
        p.start()
        p.join()
        p.close()
        commit = None
        file_num += 1


if __name__ == '__main__':
    main()

In the "copy2dict" method, I copy all of the commit's attributes into the result dict (including nested attributes, via DFS).
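For illustration, a hypothetical sketch of what such a copy2dict could look like (the actual implementation was not shared; the depth limit below is an added safeguard against cyclic references, not part of the original description):

def copy2dict(obj, result: dict, depth: int = 0, max_depth: int = 3):
    # Recursively copy an object's public attributes into a plain dict (DFS).
    # Beyond max_depth, values are stringified instead of recursed into.
    for name in dir(obj):
        if name.startswith("_"):
            continue
        try:
            value = getattr(obj, name)
        except Exception:
            continue
        if callable(value):
            continue
        if isinstance(value, (str, int, float, bool, type(None))):
            result[name] = value
        elif isinstance(value, (list, tuple, set)):
            result[name] = [str(item) for item in value]
        elif depth < max_depth:
            nested: dict = {}
            copy2dict(value, nested, depth + 1, max_depth)
            result[name] = nested
        else:
            result[name] = str(value)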

So I looked a bit into this, and there is definitely something wrong with the commit object and modified files.

For each commit I cache the list of modified files (https://github.com/ishepard/pydriller/blob/master/pydriller/domain/commit.py#L509). I do it so that consecutive calls to modified_files don't need to recompute the diff every time, since it's very expensive.
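In essence, the caching being described is the usual memoized-property pattern. A simplified sketch (not PyDriller's actual class; _compute_diffs stands in for the real diff computation):

class Commit:
    def __init__(self):
        self._modified_files = None

    @property
    def modified_files(self):
        # Compute the expensive git diff only once, then reuse the stored result
        # on every subsequent access.
        if self._modified_files is None:
            self._modified_files = self._compute_diffs()
        return self._modified_files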

However, by keeping that cache I get huge memory consumption. I don't know why, since the commit object should be garbage collected once we move to a new commit; there is no reference back to the object. Apparently I'm wrong 😄

I tested it by deleting that line, and now there is little to zero memory consumption.

Ok, I know why; not sure how I didn't see it before. When I added multi-thread support, I transformed my commit generator into a list. Hence, all commits stay referenced, even after being analyzed.
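A minimal illustration of the difference (not PyDriller's internals): iterating a generator lets each commit be freed after it is consumed, while materializing it into a list keeps every commit alive for the whole run.

class FakeCommit:
    # Hypothetical heavy object standing in for a PyDriller Commit.
    def __init__(self, i):
        self.payload = bytearray(1_000_000)  # ~1 MB to make the effect visible
        self.index = i

def commit_generator(n=1000):
    # Stand-in for traverse_commits(): yields commits one at a time.
    for i in range(n):
        yield FakeCommit(i)

# Lazy iteration: only one commit is alive at a time; each becomes garbage
# after its loop iteration, so memory stays flat.
for commit in commit_generator():
    pass

# Eager materialization: the list references every commit, so none can be
# collected until the list itself is dropped. This is the pattern that caused the leak.
commits = list(commit_generator())
for commit in commits:
    pass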

Need to change that. It looks like it's a bit complicated; I'll need to put some work into it, and my Python skills are a bit rusty these days.

The problem should be solved now, I will release a new version soon 😄 feel free to test it on master

Tested with the same code using pydriller's master branch, and now it works perfectly, thanks!