chaoss / grimoirelab-perceval

Send Sir Perceval on a quest to retrieve and gather data from software repositories.

Home Page: http://perceval.readthedocs.io/

[Error] rate limit

anajsana opened this issue

I've used the perceval.backends.core.github Python library to fetch GitHub data from a set of 26 repos. When executing the script, I got a rate limit error:

anajs@anajsana:~/Desktop/TFM$ python opendistro-git.py 
https://github.com/opendistro-for-elasticsearch/security.git
WARNING:perceval.backends.core.github:Rate limit not initialized: 401 Client Error: Unauthorized for url: https://api.github.com/rate_limit

It seems the rate limit error appears from the very beginning, so I don't understand the reason. It might be because I'm not giving all the required token permissions? Permissions I gave were: admin:org, admin:public_key, read:packages, repo and user.

This is the script I'm using:

import requests
import time
import json
import logging
from datetime import datetime 

from perceval.backends.core.git import Git
from perceval.backends.core.github import GitHub

def extract_data(repo):

    data_repo = GitHub(owner="opendistro-for-elasticsearch", repository=repo, api_token="xxxxx")
    for item in data_repo.fetch(from_date=datetime.strptime('2020-01-01', '%Y-%m-%d')):
        print(item)
        return

def owner_repositories(name):
    query = "org:{}".format(name)
    page = 1
    repositories = []

    r = requests.get('https://api.github.com/search/repositories?q={}&page={}'.format(query, page))
    items = r.json()['items']

    while len(items) > 0:
        # The GitHub API is rate limited, so pause for 3 seconds between requests
        time.sleep(3)
        for item in items:
            logging.info("Adding {} repository".format(item['clone_url']))
            repositories.append(item['clone_url'])
        page += 1
        r = requests.get('https://api.github.com/search/repositories?q={}&page={}'.format(query, page))
        items = r.json()['items']

    return repositories

def main():
    repos = owner_repositories('opendistro-for-elasticsearch')
    for repo in repos:
        #print(repo)
        extract_data(repo)

if __name__ == '__main__':
    main()

Hi @anajsana
I think the access token should be passed as a list of tokens. Ref: github.py#L82

So, the solution to your problem would be

def extract_data(repo):

-   data_repo = GitHub(owner="opendistro-for-elasticsearch", repository=repo, api_token="xxxxx")
+   data_repo = GitHub(owner="opendistro-for-elasticsearch", repository=repo, api_token=["token1","token2"])
    for item in data_repo.fetch(from_date=datetime.strptime('2020-01-01', '%Y-%m-%d')):
        print(item)
        return

I guess this should solve the issue.

EDIT: I just tried the solution on your script and it worked. 😄
Please let me know how it goes for you.
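
A minimal sketch of why a bare string could fail, assuming the backend iterates over api_token as a sequence of tokens (as the reference above suggests; this is not verified against every Perceval version). Iterating over a Python string yields single characters, so each character would be treated as a separate, invalid token. normalize_tokens is a hypothetical helper for illustration and is not part of Perceval.

```python
def normalize_tokens(api_token):
    """Return api_token as a list of token strings.

    Hypothetical helper (not part of Perceval): wraps a single
    string in a list so it is never iterated character by character.
    """
    if api_token is None:
        return []
    if isinstance(api_token, str):
        return [api_token]
    return list(api_token)


# Iterating over a bare string yields characters, not one token.
print(list("abc"))              # ['a', 'b', 'c']
print(normalize_tokens("abc"))  # ['abc']
```

With such a helper, a call like api_token=normalize_tokens("xxxxx") would hand the backend a one-element list instead of a plain string.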

Also, regarding permissions:

It might be because I'm not giving all the required token permissions? Permissions I gave were: admin:org, admin:public_key, read:packages, repo and user.

I think the repo scope is enough for Perceval to retrieve the data.

In addition to what @vchrombie said, I would recommend using the sleep_for_rate option. This will pause the collection until the rate limit is reset.

data_repo = GitHub(owner="opendistro-for-elasticsearch", repository=repo, api_token=["<token>"], sleep_for_rate=True)

Note that if you activate this option, you may also need to set the sleep_time parameter (if I'm not wrong, the GitHub API limit is reset every hour).
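
To illustrate the idea of pausing until the limit resets, here is a rough sketch. It is not Perceval's implementation. It assumes the GitHub REST API response headers X-RateLimit-Remaining and X-RateLimit-Reset (the latter as Unix epoch seconds); seconds_until_reset is a hypothetical helper name.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how many seconds to wait before the rate limit resets.

    Hypothetical helper: returns 0 when requests remain or when the
    reset time has already passed.
    """
    now = time.time() if now is None else now
    # Assumed defaults: treat missing headers as "not rate limited".
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    reset_at = int(headers.get("X-RateLimit-Reset", 0))
    if remaining > 0:
        return 0
    return max(0, reset_at - int(now))


# Example with made-up header values: limit exhausted, reset 600 s away.
example_headers = {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1000"}
print(seconds_until_reset(example_headers, now=400))  # 600
```

A collector could call time.sleep() with this value before retrying, which is roughly what an option like sleep_for_rate saves you from writing by hand.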

yes! It worked 👍 Thank you so much for your help and quick response @vchrombie and @mafesan