simonrenger / collect-data-from-github

A small collection of tools to help you collect data from GitHub for research

collect-data-from-github

A small collection of tools to help you collect data from GitHub for research. This tool is based on my blog post: Systematic review of repositories on GitHub with python (Game Dev Style)

Note: This repository is inspired by work from the Department of Information and Computing Sciences, Utrecht University: A Systematic Review of Open Source Clinical Software on GitHub for Improving Software Reuse in Smart Healthcare by Zhengru Shen and Marco Spruit.

Install

$ git clone https://github.com/simonrenger/collect-data-from-github.git
$ pip install PyGithub
$ pip install pandas

How to use the collect.py tool

Call the help function:

python collect.py --help

You need to provide a config.json file:

| Field | Type | Optional | Description |
|---|---|---|---|
| token | string | Yes | A valid GitHub token. You can obtain one here: Settings/Token. Scopes: repo. If omitted, the token must be passed with --token {TOKEN}. |
| readme_dir | string | Yes | If present, the tool automatically downloads GitHub README files into this location. |
| output | string | Yes | If present, the tool stores the collected data in this location. Default: ./ |
| format | string | Yes | Determines the output format. Valid values: JSON, CSV, HTML, MARKDOWN. Default: CSV |
| criteria | object | No | Must contain an entry called time with the fields min or max. |
| terms | array | No | List of search terms that follow the GitHub search syntax. See: Understanding the search syntax |
| attrs | array | No | List of attributes to take from the repository object of the GitHub REST API. |

Note: There is a sample config in the samples folder
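
As a rough illustration of the fields listed above, the following sketch builds a minimal configuration and prints it as JSON. The search term, the output paths, and in particular the date format used for time.min and time.max are assumptions made for this example; the sample config in the samples folder shows the format the tool actually expects.

```python
import json

# Hypothetical values; only the field names come from the field list above.
config = {
    "output": "./results",      # optional, defaults to ./
    "format": "CSV",            # optional: JSON, CSV, HTML or MARKDOWN
    "readme_dir": "./readmes",  # optional, enables README download
    "criteria": {
        # Assumed date format: the README only states that `time`
        # has the fields `min` or `max`.
        "time": {"min": "2019-01-01", "max": "2020-01-01"},
    },
    # Search terms written in GitHub search syntax.
    "terms": ["game engine language:cpp"],
    # Field names of the repository object returned by the GitHub REST API.
    "attrs": ["full_name", "html_url", "stargazers_count"],
}

print(json.dumps(config, indent=2))
```

The token field is left out here on purpose, so this config would have to be combined with --token on the command line. Save the printed JSON as config.json to try it.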

The help output lists the available options. The quickest way to run the tool is to pass the config file directly:

python collect.py config.json

And if you want to pass a token along:

python collect.py --token my_token config.json 
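
To make the interplay between the token field and the --token flag more concrete, here is a minimal sketch of how a command line like the one above could be parsed with argparse. This is not the code of collect.py: the function names parse_args and resolve_token are hypothetical, and letting the command-line token take precedence over the config file is an assumption.

```python
import argparse


def parse_args(argv=None):
    """Parse a command line shaped like: collect.py [--token TOKEN] config.json"""
    parser = argparse.ArgumentParser(prog="collect.py")
    parser.add_argument("--token", help="GitHub token (optional if set in the config)")
    parser.add_argument("config", help="path to the config.json file")
    return parser.parse_args(argv)


def resolve_token(cli_token, config):
    """Return the token to use.

    Assumption: a token given on the command line wins over the
    `token` field of the config file.
    """
    token = cli_token or config.get("token")
    if not token:
        raise SystemExit("No token found: set `token` in config.json or pass --token")
    return token
```

For example, parse_args(["--token", "my_token", "config.json"]) yields a namespace with token set to "my_token" and config set to "config.json".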

Roadmap

  • Add more criteria for filtering repos, e.g. by language
  • Add an option to exclude archived repos
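
Until these roadmap items exist, a similar effect can probably be achieved through the search terms themselves, because GitHub's search syntax supports the language: and archived: qualifiers. The helper below is a hypothetical sketch, not part of this tool, that appends those qualifiers to a term before it goes into the terms list of the config.

```python
def with_qualifiers(term, language=None, include_archived=True):
    """Append GitHub search qualifiers to a search term.

    Hypothetical helper for illustration; `language:` and `archived:`
    are qualifiers of the GitHub search syntax.
    """
    parts = [term]
    if language:
        parts.append(f"language:{language}")
    if not include_archived:
        parts.append("archived:false")
    return " ".join(parts)


# Build a term restricted to Python repos that are not archived.
print(with_qualifiers("clinical software", language="python", include_archived=False))
```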

About

License: Apache License 2.0


Languages

Language: Python 100.0%