As an intermediate result of my master's project, I am sharing a dataset of all public GitHub repositories with at least 5 stars. It's available here on Kaggle.
This dataset is obtained from the GitHub API and contains only public repository-level metadata. It may be useful for anyone interested in studying the GitHub ecosystem. Please see the sample exploration notebook for some examples of what you can do!
The dataset is a JSON array of objects with the following fields:
- name: the name of the repository
- description: the description of the repository
- owner: the GitHub username of the owner of the repository
- nameWithOwner: the owner and name joined as owner/name
- stars: the number of stars the repository has
- forks: the number of forks the repository has
- watchers: the number of watchers the repository has
- createdAt: the date the repository was created
- pushedAt: the date the repository was last pushed to
- license: the name of the license of the repository
- codeOfConduct: the name of the code of conduct of the repository
- isFork: whether the repository is a fork
- parent: the name of the parent repository if the repository is a fork
- forkingAllowed: whether forking is allowed
- isArchived: whether the repository is archived
- diskUsageKb: the size of the repository in kilobytes
- assignableUserCount: the number of assignable users
- defaultBranchCommitCount: the number of commits on the default branch
- pullRequests: the total number of pull requests
- issues: the total number of issues
- primaryLanguage: the primary language of the repository
- languages: the first 10 languages of the repository, a list of objects with the fields name and size (ordered by size)
- languageCount: the number of languages the repository has
- topics: the first 10 topics of the repository, a list of objects with the fields name and stars
- topicCount: the number of topics the repository has
{
  "owner": "pelmers",
  "name": "text-rewriter",
  "stars": 13,
  "forks": 5,
  "watchers": 4,
  "isFork": false,
  "isArchived": false,
  "languages": [ { "name": "JavaScript", "size": 21769 }, { "name": "HTML", "size": 2096 }, { "name": "CSS", "size": 2081 } ],
  "languageCount": 3,
  "topics": [ { "name": "chrome-extension", "stars": 43211 } ],
  "topicCount": 1,
  "diskUsageKb": 75,
  "pullRequests": 4,
  "issues": 12,
  "description": "Webextension to rewrite phrases in pages",
  "primaryLanguage": "JavaScript",
  "createdAt": "2015-03-14T22:35:11Z",
  "pushedAt": "2022-02-11T14:26:00Z",
  "defaultBranchCommitCount": 54,
  "license": null,
  "assignableUserCount": 1,
  "codeOfConduct": null,
  "forkingAllowed": true,
  "nameWithOwner": "pelmers/text-rewriter",
  "parent": null
}
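As a quick illustration of working with records in this shape, here is a minimal sketch that uses only the Python standard library. The helper names are invented for this example, and the file name mentioned in the note below is a placeholder, so adjust it to wherever you saved the download. The exploration notebook remains the better reference for real analysis.

```python
import json
from collections import Counter
from datetime import datetime, timezone


def load_repos(path):
    """Load the dataset, assumed to be one JSON array of repository objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def parse_timestamp(value):
    """Parse a timestamp such as "2015-03-14T22:35:11Z" into an aware UTC datetime."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def language_shares(repo):
    """Return each listed language's fraction of the total listed size.

    Only the first 10 languages are stored per repository, so the shares are
    relative to the listed languages, not necessarily the whole repository.
    """
    sizes = {lang["name"]: lang["size"] for lang in repo.get("languages") or []}
    total = sum(sizes.values())
    return {name: size / total for name, size in sizes.items()} if total else {}


def count_primary_languages(repos):
    """Count repositories per primaryLanguage, skipping records without one."""
    return Counter(r["primaryLanguage"] for r in repos if r.get("primaryLanguage"))


def days_active(repo):
    """Whole days between creation and last push (assumes both fields are set)."""
    return (parse_timestamp(repo["pushedAt"]) - parse_timestamp(repo["createdAt"])).days
```

For example, count_primary_languages(load_repos("repo_metadata.json")).most_common(10) would list the ten most common primary languages, where repo_metadata.json is a hypothetical file name. Note that json.load reads the whole array into memory at once, which may be slow for a file of this size.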
This repository contains two things:
- A script to create the dataset, get_all_github_repos.py
- An example notebook to show how to use the dataset, explore_github_all_repos.ipynb
For more detailed background info, you can read my blog post.
The example notebook is also available on Kaggle.
The script get_all_github_repos.py creates the dataset. To use it:
- Save a GitHub Personal Access Token (classic) in a file called github_token.
- Edit DEFAULT_CONFIG at the top of the file to adjust the star and date windows.
- Run python get_all_github_repos.py
- Note: the script may take several days to run. It will first bisect the star and date space into regions to outwit the GitHub API 1000 result limit. It saves this region information in regions.pkl.
- If the script ends before completion, you can therefore resume with python get_all_github_repos.py --resume regions.pkl
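The region-splitting idea from the note above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in get_all_github_repos.py: the function name bisect_regions is invented for the example, and count_repos is a hypothetical callback standing in for whatever request reports how many repositories match a star range and a creation-date range.

```python
from datetime import date

RESULT_LIMIT = 1000  # the per-query result cap mentioned above


def bisect_regions(count_repos, stars, dates, limit=RESULT_LIMIT):
    """Split a (star range, date range) region until each piece has <= limit hits.

    stars is an inclusive (low, high) pair of ints and dates an inclusive
    (start, end) pair of datetime.date objects. count_repos(stars, dates) is a
    hypothetical callback returning the number of repositories in that region.
    """
    pending = [(stars, dates)]
    done = []
    while pending:
        s, d = pending.pop()
        if count_repos(s, d) <= limit:
            done.append((s, d))
        elif s[0] < s[1]:
            # Too many results: halve the star range first.
            mid = (s[0] + s[1]) // 2
            pending += [((s[0], mid), d), ((mid + 1, s[1]), d)]
        elif d[0] < d[1]:
            # A single star value is still too large: halve the date range.
            mid = (d[0].toordinal() + d[1].toordinal()) // 2
            left = (d[0], date.fromordinal(mid))
            right = (date.fromordinal(mid + 1), d[1])
            pending += [(s, left), (s, right)]
        else:
            # One star value on one day still exceeds the limit; this sketch
            # keeps the region as is because it cannot be split further.
            done.append((s, d))
    return sorted(done)


if __name__ == "__main__":
    # Synthetic stand-in for the API: pretend every star value gains exactly
    # 300 repositories per day. These are made-up numbers, not real counts.
    def fake_count(s, d):
        return (s[1] - s[0] + 1) * 300 * ((d[1] - d[0]).days + 1)

    regions = bisect_regions(fake_count, (5, 20), (date(2020, 1, 1), date(2020, 1, 4)))
    print(len(regions), "regions, each within the limit")
```

Splitting on stars before dates is just one possible ordering chosen for the sketch. Because each split produces two non-overlapping halves, the resulting regions cover the original space exactly once, so every repository falls into exactly one query.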
This dataset is part of my master's thesis. As that has not been published yet (or, for that matter, started), you can cite this repository instead for now.
The GitHub API Terms of Service apply.
You may not use this dataset for spamming purposes, including for the purposes of selling GitHub users' personal information, such as to recruiters, headhunters, and job boards.