An API that returns a file size summary of a given GitHub public repository, grouped by file extension. A live demo is available at https://labb-gh-summary.herokuapp.com.

GET https://labb-gh-summary.herokuapp.com/?repo=abbluiz/ps1-gen

Here abbluiz/ps1-gen is a GitHub public repository in the form owner/repo. The API checks whether this is a valid owner/repo combination. If it is, it performs a request to determine the default branch of the repository (e.g. master, main, or anything else). If that request returns a 404 error, it most likely means that the repository does not exist or is not public.
Status Code: 200 (OK)
{
"": {
"lines": 21,
"bytes": 1065
},
".md": {
"lines": 33,
"bytes": 928
},
".sh": {
"lines": 118,
"bytes": 6144
}
}
Here "" represents the group of files that have no extension. Byte counts are estimates: with web scraping you cannot always know a file's exact size, because GitHub may display it as a value rounded to KB or MB.

Furthermore, not all repository files have lines (e.g. image files, executables, etc.). In that case the bytes are still counted, but each such file is counted as having 1 line.
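To illustrate why the byte counts are estimates, a displayed size like "1.04 KB" can be converted back to an approximate byte count. This is a hypothetical sketch of that conversion, not the API's actual parsing code:

```javascript
// Convert a human-readable size string as shown on GitHub file pages
// (e.g. "1.04 KB", "6 KB", "512 Bytes") into an approximate byte count.
// Hypothetical sketch -- the API's real parsing may differ.
function approximateBytes(displayed) {
  const match = displayed.trim().match(/^([\d.]+)\s*(bytes?|kb|mb)$/i);
  if (!match) return null;
  const unit = match[2].toLowerCase();
  // "byte"/"bytes" -> 1, "kb" -> 1024, "mb" -> 1024 * 1024
  const factor = unit.startsWith('b') ? 1 : unit === 'kb' ? 1024 : 1024 * 1024;
  return Math.round(parseFloat(match[1]) * factor);
}
```

For example, a file displayed as "1.04 KB" would be estimated at 1065 bytes, matching the extensionless group in the summary above.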
You can test with the live demo at https://labb-gh-summary.herokuapp.com, or install Node.js and npm and clone this repository. After that, run npm install inside the repository directory.

Then just run npm run start and be happy.
First of all, you must create a directory named config inside the root of the repository.

Production: create a file named prod.env inside the config directory. There you can set PORT={NUMBER} to change the port and PROMISCUOUS=true|false to enable or disable promiscuous mode (disabled by default). Start the API with npm run start2.

Development: auto reload with nodemon will be enabled. Same instructions, but the file must be named dev.env and the API is started with npm run dev.

Testing: this will perform some automatic tests. Same instructions, but the file must be named test.env and the tests are run with npm run test.
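For example, a config/prod.env that changes the port and leaves promiscuous mode disabled might look like this (the port number is a placeholder):

```shell
PORT=3000
PROMISCUOUS=false
```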
?repo=owner/repo: owner must be a valid GitHub username (individual or organization), and repo must be a valid GitHub repository name.

GitHub usernames cannot have more than 39 characters; they accept hyphens (-) but cannot start or end with a hyphen or contain consecutive hyphens; otherwise usernames only accept alphanumeric characters and are case-insensitive.

Repository names are more flexible: they can have up to 100 characters and may contain alphanumeric characters, dashes, dots, and underscores.
Shout out to this CC0-licensed repository for information about this: https://github.com/shinnn/github-username-regex.
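A sketch of how these rules could be checked in code. The username pattern follows the CC0-licensed repository linked above; the repository-name pattern is an assumption derived from the rules just described, not the API's actual validation:

```javascript
// Validate an "owner/repo" string against the rules described above.
// Sketch only -- the API's real validation may differ.
const OWNER_RE = /^[a-z\d](?:[a-z\d]|-(?=[a-z\d])){0,38}$/i; // <= 39 chars, no leading/trailing/consecutive hyphens
const REPO_RE = /^[A-Za-z0-9._-]{1,100}$/;                   // letters, digits, dots, dashes, underscores

function isValidRepoParam(param) {
  const parts = param.split('/');
  if (parts.length !== 2) return false;
  const [owner, repo] = parts;
  return OWNER_RE.test(owner) && REPO_RE.test(repo);
}
```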
?mode=moderate|polite|promiscuous: only these three modes are accepted, and promiscuous is not enabled by default. The default value is moderate.

This sets the "aggressiveness" of the web scraping recursion performed by the API. polite mode is the slowest and nicest to GitHub, moderate is slow and acceptable to GitHub, and promiscuous is the fastest and quite nasty to GitHub. In fact, promiscuous mode will almost certainly cause GitHub to answer with 429 (Too Many Requests) errors if the given repository is big enough.
Once a request is made, it enters a job queue. While it is being resolved or waiting in the queue, the API answers with the following response:
Status Code: 202 (Accepted)
{
"info": "Server has started building repository summary. Come back in a moment for the results."
}
You must perform another request to the same URL (with the same parameters) later to get the results. Once the results are returned, as demonstrated in the earlier example, they remain in a cache for 6 hours before expiring.
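A client therefore has to poll until the 202 responses turn into a 200. A minimal polling sketch; the fetchFn parameter stands in for a real HTTP client so the example is self-contained, and the interval and attempt limit are arbitrary choices, not values the API prescribes:

```javascript
// Poll the API until the 202 (queued/building) responses turn into a
// 200 carrying the summary. `fetchFn(url)` must resolve to an object
// with `status` and, on 200, `body`.
async function pollSummary(fetchFn, url, { intervalMs = 5000, maxAttempts = 10 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await fetchFn(url);
    if (res.status === 200) return res.body; // summary is ready
    if (res.status !== 202) throw new Error(`Unexpected status ${res.status}`);
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error('Summary was not ready in time');
}
```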
The web scraping technique used in this API consists in creating a "fake DOM" on the server. Once this DOM is loaded, jQuery-like CSS selectors can be used to filter the data from the pages.
To load this fake DOM, I have used Cheerio.
The API crawls the repo files recursively. Each time the recursion reaches a file page, it uses the fake DOM to find the size information (lines and bytes of a repository file) on the page, and updates an object with the running sum of lines and bytes for that file extension.
The moderate mode is the default. It performs concurrent requests to GitHub in chunks of at most 5 requests, with a 2-second delay between chunks. This is faster than polite mode, but still much slower than promiscuous mode. It can handle any repository size.
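The chunking pattern of moderate mode can be sketched as follows; the chunk size and delay match the description above, while the helper names are hypothetical and the task functions stand in for the actual GitHub requests:

```javascript
// Split an array into chunks of at most `size` elements.
function chunk(items, size) {
  const chunks = [];
  for (let i = 0; i < items.length; i += size) chunks.push(items.slice(i, i + size));
  return chunks;
}

// Run async tasks in chunks, waiting `delayMs` between chunks -- the
// pattern moderate mode uses (chunk size 5, 2000 ms delay).
async function runInChunks(tasks, size = 5, delayMs = 2000) {
  const results = [];
  for (const group of chunk(tasks, size)) {
    results.push(...(await Promise.all(group.map((task) => task()))));
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return results;
}
```

With unlimited concurrency instead of chunks you get promiscuous behavior; with a chunk size of 1 and no delay you approach polite behavior.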
The polite mode traverses the repository recursively, just like the other modes, but it does not perform concurrent requests: it waits for each request to finish before starting the next. It can handle any repository size.
The promiscuous mode is not enabled by default. It uses recursion in a way that makes dozens of requests per second through unlimited concurrent requests. However, this only works with small repositories, or medium-sized repositories that do not have many files in one directory, because GitHub will not tolerate this amount of requests and will send back 429 errors.
It is also not enabled on the live demo. To enable it, follow the Setting Up Environment section and add this line to the environment file:
PROMISCUOUS=true