Note: Running these scripts used up ~90 GB of disk space for me.
- remove hardcoded paths
- remove duplicated type definitions
- factor out iteration methods vs writers
- make paths of repos owner_repo instead of owner/repo to ease iteration
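
The last item above could be sketched as a pair of small helpers. This is only an illustration: `toFlatName` and `fromFlatName` are hypothetical names that do not exist in the repo, and the inverse assumes that owner names never contain an underscore, so the first underscore is always the separator.

```typescript
// Hypothetical helper: turn "owner/repo" into a single directory name "owner_repo".
function toFlatName(fullName: string): string {
  const slash = fullName.indexOf("/");
  const hasExtraSlash = fullName.indexOf("/", slash + 1) !== -1;
  if (slash <= 0 || slash === fullName.length - 1 || hasExtraSlash) {
    throw new Error(`expected "owner/repo", got "${fullName}"`);
  }
  return `${fullName.slice(0, slash)}_${fullName.slice(slash + 1)}`;
}

// Hypothetical inverse: split on the FIRST underscore.
// Assumption: the owner part contains no underscore; the repo part may.
function fromFlatName(flatName: string): { owner: string; repo: string } {
  const sep = flatName.indexOf("_");
  if (sep <= 0 || sep === flatName.length - 1) {
    throw new Error(`expected "owner_repo", got "${flatName}"`);
  }
  return { owner: flatName.slice(0, sep), repo: flatName.slice(sep + 1) };
}
```

With a flat layout, iterating over all repos becomes a single directory listing instead of a nested owner-then-repo walk.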
- src/: TypeScript source for scripts
- bin/: location of scripts
- scripts/: PowerShell helpers for parallelizing work
- data/: outputs of scripts
- setup:
git clone https://github.com/aozgaa/scrapegithub.git c:\scrape
cd c:\scrape
npm install && npm run build
- get repos
node bin/crawler.js # ask github for the most popular repos
node bin/merger.js # merge the lists of repos collected in the previous step into a single list
node bin/fetcher.js # clone the repos from the previous step locally (with a fast network, takes an evening)
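
As a rough idea of what the merge step does, here is a minimal sketch of combining several crawled lists into one de-duplicated list. It is not the code in bin/merger.js: the `RepoEntry` shape and the `mergeRepoLists` name are assumptions made for illustration, and so is the choice to compare names case-insensitively.

```typescript
// Hypothetical shape of one crawled entry; the real crawler output may differ.
interface RepoEntry {
  fullName: string; // "owner/repo"
  stars: number;
}

// Merge several lists into one, keeping a single entry per repo.
// When a repo appears more than once, the entry with the higher star count wins.
function mergeRepoLists(lists: RepoEntry[][]): RepoEntry[] {
  // Null-prototype object so that odd keys cannot collide with Object.prototype.
  const byKey: Record<string, RepoEntry> = Object.create(null);
  for (const list of lists) {
    for (const entry of list) {
      const key = entry.fullName.toLowerCase(); // assumption: names compare case-insensitively
      const seen = byKey[key];
      if (seen === undefined || entry.stars > seen.stars) {
        byKey[key] = entry;
      }
    }
  }
  // Most popular first; ties are broken by name so the output is stable.
  return Object.keys(byKey)
    .map((key) => byKey[key])
    .sort((a, b) => b.stars - a.stars || a.fullName.localeCompare(b.fullName));
}
```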
- get statistics
1. ./scripts/getDocComments.ps1
-- scan through the repo owners from the previous step and extract doc comments for each one into a json file
2. node bin/getTagStats.js
-- read each json from the previous step and extract jsdoc tags and their usage counts from each one
3. node bin/proportionWithJsDoc.js
-- read each json from (1) and determine the proportion of owners using jsdoc.
4. node bin/proportionWithTaggedJsDoc.js
-- read each json from (2) and determine the proportion of owners using jsdoc tags.
5. node bin/mergeTagStats.js
-- read each json from (2) and combine tag stats across all repos.
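
The tag statistics steps above boil down to three operations: count tags per owner, merge the counts, and compute a proportion. The sketch below is not the code in bin/. The names `TagCounts`, `countTags`, `mergeTagCounts` and `proportionWithTags` are hypothetical, and it assumes that doc comments are available as plain strings and that a tag is an `@name` at the start of a comment line.

```typescript
// Hypothetical result shape: tag name -> number of uses.
type TagCounts = Record<string, number>;

// Count block tags such as @param in a list of doc comment strings.
function countTags(docComments: string[]): TagCounts {
  // Null-prototype object: a tag like "constructor" must not hit Object.prototype.
  const counts: TagCounts = Object.create(null);
  // "@name" at the start of a line, after an optional "/**" or "*" prefix.
  const tagPattern = /^[ \t]*(?:\/\*\*|\*)?[ \t]*@([A-Za-z][\w-]*)/gm;
  for (const comment of docComments) {
    tagPattern.lastIndex = 0; // reset the global regex before each comment
    let match: RegExpExecArray | null;
    while ((match = tagPattern.exec(comment)) !== null) {
      const tag = match[1];
      counts[tag] = (counts[tag] || 0) + 1;
    }
  }
  return counts;
}

// Combine per-owner (or per-repo) counts into one overall table.
function mergeTagCounts(allCounts: TagCounts[]): TagCounts {
  const total: TagCounts = Object.create(null);
  for (const counts of allCounts) {
    for (const tag of Object.keys(counts)) {
      total[tag] = (total[tag] || 0) + counts[tag];
    }
  }
  return total;
}

// Fraction of entries that use at least one tag (0 for an empty input).
function proportionWithTags(allCounts: TagCounts[]): number {
  if (allCounts.length === 0) return 0;
  const withTags = allCounts.filter((c) => Object.keys(c).length > 0).length;
  return withTags / allCounts.length;
}
```

Inline tags in the middle of a line, such as `{@link Foo}`, are deliberately not counted by this pattern. Whether the real scripts count them is not stated here.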