microsoft / ghcrawler

Crawl GitHub APIs and store the discovered orgs, repos, commits, ...

Add support for traversing Releases

jeffmcaffer opened this issue

The repo traversal currently does not gather Releases, and although the ReleaseEvent is harvested, the actual Release document is not fetched either.

I'd be interested in adding support for this, as tracking releases and associated downloads is something I'd very much like to see on our dashboard. Can you suggest a good PR to look at as an example of how to begin? Or a good section of code I could get some mileage out of reusing to start testing locally?

Awesome. In the GitHub processor there is a function that handles repos. Its input is a request that has a document. The bulk of that function teases apart the document, adds links to related things, and queues up things to be traversed. At the end of that function you would add something like:

this._addCollection(request, 'releases', 'release');

Then you need to implement a function called release that will be called with a document containing the release response as detailed in the GitHub API. That function should then do any processing you want with respect to the assets in the release, etc.
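To make the shape of such a handler concrete, here is a rough, standalone sketch. It is not the actual ghcrawler code: the request shape ({ type, url, document }), the _summary field, and the sample data are all assumptions made up for illustration. Only the release fields used (id, tag_name, assets, and each asset's name and download_count) come from the GitHub Releases API response.

```javascript
// Hypothetical stand-in for a release handler; the real one would live in the
// GitHub processor and use its helpers. The request shape here is assumed.
function release(request) {
  const document = request.document;
  // `assets` is an array in the GitHub Releases API response; default to empty.
  const assets = Array.isArray(document.assets) ? document.assets : [];

  // Summarize the assets so a dashboard could track downloads per release.
  const assetSummary = assets.map(asset => ({
    id: asset.id,
    name: asset.name,
    downloadCount: asset.download_count || 0
  }));
  const totalDownloads = assetSummary.reduce((sum, a) => sum + a.downloadCount, 0);

  // Return a new document instead of mutating the input.
  // `_summary` is an illustrative field name, not a ghcrawler convention.
  return Object.assign({}, document, {
    _summary: { assets: assetSummary, totalDownloads }
  });
}

// Usage with a trimmed-down, made-up release response.
const sampleRequest = {
  type: 'release',
  url: 'https://api.github.com/repos/microsoft/ghcrawler/releases/1',
  document: {
    id: 1,
    tag_name: 'v1.0.0',
    assets: [
      { id: 10, name: 'ghcrawler.zip', download_count: 42 },
      { id: 11, name: 'ghcrawler.tar.gz', download_count: 8 }
    ]
  }
};
const processed = release(sampleRequest);
console.log(processed._summary.totalDownloads); // 50
```

In the real processor the handler would presumably also add links and queue related resources the way the other entity handlers do; this sketch only covers the asset bookkeeping.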

You can start with review() as an example of basic entity processing.

Note that you will also have to add releases to the list of collections in isCollectionType().
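As a rough idea of what that change amounts to, the check could look something like the sketch below. The list of names is illustrative only and the function is a standalone stand-in; the actual set of collections and the exact form of isCollectionType() are in the ghcrawler source.

```javascript
// Illustrative list only; the real collection names live in the ghcrawler source.
// 'releases' is the new entry being added.
const collectionTypes = new Set(['commits', 'issues', 'reviews', 'releases']);

// Assumed shape: the request carries its type as a string.
function isCollectionType(request) {
  return collectionTypes.has(request.type);
}

console.log(isCollectionType({ type: 'releases' })); // true
console.log(isCollectionType({ type: 'release' }));  // false
```

The plural name identifies the collection page, while the singular release is the per-element type passed as the last argument to _addCollection above.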

I suggest that you sketch something out, send in a PR and we can iterate. We are more than happy to help with the details.

Check out the processor tests as well. They are helpful for seeing what the inputs and outputs are, in particular the repo tests.

Also check out the wiki and in particular the data architecture.