Create es river that brings data into elasticboard
mishu- opened this issue · comments
Need
Right now the implementation relies on parsing dump files updated via cron. It would be nice to have a cleaner way to bring GitHub data into the es database.
Proposed Solution
Create an es river (http://www.elasticsearch.org/blog/the-river/) which pulls data from github to es directly.
Notes
This issue is a stub.
cc @mihneadb
Worth mentioning that the way to bring data in is either:
1. subscribe with a listener to github's API for the given repo
2. poll the events every delta T for new events and check for dupes

1 requires an action from someone who has access rights to the repo (even if it's a public repo), 2 doesn't.
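
The polling option (poll every delta T and check for dupes) could look roughly like the sketch below. It assumes each event is a dict with a unique `"id"` field and that pages arrive newest first. `collect_new_events`, `pages` and `seen_ids` are hypothetical names for illustration, not part of elasticboard or of any river API, and fetching the pages is deliberately left out.

```python
from typing import Dict, Iterable, List, Set

Event = Dict[str, object]


def collect_new_events(pages: Iterable[List[Event]], seen_ids: Set[str]) -> List[Event]:
    """Return events not seen before, oldest first, and record their ids.

    `pages` is any iterable of event lists, newest page first (assumption).
    `seen_ids` is the set of ids already stored; it is updated in place.
    """
    fresh: List[Event] = []
    for page in pages:
        found_new = False
        for event in page:
            event_id = str(event["id"])
            if event_id in seen_ids:
                # Dupe: stored on an earlier poll, or shifted from the previous page.
                continue
            seen_ids.add(event_id)
            fresh.append(event)
            found_new = True
        if not found_new:
            # Empty page or nothing unseen on it: older pages were already stored.
            break
    fresh.reverse()  # oldest first, so they can be indexed in order
    return fresh
```

Calling this once per delta T with the same `seen_ids` set returns only what arrived since the previous poll. In a real river the set would have to be persisted (or replaced by a lookup in es) so that a restart does not re-index everything.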
Can you please provide links to the documentation for both 1 and 2?
- http://developer.github.com/v3/repos/hooks/
- http://developer.github.com/v3/activity/events/ (ctrl-f for 300 :) )
Part of the email conversation:
Hey Mihnea,
> How do you suggest I get the events that I haven't seen so far? Using a timestamp? The github archive scraper collects lots of events, I'm guessing more than 300 so there should be a way, right?
As I mentioned before, we can only provide a history of up to 300 events currently. If you need to collect more than that, the only way to do it is to periodically fetch events from the API and store them locally. I'm guessing that the (Unofficial) GitHub Archive project is doing exactly that - polling our API with a high frequency to pick up all events. If you need to go further back in history and need to do it now - there is no workaround for that except querying the archive project.
> By doing a simple check (cat | sort | uniq | wc -l) I found that indeed there are just 300 unique events. However, your API didn't reply with "last" as a page number, it let my script keep polling.
Ooops, sorry about that! I noticed that our documentation says that we will return a "last" link, and in fact we aren't. I'll see if we can do something to correct that - thanks for the report!
Glad you were able to figure out what was going on! Let me know if you have any other questions or feedback.
Cheers,
Ivan
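
Given the email above (no "last" link, and a script that kept polling), a poller probably should not wait for a "last" page. A minimal sketch of a more defensive loop follows: it follows `rel="next"` from the `Link` header and also stops on an empty page or after a page cap. `parse_link_header`, `crawl`, the injected `get` callable and the default cap of 10 pages are all assumptions made for illustration, not GitHub's or elasticboard's code.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# Matches entries such as: <https://example.org/events?page=2>; rel="next"
_LINK_RE = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"')

Page = List[dict]


def parse_link_header(header: Optional[str]) -> Dict[str, str]:
    """Map each rel value ("next", "last", ...) to its URL."""
    if not header:
        return {}
    return {rel: url for url, rel in _LINK_RE.findall(header)}


def crawl(get: Callable[[str], Tuple[Page, Optional[str]]],
          start_url: str, max_pages: int = 10) -> List[Page]:
    """Follow rel="next" links, never relying on a rel="last" link.

    `get` is a hypothetical callable returning (events, Link header value).
    """
    pages: List[Page] = []
    url: Optional[str] = start_url
    for _ in range(max_pages):  # hard cap in case "next" never disappears
        events, link = get(url)
        if not events:
            break  # empty page: nothing more to read
        pages.append(events)
        url = parse_link_header(link).get("next")
        if url is None:
            break  # no next link: this was the final page
    return pages
```

Passing `get` in as a parameter keeps the HTTP client out of the sketch and makes the stop conditions easy to test with a fake.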
Found some useful resources for this (there doesn't seem to be an already-implemented gh river).
https://github.com/elasticsearch/elasticsearch-river-twitter/blob/master/src/main/java/org/elasticsearch/river/twitter/TwitterRiver.java
http://blog.trifork.com/2013/01/10/how-to-write-an-elasticsearch-river-plugin/