Create es river that brings data into elasticboard

Question

mishu- opened this issue 11 years ago

Need

The current implementation relies on parsing dump files that are updated via a cron job. It would be nice to have a cleaner way to bring GitHub data into the es database.

Proposed Solution

Create an es river (http://www.elasticsearch.org/blog/the-river/) which pulls data from github to es directly.

Notes

This issue is a stub.

Mihai Oprea · Answer 1 · Mon Jan 06 2014 18:07:55 GMT+0800 (China Standard Time)

cc @mihneadb

Mihnea Dobrescu-Balaur · Answer 2 · Mon Jan 06 2014 18:23:32 GMT+0800 (China Standard Time)

Worth mentioning that the way to bring data in is either:

1. subscribe with a listener to github's API for the given repo
2. poll the events every delta T for new events and check for dupes

1 requires an action from someone who has access rights to the repo (even if it's a public repo), 2 doesn't.
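
The polling option could be sketched roughly like this in Python. This is only an illustration, not the river itself: `fetch_events`, `handle`, `collect_new_events`, and `poll` are hypothetical names, and the sketch assumes each event carries a unique `id` field and that a fetch returns the newest events first.

```python
import time
from typing import Callable, Dict, Iterable, List, Optional, Set

Event = Dict[str, object]


def collect_new_events(
    fetch_events: Callable[[], Iterable[Event]], seen_ids: Set[str]
) -> List[Event]:
    """Fetch once and return only events whose id was not seen before."""
    new_events: List[Event] = []
    for event in fetch_events():
        event_id = str(event["id"])  # assumption: every event has an "id"
        if event_id in seen_ids:
            continue  # dupe from an earlier poll, skip it
        seen_ids.add(event_id)
        new_events.append(event)
    # Assumption: the fetch is newest first, so reverse to get oldest first.
    new_events.reverse()
    return new_events


def poll(
    fetch_events: Callable[[], Iterable[Event]],
    handle: Callable[[Event], None],
    delta_t_seconds: float,
    max_rounds: Optional[int] = None,
) -> None:
    """Poll every delta T and pass each previously unseen event to handle."""
    seen_ids: Set[str] = set()
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        for event in collect_new_events(fetch_events, seen_ids):
            handle(event)  # e.g. index the event into es
        rounds += 1
        if max_rounds is None or rounds < max_rounds:
            time.sleep(delta_t_seconds)
```

In a real setup `fetch_events` would call the repo's events endpoint and `handle` would index the event into es. The set of seen ids lives in memory here, so it would have to be persisted to survive a restart.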

Mihai Oprea · Answer 3 · Mon Jan 06 2014 18:30:39 GMT+0800 (China Standard Time)

Can you please provide links to the documentation for both 1 and 2?

Mihnea Dobrescu-Balaur · Answer 4 · Mon Jan 06 2014 18:38:21 GMT+0800 (China Standard Time)

http://developer.github.com/v3/repos/hooks/
http://developer.github.com/v3/activity/events/ (ctrl-f for 300 :) )

Part of the email conversation:

Hey Mihnea,

> How do you suggest I get the events that I haven't seen so far? Using a timestamp? The github archive scraper collects lots of events, I'm guessing more than 300 so there should be a way, right?

As I mentioned before, we can only provide a history of up to 300 events currently. If you need to collect more than that, the only way to do it is to periodically fetch events from the API and store them locally. I'm guessing that the (Unofficial) GitHub Archive project is doing exactly that - polling our API with a high frequency to pick up all events. If you need to go further back in history and need to do it now - there is no workaround for that except querying the archive project.

> By doing a simple check (cat | sort | uniq | wc -l) I found that indeed there are just 300 unique events. However, your API didn't reply with "last" as a page number, it let my script keep polling.

Ooops, sorry about that! I noticed that our documentation says that we will return a "last" link, and in fact we aren't. I'll see if we can do something to correct that - thanks for the report!

Glad you were able to figure out what was going on! Let me know if you have any other questions or feedback.

Cheers,
Ivan
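
Given that the API may not return a "last" link, a paging loop probably should not wait for one. A minimal Python sketch under that assumption: follow `rel="next"` from the `Link` header and stop on a missing link, an empty page, or a page cap. `get_page`, `parse_link_header`, and `fetch_all_pages` are hypothetical names, and the default cap of 10 pages is an assumption (300 events at an assumed 30 per page).

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

Event = Dict[str, object]

# Matches entries such as: <https://example.invalid/events?page=2>; rel="next"
LINK_RE = re.compile(r'<([^>]+)>;\s*rel="([^"]+)"')


def parse_link_header(header: Optional[str]) -> Dict[str, str]:
    """Turn a Link header value into a {rel: url} mapping."""
    if not header:
        return {}
    return {rel: url for url, rel in LINK_RE.findall(header)}


def fetch_all_pages(
    first_url: str,
    get_page: Callable[[str], Tuple[List[Event], Optional[str]]],
    max_pages: int = 10,  # assumption: 300 events at 30 per page
) -> List[Event]:
    """Follow rel="next" links; never rely on a rel="last" link being present."""
    events: List[Event] = []
    url: Optional[str] = first_url
    pages = 0
    while url and pages < max_pages:
        page_events, link_header = get_page(url)
        if not page_events:
            break  # an empty page means there is nothing further to read
        events.extend(page_events)
        url = parse_link_header(link_header).get("next")
        pages += 1
    return events
```

Here `get_page(url)` is expected to return the decoded events for one page together with the raw `Link` header value, so any HTTP client can be plugged in.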

Mihnea Dobrescu-Balaur · Answer 5 · Wed Jan 08 2014 19:29:09 GMT+0800 (China Standard Time)

Found some useful resources for this (there doesn't seem to be an already-implemented gh river).

https://github.com/elasticsearch/elasticsearch-river-twitter/blob/master/src/main/java/org/elasticsearch/river/twitter/TwitterRiver.java
http://blog.trifork.com/2013/01/10/how-to-write-an-elasticsearch-river-plugin/

Mihnea Dobrescu-Balaur · Answer 6 · Thu Jan 23 2014 18:02:08 GMT+0800 (China Standard Time)

https://github.com/uberVU/elasticsearch-river-github