logstash-plugins / logstash-input-s3

Multiple Broker/Indexers ingesting

cdenneen opened this issue · comments

I'd like to ask how you would run multiple Logstash servers (for HA) pulling from the same S3 input when they don't share a sincedb_path.
Could you use an NFS/GFS filesystem and have more than one instance of Logstash using the same sincedb file?
This might not even be possible, but it would be really helpful for the case where an S3 input thread dies (with Logstash still running) and ingestion has stopped.
Obviously, fixing the S3 input thread so it doesn't die is the correct fix, but for HA, if the LS node died it would be nice to have two running, so the node could be repaired without downstream data loss/backup/delay.
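For concreteness, the setup being asked about would look roughly like this (the bucket name and mount path are hypothetical). Both nodes point `sincedb_path` at the same shared mount; the open question is whether anything coordinates their writes to it:

```
input {
  s3 {
    bucket       => "my-app-logs"              # hypothetical bucket
    region       => "us-east-1"
    sincedb_path => "/mnt/nfs/s3.sincedb"      # shared NFS mount, used by both nodes
  }
}
```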

Any update on this?

Happy one-year anniversary.

Logstash currently has no mechanism for internode coordination (two logstash nodes coordinating work efforts), and my best guess is that we would need a coordination mechanism in order to achieve what is proposed in this issue. The one external system that this input knows about is S3, and as far as I can tell, S3 can't be used for coordination because it lacks atomic operations that could make coordination possible.

At this time, I don't have a solution, so this issue will wait until someone can come up with a solution that other S3 input users find agreeable.

If the sincedb path pointed to a shared source, would Logstash honor it, or would there be collisions?
I think if we can't do proper locking, the only other option would be to use some other external source as the sincedb to enable a coordination mechanism. Whether for file or S3 inputs, the external resource would manage which broker grabs which file.
Extending the sincedb logic is the only suggestion I can think of to make inputs like this redundant across multiple brokers.
It seems like a large undertaking if someone wants to suggest some pluggable options (Mongo, DynamoDB, a database, etc.).
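To make the coordination idea concrete, here is a minimal Ruby sketch of a claim-based scheme. The `LockStore` below is an in-memory stand-in for an external store: with DynamoDB, the same semantics would come from a `put_item` with a `condition_expression` of `attribute_not_exists(object_key)`, which fails atomically when another node already holds the claim. All class and method names here are hypothetical, not part of this plugin.

```ruby
class LockStore
  def initialize
    @claims = {}
    @mutex  = Mutex.new
  end

  # Atomically claim an object key for a node. Returns true if the
  # claim succeeded, false if another node already owns the key.
  def try_claim(object_key, node_id)
    @mutex.synchronize do
      return false if @claims.key?(object_key)
      @claims[object_key] = node_id
      true
    end
  end
end

# Each node lists the bucket and processes only the keys it wins.
def process_new_objects(store, node_id, keys)
  keys.select { |key| store.try_claim(key, node_id) }
end

store = LockStore.new
keys  = %w[logs/a.gz logs/b.gz logs/c.gz]

node1 = process_new_objects(store, "logstash-1", keys)
node2 = process_new_objects(store, "logstash-2", keys)
# Across both nodes, every key is claimed exactly once.
```

The important property is that the claim is a single conditional write, so two brokers listing the same bucket can never both win the same key, regardless of timing.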

The Kinesis plugin uses a DynamoDB row that just assigns different data streams to different Logstash instances.

https://github.com/logstash-plugins/logstash-input-kinesis

This isn't the most efficient method, but for deployments with logstash-as-cattle, this type of implementation at least moves the plugin from "we can't use it" to "okay, this will work".
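For comparison, a minimal configuration of that plugin looks roughly like this (the stream and application names are placeholders). The `application_name` identifies the DynamoDB lease table that the Kinesis client library uses to split shards across workers, which is the coordination piece this issue is asking for:

```
input {
  kinesis {
    kinesis_stream_name => "my-log-stream"     # placeholder stream name
    application_name    => "logstash-ingest"   # names the shared DynamoDB lease table
    region              => "us-east-1"
  }
}
```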

@codekitchen @robbavey can anyone tackle adding the dynamodb alternative to sincedb to this input?

Yeah, that would be awesome: move the sincedb to an "outside" system so that we can treat the s3 input as cattle.

We've thought about utilizing EFS to achieve this, but for the time being we settled on just boosting our Logstash workers/mem/CPU/batch size. But we need more.

UPDATE:
We will be trying out the logstash-input-s3-sns-sqs plugin (if we can get it to work with our system). This should allow us to scale horizontally without worrying about the same files being processed by each new container.
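A rough sketch of that pipeline, assuming the plugin registers as `s3snssqs` and exposes a `queue` option (these are from memory; check the plugin's README): the bucket publishes event notifications to SNS, SNS fans out to an SQS queue, and multiple Logstash consumers poll the queue, with SQS's visibility timeout ensuring each object notification is handed to one consumer at a time.

```
input {
  s3snssqs {
    queue  => "logstash-s3-events"   # SQS queue subscribed to the bucket's SNS topic
    region => "us-east-1"
  }
}
```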