TimelyDataflow / timely-dataflow

A modular implementation of timely dataflow in Rust

Example to ingest the data over REST

tsk70 opened this issue

commented

Is there any example available to ingest the data over REST?

I think the answer is no, though I think setting up a REST endpoint is probably the harder part of the example. Kafka (or something like it) is probably the more natural way to provide data at the moment, where timely reads from sources of input (e.g. local files). Kafka has a REST interface, so if you use it as the intermediary you could write to a Kafka queue that has been connected to a timely dataflow computation.

Hi @tsk70! We are actively working on adding REST endpoints and quality-of-life improvements for interfacing with timely and differential at this moment. We'll ping this issue when we have something that you can try out.

If you'd like an example of reading data out of Kafka, the kafkaesque subproject previously only had examples for serialized timely dataflow data. I've added a fairly simple method that takes a Kafka topic and an arbitrary user-supplied "from bytes" closure. It lacks a great many things (e.g. dealing with partitioned streams, committing offsets, other things that you would want in a real system).

https://github.com/TimelyDataflow/timely-dataflow/blob/master/kafkaesque/src/kafka_source.rs
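For a self-contained starting point, here is a rough sketch of one way to feed Kafka messages into a timely dataflow. It does not use the kafka_source helper linked above; instead it polls an rdkafka BaseConsumer in the worker loop and pushes decoded records through an InputHandle, advancing the input's epoch once per batch so downstream operators can make progress. The broker address, topic name, group id, and the bytes-to-String decode step are all placeholder assumptions.

```rust
use std::time::Duration;

use rdkafka::config::ClientConfig;
use rdkafka::consumer::{BaseConsumer, Consumer};
use rdkafka::Message;

use timely::dataflow::InputHandle;
use timely::dataflow::operators::{Input, Inspect, Probe};

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {

        // Placeholder broker address, topic name, and group id.
        let consumer: BaseConsumer = ClientConfig::new()
            .set("bootstrap.servers", "localhost:9092")
            .set("group.id", "timely-ingest-example")
            .create()
            .expect("failed to create Kafka consumer");
        consumer.subscribe(&["events"]).expect("failed to subscribe");

        // A dataflow that just prints whatever it ingests.
        let mut input = InputHandle::<u64, String>::new();
        let probe = worker.dataflow::<u64, _, _>(|scope| {
            scope.input_from(&mut input)
                 .inspect(|record| println!("ingested: {:?}", record))
                 .probe()
        });

        // Poll Kafka and feed records into the dataflow forever, advancing the
        // input's epoch once per batch so downstream operators see progress.
        let mut epoch: u64 = 0;
        loop {
            while let Some(result) = consumer.poll(Duration::from_millis(100)) {
                if let Ok(message) = result {
                    if let Some(bytes) = message.payload() {
                        // The user-supplied "from bytes" step; here just UTF-8 decoding.
                        input.send(String::from_utf8_lossy(bytes).into_owned());
                    }
                }
            }
            epoch += 1;
            input.advance_to(epoch);
            while probe.less_than(input.time()) {
                worker.step();
            }
        }
    }).unwrap();
}
```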

commented

Thanks @rjnn and @frankmcsherry for the quick response. I tried to build the subproject and got the error below.

thread 'rustc' panicked at 'index out of bounds: the len is 4 but the index is 7', /rustc/146aa60f3484d8267e085e80611969f387eca068/src/libcore/slice/mod.rs:2545:14
note: Run with RUST_BACKTRACE=1 environment variable to display a backtrace.

error: internal compiler error: unexpected panic

note: the compiler unexpectedly panicked. this is a bug.

note: we would appreciate a bug report: https://github.com/rust-lang/rust/blob/master/CONTRIBUTING.md#bug-reports

note: rustc 1.34.0-nightly (146aa60f3 2019-02-18) running on x86_64-apple-darwin

note: compiler flags: -C debuginfo=2 --crate-type lib
note: some of the compiler flags provided by cargo are hidden

error: Could not compile rdkafka.
warning: build failed, waiting for other jobs to finish...
error: build failed

That looks fun!

I recommend using the stable compiler for the moment (the error above says that you are using the nightly compiler), unless you want to help the Rust folks out with debugging their compiler issues (a nice thing to do, but maybe not the first thing to get working). I can try and repro and file an issue on the rustc repository for you, but for me at the moment the stable rust compiler is building things just fine. Let me know if that works for you!

Yup, it looks like the current nightly Rust crashes building rdkafka. The current stable release has no problem, though, at least on my machine. @tsk70 would you like the honor of reporting this as an issue on the Rust repo, or should I do that?

commented

Thanks Frank. Current stable release works fine. I will log the issue on the rust repo.

commented

Hi @frankmcsherry,
Do you know when support for partitioned streams, committing offsets, etc. will be available in kafkaesque? If there is no ETA, what's your recommendation for listening to a live stream? Can we use a TcpListener instead? If the answer is yes, how does it work in a cluster environment, since each worker will listen on a different port? I am assuming that the client (or some proxy) has to write to a different port based on the number of workers and a sharding key.

Hello!

I think there is no ETA on adding Kafka features. If you have specific problems, we could look into supporting solutions for them.

Kafka partitioned streams should "work" in that (as I understand it) rdkafka will round-robin the parts among the consumers. What doesn't "work" is that Kafka doesn't propagate watermark information, and as soon as a stream is partitioned it is no longer a "sequence" but rather a set of sequences, and it gets a bit harder for each source to understand what the least timestamps are. What timely does here when it uses Kafka is to manually split the stream into single-part topics topic-1, topic-2, ..., which isn't great, but at least allows the consumers to provide correct progress information. You can see this in the capture_send.rs example.
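As a sketch of that per-worker topic naming, something like the following could work. The base topic name "events", the broker address, and the group id are placeholders; the point is only that worker i reads a topic that no other worker reads.

```rust
use rdkafka::config::ClientConfig;
use rdkafka::consumer::{BaseConsumer, Consumer};

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        // Hypothetical naming scheme: worker i reads only "events-i", so every
        // single-part topic has exactly one reader that can speak for its watermark.
        let my_topic = format!("events-{}", worker.index());

        let consumer: BaseConsumer = ClientConfig::new()
            .set("bootstrap.servers", "localhost:9092")   // placeholder broker address
            .set("group.id", "timely-partitioned-example")
            .create()
            .expect("failed to create Kafka consumer");
        consumer.subscribe(&[my_topic.as_str()])
            .expect("failed to subscribe");

        // ... build the dataflow and drain `consumer` as in the earlier sketch ...
    }).unwrap();
}
```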

My guess is that there is no ETA on committing offsets; this relies on the durability properties of the timely dataflow computation between the Kafka queues, and if you have a program that should maintain persistent state but doesn't yet, you shouldn't commit the offsets until that state is persisted.

As far as recommendations, it depends on where you are getting your data from and what information it has in it. If you have a sequence of in-order (by timestamp) data, I would probably put it in a single Kafka topic (and advance input capabilities as records move past). If you want to partition the stream, make sure each part gets watermark/progress information; I would personally use several single-part topics (per the example above). You can also use TcpListeners, which is how the capture_{send,recv}.rs examples in the timely repository work.
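For completeness, here is a rough sketch in the spirit of the capture_recv.rs example: each worker binds a TcpListener, accepts one connection, and replays the capture-format stream it receives. The port scheme (8000 + worker index) and the u64 timestamp/data types are assumptions, and the exact EventReader signature can vary across timely versions.

```rust
use std::net::TcpListener;

use timely::dataflow::operators::Inspect;
use timely::dataflow::operators::capture::{EventReader, Replay};

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        // Placeholder port scheme: worker i listens on 8000 + i and replays the
        // capture-format stream that a producer writes to that socket.
        let listener = TcpListener::bind(format!("127.0.0.1:{}", 8000 + worker.index()))
            .expect("failed to bind listener");
        let socket = listener
            .incoming()
            .next()
            .expect("listener closed unexpectedly")
            .expect("failed to accept connection");

        worker.dataflow::<u64, _, _>(|scope| {
            // Timestamps and records are both u64 here; match these types to
            // whatever the producing side captured.
            Some(EventReader::<u64, u64, _>::new(socket))
                .replay_into(scope)
                .inspect(|x| println!("replayed: {:?}", x));
        });
    }).unwrap();
}
```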

Does this help?

commented

Thanks for the quick response.
Some of our data comes from Kafka, but events may arrive out of order. In most cases the topic has multiple partitions and we don't have much control over the producer side.
But some of the data comes from another system, where an external Java client can push the data to a long-running timely dataflow. Do you think TcpListeners will be a good fit in this case?
How do we make capture_recv a long-running process that keeps listening indefinitely for incoming data? Looping or reading stdin after worker.dataflow is not working. Also, I need to push the data from non-timely code instead of using the capture_send code.
A working sample would be great.

The problem I see is that with a multi-part Kafka stream, especially when data arrive out of order, someone will need to determine when you can be sure that certain timestamps will not appear again. At least, this information is needed if you want to communicate it to timely operators that may want to block for complete data. You could invent some rules, like perhaps that data are no more than 10 seconds out of order, which would allow you to downgrade the source capability to time - 10s whenever you see time. But this would just be a guess without further information from the input source.
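To make that "10 seconds out of order" rule concrete, here is a hedged sketch of a custom source operator that emits records at their own timestamps and downgrades its capability to the largest observed time minus a slack bound. The record timestamps and the SLACK constant are made up for illustration, and the source/activator API details may differ slightly across timely versions.

```rust
use timely::dataflow::operators::Inspect;
use timely::dataflow::operators::generic::operator::source;
use timely::scheduling::Scheduler;

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        worker.dataflow::<u64, _, _>(|scope| {
            // Assumed disorder bound: no record is more than SLACK time units
            // behind the largest timestamp seen so far.
            const SLACK: u64 = 10;
            // Made-up, mildly out-of-order record timestamps.
            let mut records = vec![12u64, 5, 17, 9, 23, 15].into_iter();

            source(scope, "BoundedDisorder", |capability, info| {
                let activator = scope.activator_for(&info.address[..]);
                let mut cap = Some(capability);
                move |output| {
                    let mut done = false;
                    if let Some(c) = cap.as_mut() {
                        if let Some(time) = records.next() {
                            // Emit the record at its own timestamp; `delayed`
                            // requires that timestamp to be >= the capability's.
                            output.session(&c.delayed(&time)).give(time);
                            // Having seen `time`, promise never to emit anything
                            // earlier than `time - SLACK`.
                            let frontier = time.saturating_sub(SLACK);
                            if frontier > *c.time() {
                                c.downgrade(&frontier);
                            }
                            activator.activate();
                        } else {
                            done = true;
                        }
                    }
                    if done { cap = None; }   // input exhausted: release the capability
                }
            })
            .inspect_batch(|t, xs| println!("at time {:?}: {:?}", t, xs));
        });
    }).unwrap();
}
```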

A TcpListener should work fine, though at that point I might consider not using replay_into, as it requires a certain format. Instead, you can write a source operator that polls its TcpStream for new data (a bit like the Kafka example operator does) and produces output and downgrades its capabilities as appropriate.
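A rough sketch of that shape might look like the following: each worker accepts one TCP connection, reads newline-delimited records without blocking, emits them, and advances its capability once per poll. The port scheme, the line-based framing, and the "advance by one each poll" timestamping are all placeholder choices, not anything the timely repository prescribes.

```rust
use std::io::{ErrorKind, Read};
use std::net::TcpListener;

use timely::dataflow::operators::Inspect;
use timely::dataflow::operators::generic::operator::source;
use timely::scheduling::Scheduler;

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        // Placeholder port scheme: worker i accepts one producer on 9000 + i.
        let listener = TcpListener::bind(format!("127.0.0.1:{}", 9000 + worker.index()))
            .expect("failed to bind listener");
        let (mut socket, _) = listener.accept().expect("failed to accept connection");
        socket.set_nonblocking(true).expect("failed to set non-blocking");

        worker.dataflow::<u64, _, _>(|scope| {
            source(scope, "TcpSource", |capability, info| {
                let activator = scope.activator_for(&info.address[..]);
                let mut cap = Some(capability);
                let mut buffer = Vec::new();   // bytes read so far, possibly a partial line
                move |output| {
                    let mut disconnected = false;
                    if let Some(c) = cap.as_mut() {
                        // Pull whatever bytes are available right now, without blocking.
                        let mut chunk = [0u8; 1024];
                        loop {
                            match socket.read(&mut chunk) {
                                Ok(0) => { disconnected = true; break; }
                                Ok(n) => buffer.extend_from_slice(&chunk[..n]),
                                Err(ref e) if e.kind() == ErrorKind::WouldBlock => break,
                                Err(e) => panic!("read error: {:?}", e),
                            }
                        }
                        // Emit every complete newline-terminated record.
                        while let Some(pos) = buffer.iter().position(|&b| b == b'\n') {
                            let line: Vec<u8> = buffer.drain(..=pos).collect();
                            let text = String::from_utf8_lossy(&line[..pos]).into_owned();
                            output.session(&c).give(text);
                        }
                        // Advance time once per poll so downstream operators see
                        // progress, then ask to be scheduled again (busy polling,
                        // for simplicity).
                        let next = *c.time() + 1;
                        c.downgrade(&next);
                        activator.activate();
                    }
                    if disconnected { cap = None; }   // producer hung up: release capability
                }
            })
            .inspect(|line| println!("received: {:?}", line));
        });
    }).unwrap();
}
```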