Tiny demo project that demonstrates how to join streams of Kafka events using Apache Flink.
This is a solution to a question I have used in interviews to test for distributed stream processing knowledge.
The question goes as follows:
Assume you have the following rudimentary data model. Assume that `User` and `Tweet` are keyed using their respective keys and stored in Kafka. Describe how you would implement a `join` function to produce a `DataStream[TweetView]` from a `DataStream[User]` and a `DataStream[Tweet]`. Choose a streaming framework you are comfortable with; Kafka's `KStream` or Spark's `DStream` would work equally well.
```scala
case class User(userId: UserId, firstName: String, age: Int)
case class Tweet(tweetId: TweetId, author: UserId, content: String)
case class TweetView(tweetId: TweetId, tweet: Tweet, author: User)

val users: DataStream[User] = fromKafkaTopic("users")
val tweets: DataStream[Tweet] = fromKafkaTopic("tweets")

def join(users: DataStream[User], tweets: DataStream[Tweet]): DataStream[TweetView] = {
  ???
}
```
Any solution is fair game, as long as it takes into account that `users` and `tweets` can arrive in any order.
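One possible answer, sketched here with Flink's `KeyedCoProcessFunction` (this is an illustrative assumption, not the only valid solution): key both streams by the user id, connect them, and buffer whichever side arrives first in keyed state so that arrival order does not matter. `UserId`, `TweetId`, and the case classes come from the snippet above; everything else is assumed.

```scala
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor, ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector
import scala.collection.JavaConverters._

def join(users: DataStream[User], tweets: DataStream[Tweet]): DataStream[TweetView] =
  users
    .keyBy(_.userId)                       // KeyedStream[User, UserId]
    .connect(tweets.keyBy(_.author))       // both sides keyed by the same UserId
    .process(new KeyedCoProcessFunction[UserId, User, Tweet, TweetView] {
      // Latest profile seen for the current user id, if any.
      private var userState: ValueState[User] = _
      // Tweets that arrived before their author's profile did.
      private var pendingTweets: ListState[Tweet] = _

      override def open(parameters: Configuration): Unit = {
        userState = getRuntimeContext.getState(
          new ValueStateDescriptor[User]("user", classOf[User]))
        pendingTweets = getRuntimeContext.getListState(
          new ListStateDescriptor[Tweet]("pending-tweets", classOf[Tweet]))
      }

      override def processElement1(
          user: User,
          ctx: KeyedCoProcessFunction[UserId, User, Tweet, TweetView]#Context,
          out: Collector[TweetView]): Unit = {
        userState.update(user)
        // Flush any tweets that were waiting for this author's profile.
        pendingTweets.get().asScala.foreach { t =>
          out.collect(TweetView(t.tweetId, t, user))
        }
        pendingTweets.clear()
      }

      override def processElement2(
          tweet: Tweet,
          ctx: KeyedCoProcessFunction[UserId, User, Tweet, TweetView]#Context,
          out: Collector[TweetView]): Unit = {
        Option(userState.value()) match {
          case Some(user) => out.collect(TweetView(tweet.tweetId, tweet, user))
          case None       => pendingTweets.add(tweet) // author not seen yet; buffer
        }
      }
    })
```

Note that this sketch keeps user profiles in keyed state indefinitely; a production job would bound that state, e.g. with state TTL or timers.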