badkoubeh / sfucmpt733

SFU CMPT 733 public repo

How to Build a Distributed Big Data Pipeline Using Kafka and Docker

Quickstart instructions

Run all of these commands from inside the foobar-kafka directory.

Create Docker networks

$ docker network create kafka-network                         # network for the Kafka cluster (zookeeper, broker, kafka-manager, and kafka-connect sink services)
$ docker network create cassandra-network                     # network for Cassandra (kafka-connect joins this network in addition to kafka-network)
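
Running `docker network create` a second time fails if the network already exists. The following is a minimal sketch of a guard around it; `ensure_network` is a hypothetical helper name and is not part of this repo.

```shell
# ensure_network NAME: create a Docker network only if it is missing.
# Hypothetical helper for illustration; not shipped with the repo.
ensure_network() {
  if docker network inspect "$1" >/dev/null 2>&1; then
    # inspect succeeds only when the network already exists
    echo "network $1 already exists"
  else
    docker network create "$1" >/dev/null && echo "network $1 created"
  fi
}
```

After defining it, `ensure_network kafka-network` and `ensure_network cassandra-network` can be run repeatedly without errors.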

Starting Cassandra

Cassandra is set up to run its keyspace and schema creation scripts on first startup, so it is ready to use.

$ docker-compose -f cassandra/docker-compose.yml up -d

Starting Kafka on Docker

$ docker-compose -f kafka/docker-compose.yml up -d            # start single zookeeper, broker, kafka-manager and kafka-connect services
$ docker ps -a                                                # sanity check to make sure services are up: kafka_broker_1, kafka-manager, zookeeper, kafka-connect service

Note: The Kafka front end (kafka-manager) is available at http://localhost:9000

Kafka-Connect REST interface is available at http://localhost:8083
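
The kafka-connect service may take a while to start accepting requests after `up -d` returns. As a rough sketch, the loop below polls the REST interface until it answers. `wait_for_connect` is a hypothetical helper name; `GET /connectors` is a standard Kafka Connect REST endpoint, while the retry count and 2-second delay are arbitrary assumptions.

```shell
# wait_for_connect [BASE_URL] [TRIES]: poll Kafka Connect until it responds.
# Hypothetical helper; the defaults (30 tries, 2 s apart) are arbitrary.
wait_for_connect() {
  url="${1:-http://localhost:8083}/connectors"
  tries="${2:-30}"
  while [ "$tries" -gt 0 ]; do
    # -s silences progress output, -f makes curl fail on HTTP errors
    if curl -sf "$url" >/dev/null; then
      echo "Kafka Connect is up"
      return 0
    fi
    tries=$((tries - 1))
    sleep 2
  done
  echo "Kafka Connect did not respond" >&2
  return 1
}
```

Calling `wait_for_connect` with no arguments checks the address given above.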

Starting Producers

$ docker-compose -f owm-producer/docker-compose.yml up -d     # start the producer that retrieves OpenWeatherMap data
$ docker-compose -f twitter-producer/docker-compose.yml up -d # start the producer for twitter

Starting Twitter classifier (plus Weather consumer for CSV dumps)

(Optionally, build the images first. Docker builds them automatically if they have not been built before, but this step is required to apply new changes.)

$ docker-compose -f consumers/docker-compose.yml build

Start consumers:

$ docker-compose -f consumers/docker-compose.yml up -d        # start the consumers

Check that all containers are running:

$ docker ps -a                                                # sanity check to make sure all services are up, including the producers and consumers
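
With this many containers, a stopped one is easy to miss in the full `docker ps -a` listing. As a small sketch, the function below prints only the containers that have exited; `list_exited` is a hypothetical name, while `--filter` and `--format` are standard `docker ps` options.

```shell
# list_exited: show name and status of every container that has exited.
# Hypothetical helper for illustration; prints nothing when all are running.
list_exited() {
  docker ps -a --filter "status=exited" --format '{{.Names}}: {{.Status}}'
}
```

If `list_exited` prints a container name, `docker logs <container_name>` is a reasonable next step.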

Teardown

To stop all running Kafka cluster services:

$ docker-compose -f consumers/docker-compose.yml down          # stop the consumers

$ docker-compose -f owm-producer/docker-compose.yml down       # stop open weather map producer

$ docker-compose -f twitter-producer/docker-compose.yml down   # stop twitter producer

$ docker-compose -f kafka/docker-compose.yml down              # stop zookeeper, broker, kafka-manager and kafka-connect services

$ docker-compose -f cassandra/docker-compose.yml down          # stop Cassandra

To remove the kafka-network and cassandra-network networks:

$ docker network rm kafka-network
$ docker network rm cassandra-network

Check that data is arriving in Cassandra

First, log in to the Cassandra container with the following command, or open a new CLI from Docker Desktop if you use it.

$ docker exec -it cassandra bash

Once logged in, bring up cqlsh with this command and query the twitterdata and weatherreport tables like this:

$ cqlsh --cqlversion=3.4.4 127.0.0.1 # make sure you use the correct cqlversion

cqlsh> use kafkapipeline; -- keyspace name

cqlsh:kafkapipeline> select * from twitterdata;

cqlsh:kafkapipeline> select * from weatherreport;

And that's it! You should see records coming in to Cassandra. Feel free to experiment by bringing containers down and back up again to see fault tolerance in action.
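
The same check can be made without opening an interactive session. The sketch below assumes the container name `cassandra` and the cqlversion used above; `count_rows` is a hypothetical helper name, and `-e` is the cqlsh option for executing a single statement. Note that `COUNT(*)` can be slow or time out on large tables.

```shell
# count_rows TABLE: print the row count of a table in the kafkapipeline keyspace.
# Hypothetical helper; assumes the container is named "cassandra" as above.
count_rows() {
  docker exec cassandra cqlsh --cqlversion=3.4.4 127.0.0.1 \
    -e "SELECT COUNT(*) FROM kafkapipeline.$1;"
}
```

Running `count_rows twitterdata` a few times in a row should show the count growing while the producers are up.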

Load data utility

To load backup CSV files into Cassandra, run the following from a console:

$ python consumers/python/cassandrautils.py twitter {PATH_TO_twitter.csv}
$ python consumers/python/cassandrautils.py weather {PATH_TO_weather.csv}

FAQs

How can I connect to a running container?

$ docker exec -it <container_name> bash                       # docker exec requires a command to run; bash opens a shell, as in the Cassandra example
