caioquirino / full-stack-big-data

Full-stack big data demo with Play Framework, Akka, Akka Streaming, HDFS, a batch layer, and CouchDB for querying. Provides simple real-time and batch word counts, calculates the difference between them (in case of data loss), and stores the result for querying.

full-stack-big-data

Basic tools and proof of concept functionality:

  • Full stack big data demo with Play Framework, Akka, Kafka, Akka Streaming, HDFS, batch layer, and CouchDB for querying.

  • Provides simple real-time and batch word counts.

  • Calculates the difference between the real-time and batch counts (in case of data loss) and stores the result for querying.
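
As a rough illustration of the last two points, here is a minimal sketch in plain Scala (no Akka, Kafka, or HDFS) of a running real-time count, a batch recount, and the difference between them. The object and method names are hypothetical and are not taken from this repository; the sign convention of the difference is also an assumption.

```scala
object SpeedVsBatch {
  // Counts whitespace-separated words; blank input yields 0.
  def countWords(text: String): Int =
    text.trim.split("\\s+").count(_.nonEmpty)

  // Real-time path: a running total updated as each message arrives.
  def runningCounts(messages: Seq[String]): Seq[Int] =
    messages.scanLeft(0)((total, msg) => total + countWords(msg)).tail

  // Batch path: recompute the total from everything in storage.
  def batchCount(stored: Seq[String]): Int =
    stored.map(countWords).sum

  // Assumption: a positive value means the real-time path saw fewer
  // words than the batch path, i.e. messages were lost on the way.
  def difference(realTime: Int, batch: Int): Int =
    batch - realTime
}
```

If the real-time path drops a message that the batch path still finds in storage, `difference` returns the number of words that went missing.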

Purpose:

"The morale effects are startling. Enthusiasm jumps when there is a running system, even a simple one. Efforts redouble when the first picture from a new graphics software system appears on the screen, even if it is only a rectangle. One always has, at every stage in the process, a working system. I find that teams can grow much more complex entities in four months than they can build."

—FREDERICK P. BROOKS, JR., The Mythical Man-Month

  • Minimum viable working app
  • Leaves plenty of space for "filling in the working sub-sections later"

Architectural Tenets

  • "lightweight" - remove anything unnecessary
  • "distributed" - separate pieces that talk to each other, so that each can survive if another piece dies.
  • "open source" - communities such as Scala Reddit and Stack Overflow should be able to look at the code and report problems or vulnerabilities.
  • "real time" - should provide immediate results in addition to batch results.

Architecture:

Data-centric view:

Input: String. Output: (timestamp, Int), where the Int is a word count.
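
A minimal sketch of this data flow in plain Scala. The `WordCount` case class, the millisecond timestamp, and the choice to count whitespace-separated tokens are all assumptions made for illustration; the README only specifies the shape `(timestamp, Int)`.

```scala
object DataView {
  // Hypothetical record type standing in for the (timestamp, Int) pair.
  final case class WordCount(timestampMillis: Long, count: Int)

  // Counts whitespace-separated words; blank input yields 0.
  def countWords(text: String): Int =
    text.trim.split("\\s+").count(_.nonEmpty)

  // Pairs the word count with the time the text was processed.
  // The clock value is a parameter so the function stays easy to test.
  def process(text: String, nowMillis: Long = System.currentTimeMillis()): WordCount =
    WordCount(nowMillis, countWords(text))
}
```

For example, `DataView.process("hello big data", 42L)` yields `WordCount(42, 3)`.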

GUI-centric view:

Input Text
text
...
...

^ Submit ^

A simple text box with a submit button. Clicking "Submit" brings the user to a query menu.

Time interval: _____ to ______ | Submit |

The user enters the time interval for the query. Data is pulled from the backend to produce a result. The user can compare the "speed" result with the actual result produced from the query:

Count from web framework: X
Count from streaming layer: Y
Count from batch layer: Z

The user can then compare the counts produced by the different layers.
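
The result screen could be produced by something like the sketch below. `LayerCounts` and `render` are hypothetical names, and the final verdict line is an addition for illustration that the mock-up above does not show.

```scala
object LayerReport {
  // Hypothetical container for the three counts shown to the user (X, Y, Z).
  final case class LayerCounts(web: Int, streaming: Int, batch: Int) {
    // True only when all three layers report the same count.
    def allAgree: Boolean = web == streaming && streaming == batch
  }

  // Renders the three result lines plus a one-line verdict.
  def render(c: LayerCounts): String = {
    val lines = Seq(
      s"Count from web framework: ${c.web}",
      s"Count from streaming layer: ${c.streaming}",
      s"Count from batch layer: ${c.batch}"
    )
    val verdict =
      if (c.allAgree) "All layers agree"
      else "Layers disagree: possible data loss"
    (lines :+ verdict).mkString("\n")
  }
}
```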

URL-centric view:

/* This gets the page with the text box */

  • get /home

/* This sends the submitted string to the word count page, changing the state of the system in the backend */

  • post /home

/* This gets the page with the time interval query */

  • get /home/time/

/* This sends startTime and endTime to the time interval query, getting the counts for that time interval */

  • get /home/time/?start="1:10:2"&end="1:10:5"
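
One way the time interval query might be evaluated, sketched in plain Scala without Play. It assumes the `start` and `end` values use an hours:minutes:seconds format, as the example `1:10:2` suggests, and that stored samples are `(secondOfDay, wordCount)` pairs; both are assumptions, and `TimeQuery` is a hypothetical name.

```scala
object TimeQuery {
  // Parses "H:M:S" into seconds since midnight; None if malformed.
  def toSeconds(hms: String): Option[Int] =
    hms.split(":") match {
      case Array(h, m, s) =>
        scala.util.Try(h.toInt * 3600 + m.toInt * 60 + s.toInt).toOption
      case _ => None
    }

  // Sums the word counts of all samples inside [start, end], inclusive.
  // Returns None when either bound is malformed or start is after end.
  def countInInterval(samples: Seq[(Int, Int)], start: String, end: String): Option[Int] =
    for {
      from <- toSeconds(start)
      to   <- toSeconds(end)
      if from <= to
    } yield samples.collect { case (t, n) if t >= from && t <= to => n }.sum
}
```

Returning `Option` keeps malformed intervals out of the happy path, so the web layer can decide how to report them.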

Compilation (not yet available):

The project is broken up into separately compilable units. Each component is its own compilation unit, grouped under two "super units": ingestion and processing. Ideas for compilation units include a web server unit (handles requests), a reactive Kafka unit (handles relaying), and a batch processing unit.

According to the book "The Mythical Man-Month", teams tend to split along the boundaries of the architecture, so teams can naturally divide up by the independent components they work on.
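
Since compilation is not yet available, the following is only a sketch of how such a layout could look as an sbt multi-project build. The project names, directory names, and the assignment of units to the two super units are assumptions, not the repository's actual build.

```scala
// build.sbt (hypothetical layout)

// Individual compilation units.
lazy val webServer       = (project in file("web-server"))
lazy val reactiveKafka   = (project in file("reactive-kafka"))
lazy val batchProcessing = (project in file("batch-processing"))

// "Super units" that only aggregate their members.
// Assumption: the web server and Kafka relay belong to ingestion,
// and the batch job belongs to processing.
lazy val ingestion  = (project in file("ingestion")).aggregate(webServer, reactiveKafka)
lazy val processing = (project in file("processing")).aggregate(batchProcessing)

// Root project, so that `sbt compile` builds everything.
lazy val root = (project in file(".")).aggregate(ingestion, processing)
```

With a layout like this, `sbt webServer/compile` would build a single unit and `sbt ingestion/compile` would build a whole super unit.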

About


License: Other


Languages

Scala 85.1%, Shell 14.9%