suman6565/Spark-Tutorials

Notes from following along with Jose Portilla's Udemy course

Hadoop is a framework for distributing very large files across multiple machines
it uses the Hadoop Distributed File System (HDFS) to allow users to work with large data sets
Hadoop uses MapReduce to allow for computation across the distributed data set
HDFS splits files into blocks of data (128 MB by default), each replicated 3 times and distributed across machines to support fault tolerance
smaller blocks provide more parallelization; multiple copies prevent loss of data
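
The block and replication arithmetic above can be sketched in plain Python. This is only an illustration, not HDFS code: the function names, the node names, and the round-robin placement are assumptions made for the example (real HDFS placement is rack-aware and more involved).

```python
import math

BLOCK_SIZE_MB = 128   # default block size from the notes above
REPLICATION = 3       # default replication factor from the notes above


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (number of blocks, total block replicas, raw storage used in MB)."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    # The last block only occupies as much space as it actually holds,
    # so raw storage is simply the file size times the replication factor.
    raw_storage_mb = file_size_mb * replication
    return blocks, replicas, raw_storage_mb


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Toy round-robin placement: each block goes to `replication` distinct nodes."""
    if len(nodes) < replication:
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


# A hypothetical 1 GB file on a hypothetical 4-node cluster.
blocks, replicas, raw_mb = hdfs_footprint(1024)
print(blocks, replicas, raw_mb)  # 8 24 3072

placement = place_replicas(blocks, ["node1", "node2", "node3", "node4"])
# If node1 fails, every block still has at least two surviving copies.
survivors = {b: [n for n in ns if n != "node1"] for b, ns in placement.items()}
print(min(len(ns) for ns in survivors.values()))  # 2
```
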
MapReduce splits a computational task across the distributed set of files
MapReduce consists of a Job Tracker and multiple Task Trackers
the Job Tracker sends code to run on the Task Trackers
the Task Trackers allocate CPU and memory for the tasks and monitor them on the worker nodes
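
To make the map / shuffle / reduce idea concrete, here is a minimal single-process word count in plain Python. It is a sketch of the programming model only, not Hadoop's API: the function names and the sample input splits are made up for the example, and in a real cluster each split would be processed by a task on a different worker node.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def shuffle(mapped_outputs):
    """Shuffle: group all emitted values by key across every split."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values for each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}


# Two hypothetical input splits, standing in for blocks stored on different nodes.
splits = [
    ["spark is fast", "hadoop is distributed"],
    ["spark keeps data in memory"],
]

mapped = [map_phase(split) for split in splits]   # one map task per split
counts = reduce_phase(shuffle(mapped))
print(counts["spark"], counts["is"])  # 2 2
```
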

Spark is one of the latest frameworks for quickly and easily handling big data
created at UC Berkeley's AMPLab (started in 2009, open-sourced in 2010); donated to the Apache Software Foundation in 2013
written in Scala, so Scala normally gets the latest features
Scala runs on the JVM, so the Java API normally does well too!
Python and R APIs are usually the slowest to catch up
flexible alternative to MapReduce, i.e. it handles the splitting of computational tasks across nodes

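Hadoop and MapReduce are tightly coupled because MapReduce normally expects its files to be stored in HDFS
Spark can perform some operations up to 100x faster than MapReduce, mainly for in-memory workloads
Spark can work with HDFS and with other data sources (e.g. local files, Cassandra, Amazon S3)
MapReduce writes most data to disk after each map and reduce step, while Spark keeps most data in RAM and spills over to disk only when necessary. This is what makes Spark faster!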

Resilient Distributed Dataset (RDD)
- distributed collection of data
- fault tolerant
- parallel operation, partitioned
- ability to use many data sources
- immutable
- lazily evaluated
- cacheable
even if working with DataFrames, they are still RDDs under the hood
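
The "immutable, lazily evaluated, cacheable" properties can be illustrated with a toy class in plain Python. This is not Spark's API: `LazyDataset` and its methods are hypothetical names chosen to loosely mirror the RDD vocabulary, and everything runs in a single process. Each transformation returns a new object that only records what to do; nothing is computed until `collect()` is called, and `cache()` keeps the result for reuse. Recording the chain of transformations is also the idea behind RDD fault tolerance, since a lost partition can be recomputed from its source.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, single-process, NOT Spark's API)."""

    def __init__(self, source, transforms=()):
        self._source = source                 # zero-argument function returning records
        self._transforms = tuple(transforms)  # recorded, not yet executed
        self._cache_enabled = False
        self._cache = None

    def map(self, fn):
        # Immutable: return a new dataset instead of changing this one.
        return LazyDataset(self._source, self._transforms + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._transforms + (("filter", predicate),))

    def cache(self):
        # Ask for the result to be kept after the first computation.
        self._cache_enabled = True
        return self

    def collect(self):
        # Action: only now are the source read and the transformations applied.
        if self._cache is not None:
            return list(self._cache)
        records = list(self._source())
        for kind, fn in self._transforms:
            if kind == "map":
                records = [fn(r) for r in records]
            else:
                records = [r for r in records if fn(r)]
        if self._cache_enabled:
            self._cache = list(records)
        return records


reads = []  # records every time the source is actually read


def load():
    reads.append(1)
    return [1, 2, 3, 4, 5]


base = LazyDataset(load)
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x).cache()
print(len(reads))               # 0 (nothing has been computed yet)
print(evens_squared.collect())  # [4, 16]
evens_squared.collect()         # served from the cache
print(len(reads))               # 1 (the source was read only once)
```
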

a local process is limited to the computational resources of a single machine
a distributed process has access to the computational resources of a number of machines connected through a network
after a certain point, it is easier to scale out to many lower-CPU machines than it is to scale up a single machine
a distributed system is fault tolerant: if one machine fails, the rest of the cluster keeps running
