commit-live-students / big_data_hadoop_in_class

Big Data and Hadoop

Big data refers to data sets so large and complex that traditional data processing applications struggle to handle them, which makes it computationally difficult to reveal patterns, trends, and associations, especially those relating to human behavior and interactions. Apache Hadoop is an open-source software framework for distributed storage and processing of big data sets using the MapReduce programming model.
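
To make the MapReduce programming model concrete, here is a minimal word-count sketch in plain Python. It runs in a single process and does not use Hadoop at all; it only mirrors the map, shuffle, and reduce steps that Hadoop would distribute across a cluster. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reduce step: sum the counts collected for a single word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["war and peace", "peace and war and peace"]
    print(word_count(sample))
```

The same `word_count` function could be fed the lines of a text file such as the war_and_peace dataset. On a real cluster, Hadoop would run the map and reduce steps on many machines in parallel and handle the shuffle between them.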

At a glance

  • In Class Instruction: 4 Hours
    • In Class code along Dataset: war_and_peace

In Class Activity

  • Installation of Hadoop
  • Hands-on exercise with dataset

Pre Reads

  1. History of how big data actually evolved
  2. Big Data Story Map
  3. Why Big Data Matters

Learning Objectives

  • Understand the motivation for Big Data
  • Understand the storage layer underlying Big Data - HDFS
  • Store and retrieve data in HDFS

Agenda

  • Big Data Motivation
  • Introduction to Hadoop & Ecosystem
  • Set up Cloudera environment
  • Interaction with HDFS

Slides

Big Data and Hadoop Introduction

Post Reads

  1. Comparing the top Hadoop distributions
  2. Original MapReduce paper
  3. How Google uses Big Data
  4. How CERN uses Big Data
