rafomiya / semantix-big-data

A repository for my notes on the Big Data Foundations course from the Semantix Academy.

Big Data Engineer

This repository documents and stores my notes from the Semantix Big Data Engineer training.

Summary

  1. Big Data Foundations

    1. Hadoop
      1. Hadoop's structure
      2. Basic use of Docker
    2. HDFS
      1. HDFS commands
      2. Exercise: HDFS commands
    3. Hive
      1. Hive structure
      2. Exercise: Hive commands
      3. Exercise: HiveQL commands
      4. Partitioning
      5. Repair a table
      6. Exercise: partitioning
      7. Exercise: reading data with Hive
      8. Format and compression
      9. Format and compression in Hive
      10. Exercise: optimized tables
    4. Sqoop
      1. Introduction
      2. Exercise: create the databases
      3. Sqoop CLI commands
      4. Database connection
      5. Exercise: Sqoop commands
      6. Importing data
        1. Import commands
        2. Storing your data
        3. Defining custom delimiters
        4. Exercise: employees database
        5. Format and compression
        6. Exercise: optimize import
      7. Parallelism
      8. Jobs
      9. Incremental load
      10. Exercise: Sakila database
      11. Sqoop & Hive
      12. Export data
      13. Exercise: Hive and exports
    5. HBase
      1. Structure
      2. Commands
      3. Manipulate tables
      4. Insert and update data
      5. Read data
      6. Delete data
      7. Alter tables
      8. Exercise: CRUD
    6. Spark
      1. Introduction
      2. Dataframes
        1. Introduction
        2. Reading data
        3. Actions
        4. Transformations
        5. Exercise: Manage Dataframe
        6. Schemas
        7. Joins
        8. Exercise: schemas and joins
      3. SparkSQL
      4. API catalog
      5. Exercise: API catalog
      6. SparkSQL vs. Spark operations
      7. Exercise: SparkSQL vs. Spark operations
  2. MongoDB

    1. Introduction to NoSQL
    2. Introduction to MongoDB
    3. Basic commands
    4. Exercise: basic commands
    5. Making queries
      1. Basic query commands
      2. Exercise: basic query commands
      3. Other queries
      4. Exercise: other queries
    6. The update command
      1. Introduction
      2. Dates
      3. Arrays
      4. Exercise: the update command
    7. Deleting
    8. Exercise: CRUD
    9. Optimizing queries
      1. Indexes
      2. Querying with indexes
      3. Exercise: indexes
      4. Regex
      5. Exercise: regex
    10. Aggregations
      1. Introduction
      2. Exercise: aggregations I
      3. Joins
      4. Exercise: aggregations II
    11. Replica Set
    12. Shards
  3. Kafka

    1. Kafka's basics
      1. Introduction
      2. Architecture
      3. Basic commands
      4. Exercise: CLI
    2. The graphic interface
      1. Control Center
      2. Exercise: control center
    3. KSQL
      1. Introduction
      2. Stream operations
      3. Stream aggregations
      4. Exercise: KSQL
    4. Datagen
      1. KSQL Datagen
      2. Exercise: KSQL datagen
    5. Schema Registry
      1. Introduction
      2. The AVRO format
      3. AVRO Console Consumer
      4. AVRO Console Producer
      5. AVRO Stream
    6. Exercise: Kafka
  4. Redis

    1. Introduction to Redis
    2. The key-value structure
    3. Strings
    4. Exercise: strings
    5. Lists
    6. Exercise: lists
    7. Sets
    8. Exercise: sets
    9. Sorted sets
    10. Exercise: sorted sets
    11. Hashes
    12. Exercise: hashes
    13. Pub/sub
    14. Exercise: pub/sub
  5. Elastic

    1. Elastic's basics
      1. Introduction
      2. Communication
      3. Basic operations
      4. Exercise: basic operations
    2. Bulk
      1. Bulk API
      2. Exercises: Bulk API
    3. Queries
      1. Introduction
      2. Limit and pagination
      3. Exercises: limit and pagination
      4. Managing indexes
      5. Mapping
      6. Reindex
      7. Exercise: indexes
      8. Working with filters
      9. Bool queries
      10. Exercise: bool queries
      11. Search order
      12. Exercise: search order
      13. Range
      14. Time range
      15. Exercise: time range
    4. Analyzer
      1. Introduction
      2. Testing searches
  6. Spark

    1. Jupyter Notebook
    2. Spark session
    3. API Catalog
    4. Exercise: setup environment
    5. Exercise: setup Jupyter
    6. Reading a CSV
    7. RDD
      1. Introduction
      2. Data reading
      3. Transformations
        1. Map & Flatmap
        2. Filter & Reduce
        3. Sort
      4. Data visualization
      5. Exercise: RDD
      6. Partitions
      7. Exercise: partitions
    8. Schema Handling
      1. Creating
      2. Testing
      3. Exercise: schemas
    9. Datasets
      1. Basic concepts
      2. Creating datasets
      3. Transformations
      4. Exercise: datasets
    10. The withColumn command
      1. Introduction
      2. Timestamp functions
      3. Substring function
      4. Split function
      5. Exercise: functions I
      6. Cast, regex replace & when
      7. Aggregations
      8. Exercise: functions II
    11. Spark application
    12. Spark Streaming
      1. Basic concepts
      2. Reading data
      3. Exercise: reading DStream
      4. Operations
      5. Exercise: word count
    13. Spark Streaming with Kafka
      1. Integration
      2. Kafka review
      3. Dependencies
      4. Code structure and imports
      5. Creating a DStream
      6. Exercise: Spark Streaming with Kafka I
      7. Exercise: Spark Streaming with Kafka II
    14. Structured Streaming
      1. Basic concepts
      2. Reading data
      3. Exercise: structured streaming
    15. Application Optimizations
      1. Shared variables
      2. User Defined Functions
      3. Tuning
    16. Structured Streaming with Kafka
      1. Basic concepts
      2. Reading and writing data
      3. Exercise: Structured Streaming with Kafka
