Apache Spark - Recap

Key Takeaways

The key takeaways from this section include:

  • Big Data usually refers to datasets that grow so large that they become awkward to work with using traditional database management systems and analytical approaches
  • Big Data typically ranges from terabytes (TB) to petabytes (PB) in size
  • MapReduce can be used to split big datasets into smaller chunks that are distributed over several machines, which is what makes Big Data analytics tractable
  • Before starting to work, you need to install Docker and Kitematic on your environment
  • Make sure to test your installation to confirm everything is working
  • When you start working with PySpark, you first have to create a SparkContext (see the first sketch after this list)
  • Creating RDDs is essential when working with PySpark
  • collect(), count(), first(), take(), and reduce() are examples of actions, while filter() is an example of a transformation
  • Machine Learning at the scale of big data can be done with Spark using the ml library (see the second sketch after this list)
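
The first sketch below ties several of these points together: it is a minimal, illustrative PySpark example (assuming a working local PySpark installation) that creates a SparkContext, builds an RDD, and runs the transformations and actions listed above. The app name and values are made up for the example.

    from pyspark import SparkContext

    # Create a SparkContext that runs locally on all available cores
    sc = SparkContext('local[*]', 'recap-example')

    # Create an RDD from a plain Python collection
    rdd = sc.parallelize(range(1, 11))

    # filter() is a transformation: it is lazy and returns a new RDD
    evens = rdd.filter(lambda x: x % 2 == 0)

    # collect(), count(), first(), take(), and reduce() are actions:
    # they trigger computation and return results to the driver
    print(evens.collect())                  # [2, 4, 6, 8, 10]
    print(evens.count())                    # 5
    print(evens.first())                    # 2
    print(evens.take(3))                    # [2, 4, 6]
    print(rdd.reduce(lambda a, b: a + b))   # 55

    sc.stop()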

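The second sketch is a minimal example of machine learning with Spark's ml library, assuming a SparkSession and a tiny, made-up DataFrame; the column names and numbers are hypothetical and only illustrate the fit-a-model workflow.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    spark = SparkSession.builder.appName('recap-ml-example').getOrCreate()

    # A tiny illustrative DataFrame (column names and values are hypothetical)
    df = spark.createDataFrame(
        [(1.0, 2.0, 5.0), (2.0, 1.0, 7.0), (3.0, 4.0, 12.0)],
        ['feature_1', 'feature_2', 'label']
    )

    # Spark ML estimators expect all features packed into a single vector column
    assembler = VectorAssembler(inputCols=['feature_1', 'feature_2'], outputCol='features')
    train = assembler.transform(df)

    # Fit a simple linear regression model and inspect its parameters
    lr = LinearRegression(featuresCol='features', labelCol='label')
    model = lr.fit(train)
    print(model.coefficients, model.intercept)

    spark.stop()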