Useful Links:
- https://academy.databricks.com/exam/databricks-certified-associate-developer
- https://spark.apache.org/docs/latest/
Databricks
Reading:
In addition, Sections I, II, and IV of Spark: The Definitive Guide and Chapters 1-7 of Learning Spark should also be helpful in preparation.
Important Documentation
- scala
- https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/SparkSession.html
- https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/functions$.html
- https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Column.html
- http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameReader.html
- http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html
- http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameWriter.html
- python
Languages:
- Python Basics: https://www.w3schools.com/python/python_intro.asp
- Scala Basics: https://www.scala-exercises.org/scala_tutorial/terms_and_types
Certification notes: https://github.com/bclipp/databricks_spark_cert_study_guide/blob/master/notes.md
Certification Videos:
- https://youtu.be/qEKfyoOUKb8
- https://www.youtube.com/watch?v=AoVmgzontXo
- https://www.youtube.com/watch?v=kkOG_aJ9KjQ
- https://www.youtube.com/watch?v=i7l3JQRx7Qw
- https://www.youtube.com/watch?v=f8j5t_xaly4
- https://www.youtube.com/watch?v=dmL0N3qfSc8
- https://www.youtube.com/watch?v=Ofk7G3GD9jk
- https://www.youtube.com/watch?v=_ArCesElWp8
- https://www.youtube.com/watch?v=YgQgJceojJY
- https://youtu.be/_VwyKCgG3mE
- https://youtu.be/iwQel6JHMpA
Spark 3.0 feature highlights:
- highlights: https://www.youtube.com/watch?v=f8j5t_xaly4&t=5s
- Adaptive query execution
- https://databricks.com/session_na20/adaptive-query-execution-speeding-up-spark-sql-at-runtime
- https://docs.databricks.com/spark/latest/spark-sql/aqe.html
- https://www.youtube.com/watch?v=jlr8_RpAGuU
- https://medium.com/agile-lab-engineering/spark-3-0-first-hands-on-approach-with-adaptive-query-execution-part-1-ff987f66b5c0
- Dynamic partition pruning:
- ANSI SQL compliance: https://towardsdatascience.com/spark-3-0-sql-feature-update-ansi-sql-compliance-store-assignment-policy-upgraded-query-94d8d8618ddf
- Significant Python API improvements
- New UI for Structured Streaming
- 40x speed-up on R UDFs
- Accelerator aware scheduler
- SQL Ref Documentation
- Koalas: https://www.youtube.com/watch?v=Ux1A8O6K2Xg&t=10s
Data:
- https://www.data.gov/open-gov/
- https://data.richmondgov.com/
- https://data.virginia.gov/
- https://github.com/awesomedata/awesome-public-datasets
- https://www.columnfivemedia.com/100-best-free-data-sources-infographic
Theory:
- window functions:
- caching https://towardsdatascience.com/best-practices-for-caching-in-spark-sql-b22fb0f02d34
- repartition vs coalesce: https://www.youtube.com/watch?v=pP-ohMzyFc4&list=RDCMUCoVVyUViJ3mfaEKVjAJSnVA&index=24
- tungsten: https://www.youtube.com/watch?v=mRf0GvNDlyc&list=RDCMUCoVVyUViJ3mfaEKVjAJSnVA&index=11
- partitioning and bucketing:
- spark UI: https://www.youtube.com/watch?v=rNpzrkB5KQQ
- Spark Architecture:
- Job Execution
- Dynamic Resource Allocation: https://www.youtube.com/watch?v=-9bh_Oue9GM
- Shuffling and joins
- Quickstart guide: https://spark.apache.org/docs/latest/quick-start.html
- Dataframes: https://spark.apache.org/docs/latest/sql-programming-guide.html
- Cluster mode: https://spark.apache.org/docs/latest/cluster-overview.html
- Spark session: https://databricks.com/blog/2016/08/15/how-to-use-sparksession-in-apache-spark-2-0.html
- Scala http://spark.apache.org/docs/latest/api/scala/org/apache/spark/index.html
- SparkSQL http://spark.apache.org/docs/latest/api/sql/index.html
- Adaptive Query Execution:
- Spark DAG: https://data-flair.training/blogs/dag-in-apache-spark/
- Catalyst, execution plans:
- Caching data
- UDF (Python, Scala and SQL)
- Shuffling:
- https://sparkbyexamples.com/spark/spark-shuffle-partitions/#:~:text=The%20Spark%20SQL%20shuffle%20is,worker%20nodes%20in%20a%20cluster.
- https://xuechendi.github.io/2019/04/15/Spark-Shuffle-and-Spill-Explained
- https://spark.apache.org/docs/latest/rdd-programming-guide.html#shuffle-operations
- http://hydronitrogen.com/apache-spark-shuffles-explained-in-depth.html
- https://www.linkedin.com/pulse/spark-sql-3-common-joins-explained-ram-ghadiyaram/
- Cluster: https://spark.apache.org/docs/latest/cluster-overview.html
- Submitting applications: https://spark.apache.org/docs/latest/submitting-applications.html
- Performance tuning:
- Catalyst, logical and physical plan: https://medium.com/@Shkha_24/catalyst-optimizer-the-power-of-spark-sql-cad8af46097f
- Old exam notes: https://github.com/vivek-bombatkar/Databricks-Apache-Spark-2X-Certified-Developer#a
- joining dataframes https://kb.databricks.com/data/join-two-dataframes-duplicated-columns.html
Non-cert Videos:
- (Delta Lake and Koalas) https://www.youtube.com/watch?v=scM_WQMhB3A
- (Structured Streaming) https://www.youtube.com/watch?v=rl8dIzTpxrI
- (Databricks secrets) https://www.youtube.com/watch?v=HZ00AznWvKc
- (Parquet in depth) https://www.youtube.com/watch?v=_0Wpwj_gvzg
- (troubleshooting Spark) https://www.youtube.com/watch?v=s5p15QT0Zj8
- (out of memory) https://www.youtube.com/watch?v=FdT5o7M35kU
- (hash partitioning vs range partitioning) https://www.youtube.com/watch?v=BvyOJuik8FA
Coursera course (ignore ML content): https://www.coursera.org/learn/spark-sql
Useful links (older; many topics are no longer covered): https://towardsdatascience.com/my-10-recommendations-after-getting-the-databricks-certification-for-apache-spark-53cd3690073
Spark SQL module, Documentation:
- https://spark.apache.org/docs/latest/api/python/pyspark.sql.html
- https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/index.html
- https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameReader.html
- https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameNaFunctions.html
- https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html
- https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameWriter.html
- https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/functions$.html
API: