rberenguel / pyspark-arrow-pandas

Presentation about PySpark and how Arrow makes it faster

Internals of Speeding Up PySpark with Arrow

Presentation I (@berenguel) gave at the PyBCN meetup in June 2018, Spark London in September 2018, Spark Barcelona, and Spark Summit Europe 2019, explaining how PySpark works and how Spark 2.3/2.4 optimised UDFs for use with Pandas. A recording of this talk (the one given at Python Barcelona, in English) is available here; the recording from Spark Summit is available here. You can find the slides here (some images might look slightly blurry). I recommend checking the version with presenter notes, which is only available here.

If you want additional information about Spark in general, I gave an introductory talk on Spark with Carlos Peña, which you can find here.


This presentation is written in Markdown and prepared for use with Deckset. The drawings were done on an iPad Pro using Procreate. Only the final PDF and the source Markdown are available here. Sadly, the animated GIFs are just static images in the PDF.


You can find a reveal.js export of the version given at Spark Summit here. It is not 100% faithful to the PDF/Deckset version, but it is close enough (and the animated GIFs play). The export was generated with this and tweaked to add a footer.

