ber2 / pyday2020-counting-votes-with-dask

Notebooks and code for my talk at the Bcn PyDay 2020

Counting votes with Dask

Notebooks and code for my talk at the Barcelona PyDay, December 12th, 2020.

The talk shows how Dask does an excellent job of analyzing datasets that do not fit in a single machine's memory, while adding little overhead and without leaving the single-machine context. The example is built around counting votes to find out the winner of an election.
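The core idea, counting without ever holding the whole dataset in memory, can be sketched in plain Python: tally each chunk separately, then merge the partial tallies. This is only an illustration of the partition-wise approach, not the code from the notebooks. The `candidate` column name, the chunk size and the tiny sample data are assumptions. The second function shows a roughly equivalent `dask.dataframe` call; it is only defined here and needs Dask installed to be called.

```python
import csv
import io
from collections import Counter
from itertools import islice


def count_votes_in_chunks(rows, column="candidate", chunk_size=2):
    """Tally votes chunk by chunk, holding only one chunk in memory at a time."""
    total = Counter()
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            break
        # Partial tally for this chunk (the "map" step) ...
        partial = Counter(row[column] for row in chunk)
        # ... merged into the running total (the "reduce" step).
        total.update(partial)
    return total


def count_votes_with_dask(path_pattern, column="candidate"):
    """Roughly equivalent Dask call; Dask is imported lazily, only when called."""
    import dask.dataframe as dd

    frame = dd.read_csv(path_pattern)  # lazy: nothing is loaded yet
    return frame[column].value_counts().compute()  # pandas Series of counts


# Tiny in-memory stand-in (hypothetical data) for a CSV too large for memory.
sample = io.StringIO("candidate\nalice\nbob\nalice\ncarol\nalice\n")
tally = count_votes_in_chunks(csv.DictReader(sample))
winner, winner_votes = tally.most_common(1)[0]
```

With Dask, the chunks become partitions that can be processed in parallel, but the shape of the computation is the same.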

Disclaimer: all of the data shown in the talk has been artificially generated and any relationship between the election example portrayed in the talk and recent events in a faraway country is pure coincidence.

Regarding Dask

Dask is a project with excellent documentation. Since we have focused on a simple use case for the purpose of the talk, many of the nuances about how Dask works have only been sketched.

If you wish to play more with it, I highly recommend checking the documentation. The section on use cases is a good showcase of what is possible.

In particular, Dask can be set up either on a single machine or on a cluster. We have focused on the first case and have used the Dask JupyterLab Extension for the management of the scheduler. I recommend checking the documentation on the single-machine scheduler for more details on ways to configure a single-machine cluster.
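For readers who prefer to start the single-machine cluster from code rather than from the JupyterLab extension, the sketch below may help. The sizing rule (half as many worker processes as cores) and the default memory limit are assumptions made for illustration, not recommendations from the talk. `LocalCluster` and `Client` come from `dask.distributed` and are imported lazily, so the helper that uses them is only defined here, not run.

```python
import os


def local_cluster_options(total_cores, workers=None, memory_per_worker="2GB"):
    """Split the available cores across worker processes.

    Returns keyword arguments accepted by dask.distributed.LocalCluster.
    The sizing heuristic is an assumption made for this sketch.
    """
    workers = workers or max(1, total_cores // 2)
    threads = max(1, total_cores // workers)
    return {
        "n_workers": workers,
        "threads_per_worker": threads,
        "memory_limit": memory_per_worker,  # applies to each worker
    }


def start_local_client(**options):
    """Start a single-machine cluster and connect to it (needs dask.distributed)."""
    from dask.distributed import Client, LocalCluster

    cluster = LocalCluster(**options)
    return Client(cluster)


# Options for the current machine, and a fixed example for an 8-core machine.
options = local_cluster_options(total_cores=os.cpu_count() or 1)
example = local_cluster_options(total_cores=8, workers=4)
```

Calling `start_local_client(**options)` in a notebook would then return a client connected to the local cluster.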

Video for the talk

The video for the talk is available here.

Slides for the talk

I used beautiful.ai to craft the slides; they are available here.

Doing this at home

Data for the examples

Since the data for the examples is artificially generated, I have deemed it unnecessary to publish a dataset weighing a few gigabytes with zero interest beyond this talk.

Instead, I have made the code that generates the dataset available. Please have a look and feel free to use it to generate data for your own purposes.
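As a rough idea of what such a generator can look like, here is a minimal, hypothetical sketch. It is not the generator from this repository. The column names, candidate names and weights are all invented for illustration.

```python
import csv
import io
import random


def generate_votes(n_votes, candidates, weights, seed=0):
    """Yield (voter_id, candidate) rows drawn with the given weights."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    for voter_id in range(n_votes):
        yield voter_id, rng.choices(candidates, weights=weights, k=1)[0]


def write_votes_csv(handle, rows):
    """Write the rows to any file-like object as CSV with a header line."""
    writer = csv.writer(handle)
    writer.writerow(["voter_id", "candidate"])
    writer.writerows(rows)


# Write a small sample to memory; pass an open file instead to write to disk.
buffer = io.StringIO()
write_votes_csv(buffer, generate_votes(1000, ["alice", "bob", "carol"], [5, 3, 2]))
lines = buffer.getvalue().splitlines()
```

Because the rows are produced by a generator, increasing `n_votes` grows the output file without growing memory use.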

Installing Dask and the JupyterLab extension

On conda

As explained during the talk, Dask ships with Anaconda. To replicate the setup used in the talk, first install JupyterLab:

conda install jupyterlab nodejs

Then, install the Dask lab extension:

conda install -c conda-forge dask-labextension

Finally, build the extension:

jupyter labextension install dask-labextension
jupyter serverextension enable dask_labextension

Using poetry

I have provided the bare minimum necessary to run the examples in a fresh Python 3.8 virtual environment using pyenv and poetry.

If you have these tools installed, set up a virtual env running Python 3.8 and, from the root of this repository, execute the following:

pip install -U pip
poetry install
jupyter labextension install dask-labextension
jupyter serverextension enable dask_labextension
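After either installation route, a quick check from Python can confirm that the packages are visible in the active environment. The sketch below uses only the standard library. The list of package names is an assumption about what the setup needs, based on the commands above.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level packages from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


# Assumed set of packages for the setup described above.
required = ["dask", "distributed", "jupyterlab", "dask_labextension"]
missing = missing_packages(required)
if missing:
    print("Not installed in this environment:", ", ".join(missing))
else:
    print("All packages needed for this setup were found.")
```

This only checks that the packages can be located; it does not verify that the JupyterLab extension was built or enabled.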

Notebooks

The notebooks are the four ipynb files at the root of this repository, numbered in the order in which they are presented in the talk.

License: MIT License


Languages

Language:Jupyter Notebook 100.0%