ljvmiranda921 / comments.ljvmiranda921.github.io

Blog comments for my personal blog: ljvmiranda921.github.io

How to use Jupyter Notebooks in 2020 (Part 2: Ecosystem growth)

ljvmiranda921 opened this issue

Comment written by Michael on 03/16/2020 20:55:41

Great posts! You might want to take another look at jupytext; I think it's a lot more flexible and powerful than nbconvert. With nbconvert, it's a one-way process: you take a Jupyter notebook and convert it to text, HTML, Markdown, etc.
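For example, here's a rough sketch of that one-way export using nbconvert's Python API (the notebook file name is hypothetical, and I'm assuming nbconvert and nbformat are installed):

```python
import nbformat
from nbconvert import MarkdownExporter

# Load an existing notebook (hypothetical file name)
nb = nbformat.read("analysis.ipynb", as_version=4)

# Exporting to Markdown is easy in this direction...
body, resources = MarkdownExporter().from_notebook_node(nb)

# ...but there's no built-in way to turn the Markdown back into an .ipynb.
with open("analysis.md", "w") as f:
    f.write(body)
```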

Jupytext, on the other hand, lets you pair notebooks with scripts or other text formats and edit in either one, so you can switch pretty seamlessly between a notebook interface and a full-fledged IDE like PyCharm or VS Code.
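A minimal sketch of that pairing with jupytext's Python API (again, file names are hypothetical and jupytext is assumed to be installed):

```python
import jupytext

# Read the notebook (jupytext can read .ipynb or any paired text format)
nb = jupytext.read("analysis.ipynb")

# Write it out as a percent-format Python script you can edit in an IDE
jupytext.write(nb, "analysis.py", fmt="py:percent")

# The script reads back as a notebook, so edits flow in both directions
nb_roundtrip = jupytext.read("analysis.py")
```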

Comment written by Lj Miranda on 03/17/2020 02:57:30

Hi Michael, thanks a lot! Yeah, I admit I haven't used jupytext much.
Thanks for the suggestion; it really does look more powerful! Let me try it out in my next project :)

Comment written by Chris D on 02/08/2021 18:47:13

Thanks for this great analysis.

Can you please elaborate on this?

> However, I highly recommend that for ETL workloads, [...] In most cases, converting Notebooks into Python scripts can incur less tech debt.


I'm asking because most of my ETL is already done in a cloud database like Redshift, and only a few more filtering and shaping steps need to happen in a PySpark notebook in, for example, SageMaker. In that setup, where Airflow would use Papermill to execute a simple PySpark job, what tech debt cautions would you have? The PySpark job would include some hyperparameters by which to shape and filter the data for iterative data science development. Is this a good use case for an Airflow + Papermill + SageMaker data pipeline?
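For concreteness, the kind of step I have in mind is a single parameterized notebook run, something like this hypothetical Papermill call (notebook names and parameters are made up):

```python
import papermill as pm

# Execute the notebook with injected parameters; Papermill substitutes these
# into the cell tagged "parameters" and saves the executed copy.
pm.execute_notebook(
    "filter_and_shape.ipynb",           # hypothetical input notebook
    "runs/filter_and_shape_out.ipynb",  # executed output notebook
    parameters={"min_date": "2021-01-01", "sample_frac": 0.1},
)
```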

Comment written by Lj Miranda on 02/15/2021 05:04:01

Hi Chris D! Thanks for the insightful question!

If you think that the ETL pipeline can easily be converted into a plain Python script (without depending on notebooks), then I'd advise anyone to do so; see the sketch after the list below. By sticking to the papermill approach, you might incur tech debt by:
* Introducing more dependencies (installing papermill, airflow for notebooks, bloating your machine/Docker image, etc.)
* Introducing more points of failure (parameter substitution into the notebook can fail, the notebook might be missing, notebooks are hard to diff, a merge between two branches can go wrong, etc.)
* And a lot more.
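
To make the plain-script alternative concrete, here's a rough, hypothetical sketch where the notebook's parameters become ordinary CLI arguments (names and defaults are made up):

```python
import argparse

def main() -> None:
    # The same knobs the notebook exposed via Papermill, now as CLI flags
    parser = argparse.ArgumentParser(description="Filter and shape the dataset")
    parser.add_argument("--min-date", default="2021-01-01")
    parser.add_argument("--sample-frac", type=float, default=0.1)
    args = parser.parse_args()

    # ...the filtering/shaping logic that used to live in notebook cells...
    print(f"Filtering from {args.min_date}, sampling {args.sample_frac:.0%}")

if __name__ == "__main__":
    main()
```

Airflow can then run this with a plain BashOperator or PythonOperator, with no notebook-specific machinery in between.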

Not sure if tech debt is the correct term, but I'd argue for more simplicity if it's easier to do!
Hope I answered your question!