ljvmiranda921 / comments.ljvmiranda921.github.io

Blog comments for my personal blog: ljvmiranda921.github.io

How to use Jupyter Notebooks in 2020 (Part 1: The data science landscape)

ljvmiranda921 opened this issue · comments

Comment written by Tudor Lapusan on 03/11/2020 08:42:11

Hi Lj,
Well-organized article!

To scale our data processing, we can also use Spark from Jupyter notebooks. There is also an early project, Koalas, which lets you run Spark jobs using the Pandas API: https://github.com/databric....

I would like to see more support from version control tools in the future; right now it is a real challenge (or outright impossible) to do a code review on a notebook or to resolve a merge conflict.

When I'm working on a Python data project (data analysis, experiments, ML) that will eventually need to be deployed to production, I like to combine Jupyter notebooks and Python modules (in a PyCharm project). I write as much code as possible as functions in Python modules, then import those modules in notebooks and use the functions. This way, the main code already lives in Python modules, which are more ready for deployment.
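A rough sketch of that module-plus-notebook pattern (the module, function, and file names here are hypothetical, just to illustrate the idea):

```python
# features.py -- a plain Python module holding the reusable logic
import pandas as pd

def add_ratio_feature(df: pd.DataFrame, numerator: str, denominator: str) -> pd.DataFrame:
    """Return a copy of df with an extra ratio column, keeping notebooks free of logic."""
    out = df.copy()
    out[f"{numerator}_per_{denominator}"] = out[numerator] / out[denominator]
    return out

# --- in a notebook cell, you would then just import and call it ---
# from features import add_ratio_feature
# df = pd.read_csv("data/raw.csv")          # hypothetical path
# df = add_ratio_feature(df, "clicks", "impressions")
```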

Comment written by Robert Lacok on 03/12/2020 15:15:38

Nice blog post, Lj! I saw it referenced in the Tracking Jupyter newsletter.
I totally agree with the 3 directions you described.

Regarding the support for the dev workflow: do you think the core reason why today's experiment code cannot easily be reused in production is that the tools aren't mature enough yet?
How can the tools better support this transition?

Comment written by Lj Miranda on 03/12/2020 16:19:08

Hi Tudor,

Thank you so much!

That's interesting; on our end we use a lot of Dataflow and BigQuery (all Google Cloud Platform products). Personally, if I want to run some jobs through my notebooks, I'd parametrize them with papermill and run them on a server (or a Kubernetes cluster).
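A minimal sketch of that papermill approach (the notebook names and parameters below are hypothetical, just to show the shape of the call):

```python
import papermill as pm

# Execute a parametrized notebook; these values override the cell tagged "parameters"
pm.execute_notebook(
    "train_model.ipynb",             # hypothetical input notebook
    "output/train_model_run.ipynb",  # executed copy, handy as a record of the run
    parameters={"learning_rate": 0.01, "n_estimators": 200},
)
```

The same notebook can then be triggered from a server or a scheduled job without changing its contents.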

I'll be talking more about my workflow in my second post, but I also do the same thing: put them into modules so that I can reuse them anywhere. What I also found useful is a standard project structure that all members of the data science team adhere to. On my end, cookiecutter-datascience works right out of the box.

For merging and diffing, I recommend nbdime/nbstripout! For version control, I'd recommend DVC! Stay tuned! I'll try to talk about them in depth!

Comment written by Lj Miranda on 03/12/2020 16:29:20

Hi Robert, thank you so much!

If maturity is defined as a stable codebase, many proven use-cases, etc., then I think we're starting to see production tools for notebooks: Netflix's use of papermill comes to mind. In addition, we also have things like dagster for an Airflow-like experience with notebooks. I think tools are just one part of the solution; there's also a bit of culture involved...

I think it's fair to say that notebooks aren't well-loved, especially by people coming from a software engineering background (not everyone!). If internal teams within a company can build production/platform support for notebooks, then we'll see more and more use-cases for production notebooks and create a virtuous cycle that fuels the transition to prod.

I guess that's it: I feel that we just need more successful use-cases to help in the transition. As for tools, we can keep building stuff to fill in the gaps that are still missing.

Hope it helps! Planning to write about this as well in Part 3!

Comment written by Sue Werner on 03/13/2020 00:37:43

Oh that's great, I was hoping that you'd discuss merging / diffing / version control, as it's become more and more of an issue even in my (very small, research-oriented) team. Great post.

Comment written by Lj Miranda on 03/16/2020 01:42:12

Hi Sue Werner, I've touched upon how to incorporate Notebooks in the developer workflow in my second post. You can find it here: https://ljvmiranda921.githu...

Comment written by Artem Aleksieiev on 03/22/2020 20:26:56

Hi Lj!
Thank you for the first part!
I have a beginner question. I recently started studying machine learning and am trying to build and deploy different ML models using AWS. But I found that serving models that way (via endpoints in a Flask app) is not free. What is the correct technique for serving an ML model over the web?

Comment written by Artem Aleksieiev on 03/23/2020 17:07:03

Hello

Comment written by Lj Miranda on 04/26/2020 23:25:17

Hi Artem! I think you're on the right track writing a Flask application and deploying it via AWS. Deploying to these cloud providers is unfortunately not free, but you can probably reduce costs by trying out serverless architectures (you pay only when someone makes a request, and there's no need to keep a VM running all the time). In AWS you have Lambda and Fargate; in GCP you have Cloud Run and Cloud Functions.

Of course, you also need to check how big your model is so that these services can accommodate it, but at least that's a start!
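As a rough sketch of what that serverless setup could look like on AWS Lambda (the model file and feature names are hypothetical, not something from the post):

```python
import json
import pickle

# Load the model once, outside the handler, so warm invocations can reuse it
with open("model.pkl", "rb") as f:  # hypothetical model artifact packaged with the function
    MODEL = pickle.load(f)

def handler(event, context):
    """AWS Lambda entry point: parse the request body and return a prediction."""
    body = json.loads(event.get("body", "{}"))
    features = [body["feature_1"], body["feature_2"]]  # hypothetical feature names
    prediction = MODEL.predict([features])[0]
    return {
        "statusCode": 200,
        "body": json.dumps({"prediction": float(prediction)}),
    }
```

Behind an API Gateway trigger, you'd only pay per request instead of for an always-on server.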

Comment written by Lj Miranda on 04/26/2020 23:26:11

Hi! I think your previous question got stuck in Disqus spam, so I wasn't able to get to it right away. I just replied now (see below) :)