nteract / papermill

📚 Parameterize, execute, and analyze notebooks

Home Page:http://papermill.readthedocs.io/en/latest/

SparkMagic pyspark kernel magic(%%sql) hangs when running with Papermill.

edwardps opened this issue

🐛 Bug

Hi,

Issue

Our use case is running SparkMagic wrapper kernels from a Papermill job.
Most functionality works as expected, except the %%sql magic, which hangs during execution. SparkMagic works properly when executed interactively in JupyterLab; the issue only occurs for the %%sql magic when running with Papermill.

From the debugging log (attached), I can see that the %%sql logic was executed and the response was retrieved. The execution state returned to idle at the end. However, the output of the %%sql cell was not updated properly, and the following cells were not executed.

The following content was printed by Papermill, which shows that the %%sql magic was executed properly. This content was not rendered into the cell output.

msg_type: display_data
content: {'data': {'text/plain': '<IPython.core.display.HTML object>', 'text/html': '\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n[table markup]\n'}, 'metadata': {}, 'transient': {}}

The table carried in the text/html payload:

  database      tableName  isTemporary
0  default  movie_reviews        False

Output notebook of papermill:
image

Expected output (from JupyterLab):
image
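
Because the two screenshots do not carry over as text, the same symptom can be checked directly in the output notebook file. The following is a minimal sketch, assuming the standard nbformat v4 JSON layout (`cells`, `cell_type`, `execution_count`, `outputs`). The helper names `load_notebook` and `summarize_cells` and the inline sample notebook are made up for illustration; the sample only mimics the symptom described above and is not the attached `output1.ipynb`.

```python
import json


def load_notebook(path):
    """Read a .ipynb file (plain JSON) into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize_cells(nb):
    """Return one summary dict per code cell in the notebook."""
    summary = []
    for index, cell in enumerate(nb.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # nbformat allows source to be a string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        first_line = source.splitlines()[0] if source.strip() else ""
        summary.append({
            "index": index,
            "first_line": first_line,
            "execution_count": cell.get("execution_count"),
            "n_outputs": len(cell.get("outputs", [])),
        })
    return summary


# Hypothetical sample that mimics the symptom: the %%sql cell has no
# outputs and the cell after it was never executed.
sample = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "code", "source": "%%sql\nshow tables",
         "execution_count": None, "outputs": [], "metadata": {}},
        {"cell_type": "code", "source": ["print('after sql')"],
         "execution_count": None, "outputs": [], "metadata": {}},
    ],
}

for row in summarize_cells(sample):
    print(row)
```

To inspect a real run, replace `sample` with `load_notebook("output1.ipynb")`. A code cell with `n_outputs` of 0 followed by cells whose `execution_count` is `None` would match the behavior reported here.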

Reproducing steps

conda create --name py310 python=3.10
conda activate py310

pip install sparkmagic
pip install papermill

# install kernelspecs
SITE_PACKAGES_LOC=$(pip show sparkmagic | grep Location | awk '{print $2}')
cd $SITE_PACKAGES_LOC
jupyter-kernelspec install sparkmagic/kernels/sparkkernel --user
jupyter-kernelspec install sparkmagic/kernels/pysparkkernel --user
jupyter-kernelspec install sparkmagic/kernels/sparkrkernel --user

jupyter nbextension enable --py --sys-prefix widgetsnbextension 

# Downgrade notebook from 7.0.3 to 6.5.1 due to ModuleNotFoundError: No module named 'notebook.utils'
pip install notebook==6.5.1

# Run the papermill job (the notebook is also uploaded)
# Before running this, an EMR cluster and a sparkmagic configuration are needed.
# If it's not possible/easy to create one, please comment with any testing/verification needed and I can help. You can also check the uploaded papermill debugging log.
papermill pm_sparkmagic_test.ipynb output1.ipynb --kernel pysparkkernel  --log-level DEBUG 
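
Since the job hangs rather than fails, it can help to launch the same command under a hard time limit while reproducing. Below is a minimal sketch using only the Python standard library. The helper names `build_papermill_command` and `run_with_timeout` are hypothetical, and the demo at the bottom runs a harmless sleeping child process instead of papermill so that it works without an EMR cluster.

```python
import subprocess
import sys


def build_papermill_command(input_nb, output_nb, kernel, log_level="DEBUG"):
    """Build the same CLI invocation as in the reproducing steps."""
    return ["papermill", input_nb, output_nb,
            "--kernel", kernel, "--log-level", log_level]


def run_with_timeout(command, timeout_seconds):
    """Run a command; return its exit code, or None if it timed out."""
    try:
        completed = subprocess.run(command, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return None
    return completed.returncode


command = build_papermill_command(
    "pm_sparkmagic_test.ipynb", "output1.ipynb", "pysparkkernel")
print("would run:", " ".join(command))

# Stand-in for a hanging job: a child that sleeps longer than the limit.
hanging = [sys.executable, "-c", "import time; time.sleep(5)"]
print("exit code:", run_with_timeout(hanging, timeout_seconds=0.5))
```

With a real environment, `run_with_timeout(command, 600)` would stop the papermill process after ten minutes. Note that only the direct child is killed on timeout, so a kernel process started by papermill may need separate cleanup.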

The following packages are the most likely to be relevant. I have also attached a text file that lists all installed packages.

pip list | grep 'papermill\|sparkmagic\|autovizwidget\|hdijupyterutils\|ipykernel\|ipython\|ipywidgets\|mock\|nest-asyncio\|nose\|notebook\|numpy\|pandas\|requests\|requests-kerberos\|tornado\|ansiwrap\|click\|entrypoints\|nbclient\|nbformat\|pyyaml\|requests\|tenacity\|tqdm\|jupyter\|ipython'|sort
ansiwrap                  0.8.4
autovizwidget             0.20.5
click                     8.1.7
entrypoints               0.4
hdijupyterutils           0.20.5
ipykernel                 6.25.2
ipython                   8.15.0
ipython-genutils          0.2.0
ipywidgets                8.1.0
jupyter                   1.0.0
jupyter_client            8.3.1
jupyter-console           6.6.3
jupyter_core              5.3.1
jupyter-events            0.7.0
jupyterlab                4.0.5
jupyterlab-pygments       0.2.2
jupyterlab_server         2.24.0
jupyterlab-widgets        3.0.8
jupyter-lsp               2.2.0
jupyter_server            2.7.3
jupyter_server_terminals  0.4.4
nbclient                  0.8.0
nbformat                  5.9.2
nest-asyncio              1.5.5
notebook                  6.5.1
notebook_shim             0.2.3
numpy                     1.25.2
pandas                    1.5.3
papermill                 2.4.0
requests                  2.31.0
requests-kerberos         0.14.0
sparkmagic                0.20.5
tenacity                  8.2.3
tornado                   6.3.3
tqdm                      4.66.1
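
For anyone comparing environments, the same information can be collected from inside Python, which avoids differences in `grep` behavior across shells. This is a small sketch using the standard library's `importlib.metadata`; the helper name `collect_versions` and the shortened package list are chosen here for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# A subset of the packages listed above; extend as needed.
PACKAGES = [
    "papermill", "sparkmagic", "nbclient", "nbformat",
    "ipykernel", "jupyter_client", "notebook", "tornado",
]


def collect_versions(names):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, installed in sorted(collect_versions(PACKAGES).items()):
    print(f"{name:<18}{installed or 'not installed'}")
```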

log and other files.zip contains:

  1. log - papermill debugging log
  2. my_test_env_requirements.txt - full list of packages in the conda env
  3. pm_sparkmagic_test.ipynb - the notebook executed in JupyterLab, which is also the input of the papermill job
  4. output1.ipynb - output notebook from the papermill job

I am not sure whether this is an issue in Papermill or in SparkMagic. I will also copy this issue to SparkMagic in case an expert there can provide advice. If it is confirmed to be caused by SparkMagic, please feel free to close this issue.
Thanks in advance.

Adding more details.
According to the SparkMagic owner, this seems to be caused by the asynchronous nature of the execute method in the SQLQuery class.
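
To illustrate the general kind of problem an asynchronous `execute` can cause, here is a self-contained sketch. It is an assumption-laden toy, not SparkMagic's code: `FakeSQLQuery` is a hypothetical stand-in, and the actual root cause is still being investigated in the linked thread. The sketch only shows that calling an `async` method from synchronous code without awaiting it creates a coroutine and runs none of its body, so any output it would have produced never appears.

```python
import asyncio
import inspect


class FakeSQLQuery:
    """Hypothetical stand-in; NOT SparkMagic's SQLQuery class."""

    def __init__(self):
        self.displayed = []

    async def execute(self):
        await asyncio.sleep(0)  # stands in for the remote round trip
        self.displayed.append("result table")
        return "done"


def call_without_awaiting(query):
    """Call execute() like a normal function; its body never runs."""
    result = query.execute()
    was_coroutine = inspect.iscoroutine(result)
    result.close()  # discard it explicitly to avoid a RuntimeWarning
    return was_coroutine


def call_and_wait(query):
    """Drive the coroutine to completion from synchronous code."""
    return asyncio.run(query.execute())


first = FakeSQLQuery()
print(call_without_awaiting(first), first.displayed)   # True []

second = FakeSQLQuery()
print(call_and_wait(second), second.displayed)         # done ['result table']
```

Whether the caller awaits the result, and on which event loop, can differ between an interactive front end and a headless executor, which is one plausible reason the same cell could behave differently in JupyterLab and in Papermill.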

Troubleshooting is ongoing in the following thread, FYI:
jupyter-incubator/sparkmagic#833 (comment)