jupyter-incubator / sparkmagic

Jupyter magics and kernels for working with remote Spark clusters

[BUG] EMR Studio Jupyter Notebook using PySpark kernel references an old version of pip

rankol opened this issue

Describe the bug

I am using a Jupyter notebook provided by the AWS managed service EMR Studio. My understanding is that these notebooks run against EC2 instances that I provision as part of my EMR cluster; with the PySpark kernel, the code executes on the task nodes.

Currently, when I run sc.list_packages() I see that pip is at version 9.0.1, whereas if I SSH onto the master node and run pip list I see pip at version 20.2.2. Because of this older pip version, I have issues running sc.install_pypi_package() in the notebook.
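For clarity, here is a minimal sketch of what I run in a notebook cell with the PySpark kernel. list_packages() and install_pypi_package() are the EMR notebook-scoped library helpers on the SparkContext; "requests" below is just a placeholder package, not the one I actually need.

```python
# Sketch of the commands run in an EMR Studio notebook cell (PySpark kernel).
sc.list_packages()                    # reports pip 9.0.1 in the notebook environment
sc.install_pypi_package("requests")   # placeholder package; fails because pip is too old
```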

If I run import pip and then pip in a notebook cell, I see that the module is located at

<module 'pip' from '/mnt1/yarn/usercache/<LIVY_IMPERSONATION_ROLE>/appcache/application_1652110228490_0001/container_1652110228490_0001_01_000001/tmp/1652113783466-0/lib/python3.7/site-packages/pip/__init__.py'> 

I assume this is inside a virtualenv of some sort created for the YARN application on the task node, but I am not sure; I have no concrete evidence of how (or whether) that virtualenv is provisioned.

If I run sc.uninstall_package('pip') and then sc.list_packages(), pip shows as version 20.2.2, which is what I want to start with in the first place. The module path is the same as previously mentioned.
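For reference, this is roughly the sequence I ran (a sketch of my session, not a recommended fix):

```python
# After removing the notebook-environment pip, list_packages() reports the system pip.
sc.uninstall_package('pip')   # removes pip 9.0.1 from the notebook environment
sc.list_packages()            # now reports pip 20.2.2, at the same module path as before
```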

How can I get pip 20.2.2 in the virtualenv instead of pip 9.0.1?

If I import a package like numpy, I see that its module is located somewhere different from where pip is. Is there a reason for this?

<module 'numpy' from '/usr/local/lib64/python3.7/site-packages/numpy/__init__.py'>
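A quick way to see the difference in the same cell (a sketch of what I checked; the application and container IDs in the pip path will differ per session):

```python
import numpy
import pip

# pip resolves from the per-application container directory, numpy from the system site-packages.
print(pip.__file__)    # /mnt1/yarn/usercache/.../container_.../site-packages/pip/__init__.py
print(numpy.__file__)  # /usr/local/lib64/python3.7/site-packages/numpy/__init__.py
```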

As for pip 9.0.1, the only reference I can find at the moment is /lib/python2.7/site-packages/virtualenv_support/pip-9.0.1-py2.py3-none-any.whl. One directory up there is a file called virtualenv-15.1.0-py2.7.egg-info which, if I cat it, states that it upgrades to pip 9.0.1. I tried removing the pip 9.0.1 wheel and replacing it with a pip 20.2.2 wheel, but that broke the PySpark kernel's ability to provision properly. There is also a virtualenv.py file that references __version__ = "15.1.0".
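For context, this is roughly how I looked for the bundled wheel on the node (run over SSH rather than in a notebook cell; the path is what I observed on my AMI and may differ on other releases):

```python
import glob

# List the support wheels that virtualenv 15.1.0 bundles and installs into new environments.
for whl in sorted(glob.glob("/lib/python2.7/site-packages/virtualenv_support/*.whl")):
    print(whl)   # includes pip-9.0.1-py2.py3-none-any.whl
```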

Expected behavior
I expect the Jupyter notebook to use the up-to-date pip already installed on the system it runs against, but instead it uses an outdated version.

Versions:

  • SparkMagic: unsure of how to find this, but commands like %%info and %%configure work in the notebook cells (see the sketch after this list for how I would try to check)
  • Livy (if you know it): unsure of how to find this
  • Spark: 3.1.2-amzn-0
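In case it helps, this is how I would try to check the sparkmagic version. I am assuming the EMR Studio kernel exposes sparkmagic's %%local magic (so the cell runs in the local Jupyter environment rather than on the cluster), which I have not confirmed:

```python
%%local
# Query the installed sparkmagic distribution in the local (Jupyter-side) environment.
import pkg_resources
print(pkg_resources.get_distribution("sparkmagic").version)
```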

Hi @rankol, thanks for opening an issue! This is an issue with AWS EMR Studio, and you'll have to debug it with AWS support. Please comment any resolution on this issue for future reference 🙌

Hi, I was able to resolve this.

https://stackoverflow.com/a/72453605/13291468