Cadair / parfive

An asyncio based parallel file downloader for Python 3.8+

Home Page: https://parfive.readthedocs.io/

Downloading data with parfive>=2 inside a streamlit app fails with a RuntimeError

AthKouloumvakos opened this issue · comments

Downloading data with parfive>=2.0.0 fails inside a streamlit app with the following error: RuntimeError: set_wakeup_fd only works in main thread of the main interpreter (see the full error in AthKouloumvakos/PyThea#13 (comment)).

To reproduce this with a minimal example:

I create a fresh environment:

conda create --name test python=3.9
conda activate test
pip install "sunpy[all]" streamlit 

Create a test.py with the following contents:

import streamlit as st
import astropy.units as u

from sunpy.net import Fido
from sunpy.net import attrs as a

def download_data(attrs_time):
    result = Fido.search(attrs_time, a.Instrument.eit)
    st.write(result)
    downloaded_files = Fido.fetch(result)
    return downloaded_files

st.title('Download some fits files app')

if st.button('Download with Fido'):
    downloaded_files = download_data(a.Time('2005/01/01 00:10', '2005/01/01 00:15'))
    st.write(downloaded_files)

and run it from the command line with streamlit run test.py

The script loads the streamlit app in the browser and, when the 'Download with Fido' button is pressed, downloads some FITS files using sunpy's Fido.

This is when the error RuntimeError: set_wakeup_fd only works in main thread of the main interpreter occurs.

The full error is:

RuntimeError: set_wakeup_fd only works in main thread of the main interpreter
File "/Users/opt/anaconda3/envs/test/lib/python3.9/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 565, in _run_script
    exec(code, module.__dict__)
File "/Users/Desktop/myPython/test.py", line 20, in <module>
    downloaded_files = download_data(a.Time('2005/01/01 00:10', '2005/01/01 00:15'))
File "/Users/Desktop/myPython/test.py", line 14, in download_data
    downloaded_files = Fido.fetch(result)
File "/Users/opt/anaconda3/envs/test/lib/python3.9/site-packages/sunpy/net/fido_factory.py", line 438, in fetch
    results = downloader.download()
File "/Users/opt/anaconda3/envs/test/lib/python3.9/site-packages/parfive/downloader.py", line 330, in download
    return self._run_in_loop(self.run_download())
File "/Users/opt/anaconda3/envs/test/lib/python3.9/site-packages/parfive/downloader.py", line 249, in _run_in_loop
    self._add_shutdown_signals(loop, task)
File "/Users/opt/anaconda3/envs/test/lib/python3.9/site-packages/parfive/downloader.py", line 226, in _add_shutdown_signals
    loop.add_signal_handler(sig, task.cancel)
File "/Users/opt/anaconda3/envs/test/lib/python3.9/asyncio/unix_events.py", line 97, in add_signal_handler
    raise RuntimeError(str(exc))

It seems that the call loop.add_signal_handler(sig, task.cancel) fails. This happens only when I run the code inside streamlit.

When I work around the problem like this:

try:
    loop.add_signal_handler(sig, task.cancel)
except RuntimeError:
    return

the FITS files download normally.

My sunpy.system_info() output is:

General
#######
OS: Mac OS 13.3.1
Arch: 64bit, (i386)
sunpy: 4.1.5
Installation path: /Users/opt/anaconda3/envs/test/lib/python3.9/site-packages/sunpy-4.1.5.dist-info

Required Dependencies
#####################
astropy: 5.2.2
numpy: 1.24.3
packaging: 23.1
parfive: 2.0.2

Optional Dependencies
#####################
asdf: 2.15.0
asdf-astropy: 0.4.0
beautifulsoup4: 4.12.2
cdflib: 0.4.9
dask: 2023.5.0
drms: 0.6.3
glymur: 0.12.5
h5netcdf: 1.1.0
h5py: 3.8.0
lxml: 4.9.2
matplotlib: 3.7.1
mpl-animators: 1.1.0
pandas: 2.0.1
python-dateutil: 2.8.2
reproject: 0.10.0
scikit-image: 0.20.0
scipy: 1.9.1
sqlalchemy: 2.0.13
tqdm: 4.65.0
zeep: 4.2.1

Without looking into it too much, I would assume streamlit is running the whole app in a thread pool of some kind, which is causing the error.

I would be happy to accept a fix which turns the RuntimeError into a warning inside _add_shutdown_signals. It should be a warning because without those handlers it's impossible to cancel a running download (e.g. by interrupting the kernel).

The only snag is that there's no automated testing for the behaviour in a notebook (which relies on threads itself), so we would need to make sure everything there is working as expected. You would get a lot of bonus points if you could figure out a way to reproduce this without streamlit (e.g. by running a script in a thread pool executor), because then we could add it to our test suite.
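
For illustration, turning the RuntimeError into a warning could look roughly like the sketch below. It is a standalone function rather than parfive's actual method: the name add_shutdown_signals_or_warn, its return value, and the warning text are all hypothetical, and it assumes a Unix-like platform where loop.add_signal_handler is implemented.

```python
import asyncio
import signal
import threading
import warnings


def add_shutdown_signals_or_warn(loop, callback):
    """Hypothetical helper: register SIGINT/SIGTERM handlers, warn on failure."""
    for sig in (signal.SIGINT, signal.SIGTERM):
        try:
            loop.add_signal_handler(sig, callback)
        except RuntimeError as exc:
            # asyncio raises RuntimeError when this is called off the main thread.
            warnings.warn(f"Shutdown signal handlers were not added: {exc}")
            return False
    return True


def worker(outcome):
    # Runs in a non-main thread, which is assumed to be what streamlit does.
    loop = asyncio.new_event_loop()
    try:
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            outcome["installed"] = add_shutdown_signals_or_warn(loop, lambda: None)
            outcome["warnings"] = [str(w.message) for w in caught]
    finally:
        loop.close()


outcome = {}
thread = threading.Thread(target=worker, args=(outcome,))
thread.start()
thread.join()
print(outcome)
```

Catching the exception rather than checking the thread up front also covers any other situation in which asyncio refuses to install the handler, at the cost of a less specific warning.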

This sounds great, and thank you for your comments; they helped me a lot.

I think I have made progress on this. Here is what I did:

First, I tried to reproduce this without streamlit. This was actually easier than I expected ;p

This can be done with the following logic:

import threading
from sunpy.net import Fido
from sunpy.net import attrs as a

def download_data():
    attrs_time = a.Time('2005/01/01 00:10', '2005/01/01 00:15')
    result = Fido.search(attrs_time, a.Instrument.eit)
    downloaded_files = Fido.fetch(result)
    return downloaded_files

def download_function():
    if threading.current_thread() != threading.main_thread():
        print("This code is running in a different thread.")
    else:
        print("This code is running in the main thread.")
    downloaded_files = download_data()

def main():
    # This part will run in the main thread.
    download_function()

    # This part will run in a different thread.
    thread = threading.Thread(target=download_function)  # Create a new thread
    thread.start()
    thread.join()

if __name__ == "__main__":
    main()

I use threading to run part of the code outside the main thread. The first call to download_function() runs in the main thread, whereas the second runs in a separate thread. Therefore, the first call downloads the data normally, while the second results in a RuntimeError: set_wakeup_fd only works in main thread of the main interpreter.
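
The same failure can probably be isolated without sunpy or any network access, which may suit a test suite better. The following is a minimal sketch, assuming a Unix-like platform; register_sigint_handler is a hypothetical stand-in for the signal registration that the downloader performs, and a thread pool executor is used to guarantee the call happens outside the main thread.

```python
import asyncio
import signal
from concurrent.futures import ThreadPoolExecutor


def register_sigint_handler():
    """Hypothetical stand-in: try to register a SIGINT handler on a new loop.

    Returns None on success, or the error message if asyncio refuses.
    """
    loop = asyncio.new_event_loop()
    try:
        loop.add_signal_handler(signal.SIGINT, lambda: None)
        loop.remove_signal_handler(signal.SIGINT)
        return None
    except RuntimeError as exc:
        return str(exc)
    finally:
        loop.close()


# Submit to a worker thread so the callable never runs in the main thread.
with ThreadPoolExecutor(max_workers=1) as pool:
    error_in_worker = pool.submit(register_sigint_handler).result()

print(error_in_worker)
```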

To resolve the issue, the best I have so far is the following:

I changed the _add_shutdown_signals static method of the parfive Downloader class to the following:

    @staticmethod
    def _add_shutdown_signals(loop, task):
        # Requires os, signal, threading and warnings to be imported in the module.
        if os.name == "nt":
            return

        # Signal handlers can only be registered from the main thread.
        if threading.current_thread() != threading.main_thread():
            warnings.warn("Signal handlers only work in the main thread of the main interpreter, so the handlers to interrupt the kernel have not been added.")
            return

        for sig in (signal.SIGINT, signal.SIGTERM):
            loop.add_signal_handler(sig, task.cancel)

With this change there is an Exception ignored in: <function BaseEventLoop.__del__ at 0x1325d2820> message, but the data download without an issue. Inside streamlit I do not even get this exception.

I think I can use the first part to make a working test and the second to work around the issue when the function is called outside the main thread. What do you think?
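
As a rough idea of how the two parts might combine into a test, the sketch below copies the guarded logic into a standalone function (it is not parfive's code) and runs it in a worker thread with a placeholder coroutine, fake_download, instead of a real download. All names other than the standard-library calls are hypothetical, and a Unix-like platform is assumed, since the guard returns early without a warning on Windows.

```python
import asyncio
import os
import signal
import threading
import warnings


def add_shutdown_signals(loop, task):
    """Standalone copy of the guarded logic, for illustration only."""
    if os.name == "nt":
        return
    if threading.current_thread() is not threading.main_thread():
        warnings.warn("Signal handlers were not added outside the main thread.")
        return
    for sig in (signal.SIGINT, signal.SIGTERM):
        loop.add_signal_handler(sig, task.cancel)


async def fake_download():
    # Placeholder for a real download coroutine.
    await asyncio.sleep(0)
    return "done"


def run_in_worker(report):
    loop = asyncio.new_event_loop()
    try:
        task = loop.create_task(fake_download())
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            add_shutdown_signals(loop, task)
        report["warned"] = len(caught) == 1
        # The task should still complete even though no handlers were added.
        report["result"] = loop.run_until_complete(task)
    finally:
        loop.close()


report = {}
worker = threading.Thread(target=run_in_worker, args=(report,))
worker.start()
worker.join()
print(report)
```

In a real test suite the same check could be expressed with the project's test framework, with its warning-capturing helper replacing warnings.catch_warnings.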

This sounds great! Please go ahead and open a PR and we can make any tweaks we need to there.

I have opened a PR for this: #133