ipython / ipyparallel

IPython Parallel: Interactive Parallel Computing in Python

Home Page: https://ipyparallel.readthedocs.io/


sync_imports not working as intended

WernerFS opened this issue · comments

Hello,

I am working with TLJH (The Littlest JupyterHub). Previously, the code worked fine, but now I get errors when performing imports on the remote hosts:

ipyparallel==8.6.1

import ipyparallel as ipp

engines = 1

cluster = ipp.Cluster(profile="ssh", n=engines)  # profile is shorthand for the profile directory (profile_<name>)
rc = cluster.start_and_connect_sync()

dview = rc[:]

with dview.sync_imports():
    import numpy

Return:

[Engine Exception]:
Traceback (most recent call last):

  File "/opt/tljh/user/lib/python3.10/site-packages/ipyparallel/client/client.py", line 885, in _handle_stranded_msgs
    raise error.EngineError(

ipyparallel.error.EngineError: Engine 0 died while running task 'c965c310-bca08b3ee2a0a3c1be6caa3f_59412_1'
fetching /tmp/tmpgjrkj09r/ipengine-1700667075.2026.out from user@192.168.0.6:.ipython/profile_ssh/log/ipengine-1700667075.2026.out
Removing user@192.168.0.6:.ipython/profile_ssh/log/ipengine-1700667075.2026.out
engine set stopped 1700667068: {'engines': {'user@192.168.0.6/0': {'exit_code': -1, 'pid': 2754, 'identifier': 'user@192.168.0.6/0'}}, 'exit_code': -1}
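For context, what sync_imports does on each engine is roughly to replay the import statement via __import__ and bind the result into the engine's globals. A minimal local sketch of that mechanic (illustrative only; replay_import is a made-up name, not ipyparallel's actual remote_import implementation):

```python
def replay_import(name, fromlist=()):
    """Mimic how an import statement can be replayed on a remote engine:
    'import numpy'             -> __import__('numpy')
    'from numpy import random' -> __import__('numpy', fromlist=['random'])
    Returns the name bindings the statement would create."""
    mod = __import__(name, fromlist=list(fromlist))
    if fromlist:
        # 'from X import a, b' binds the listed attributes
        return {attr: getattr(mod, attr) for attr in fromlist}
    # plain 'import a.b.c' binds only the top-level package name
    top = name.split(".")[0]
    return {top: __import__(top)}

bindings = replay_import("os", fromlist=["path"])
print(sorted(bindings))  # ['path']
```

Using the stdlib os module here keeps the sketch dependency-free; the same mechanic applies to numpy.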

However, the error changes when the import statement changes; this time it fails inside the remote_import() call:

import ipyparallel as ipp

engines = 1

cluster = ipp.Cluster(profile="ssh", n=engines)  # profile is shorthand for the profile directory (profile_<name>)
rc = cluster.start_and_connect_sync()

dview = rc[:]

with dview.sync_imports():
    from numpy import random

Return:

importing random from numpy on engine(s)
[0:apply]:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File <string>:1

File /opt/tljh/user/lib/python3.10/site-packages/ipyparallel/client/view.py:437, in remote_import(name, fromlist, level)

TypeError: 'str' object is not callable

As you can probably guess, this is an issue with my setup: it works fine on local clusters. I checked the ipyparallel version on both the remote and the host; both are 8.6.1.

The host runs Python 3.10.12 and the remote is on 3.9.2. My first guess was that the globals() call inside the remote_import() function behaved differently in newer Python versions. Sadly, that is not the case.
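A quick way to rule out an interpreter mismatch is to compare the client's (major, minor) version with what each engine reports. The helper below is a local sketch (mismatched_engines is a made-up name); the commented lines show how one might gather engine versions from a live client rc, untested against a real cluster:

```python
import sys

def mismatched_engines(local, remote_versions):
    """Return the engine ids whose (major, minor) Python version
    differs from the client's.
    remote_versions: mapping of engine id -> (major, minor)."""
    return sorted(
        eid for eid, ver in remote_versions.items() if tuple(ver) != tuple(local)
    )

# against a live cluster, versions could be gathered roughly like this:
# remote = dict(enumerate(rc[:].apply_sync(lambda: __import__("sys").version_info[:2])))
# print(mismatched_engines(sys.version_info[:2], remote))

# illustrative data: engine 0 on 3.9 while the client runs 3.10
print(mismatched_engines((3, 10), {0: (3, 9), 1: (3, 10)}))  # [0]
```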

I will keep posting my findings in case this is relevant to anyone.

More on my implementation:

TLJH:

Linux 6.2.0-36-generic #37~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC x86_64 x86_64 x86_64 GNU/Linux
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.3 LTS
Release:        22.04
Codename:       jammy

Remote nodes:

Linux hyperfpga-3be11-3-2 5.15.36-xilinx-v2022.2 #1 SMP aarch64 GNU/Linux
Distributor ID: Debian
Description:    Debian GNU/Linux 11 (bullseye)
Release:        11
Codename:       bullseye

To debug the issue, I went back to the toy experiments to confirm that the failure only happens when sync_imports() is executed.
The [Load balanced map and parallel function decorator](https://github.com/ipython/ipyparallel/blob/main/docs/source/examples/Parallel%20Decorator%20and%20map.ipynb) example works as expected.

It succeeds when submitting tasks and retrieving the results:

Submitted tasks, got ids:  ['4aae4be3-0e1d348191cd935771cece3f_76913_11', '4aae4be3-0e1d348191cd935771cece3f_76913_12', '4aae4be3-0e1d348191cd935771cece3f_76913_13', '4aae4be3-0e1d348191cd935771cece3f_76913_14', '4aae4be3-0e1d348191cd935771cece3f_76913_15', '4aae4be3-0e1d348191cd935771cece3f_76913_16', '4aae4be3-0e1d348191cd935771cece3f_76913_17', '4aae4be3-0e1d348191cd935771cece3f_76913_18', '4aae4be3-0e1d348191cd935771cece3f_76913_19', '4aae4be3-0e1d348191cd935771cece3f_76913_20']
Using a mapper:  [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

This also works fine:

@dview.parallel(block=True)
def df(x):
    import numpy
    return x * numpy.random.randint(100)

result = df.map(range(10))
print("Using a parallel function in direct view: ", result)

This may work as an alternative to sync_imports().

I believe the problem is that on my remote there are two Python installations behind the python3 name. This is silly but easily solved by specifying which Python binary the engines should use in the ipcluster_config.py file.

c.SSHEngineSetLauncher.remote_python = "/usr/bin/python3.9"

Don't be dumb like me: point your remote_python at a virtual environment's interpreter, or use conda.
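For reference, a minimal ipcluster_config.py sketch for the SSH profile (the path and profile location here are placeholders for your own setup):

```python
# ~/.ipython/profile_ssh/ipcluster_config.py
c = get_config()  # noqa -- provided by IPython's config loader

# Pin the exact interpreter the SSH launcher runs on the engines,
# so the engine-side Python matches the one ipyparallel was installed for.
c.SSHEngineSetLauncher.remote_python = "/usr/bin/python3.9"
```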

Glad you figured it out! IPython Parallel's code serialization isn't stable across different Python versions. It may work sometimes, but won't in general. If you use cloudpickle (cluster[:].use_cloudpickle()), it might be more reliable. But I think that approach also means things like sync_imports won't work, because it changes how globals are resolved.
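To illustrate why serialization is fragile across interpreters: standard pickle ships functions by reference (module plus qualified name), so the receiving side must be able to import identical code, whereas cloudpickle serializes the code object itself. A small local demonstration of the by-reference behavior (this does not use ipyparallel):

```python
import pickle

# Plain pickle serializes functions *by reference*: the payload is just
# "module + qualified name", so the receiving interpreter must be able to
# import the same object -- fragile when client and engines differ.
payload = pickle.dumps(len)  # builtins.len, stored by name only
print(b"len" in payload)     # True: only the reference is shipped

# An interactively defined lambda has no importable qualified name,
# so plain pickle refuses it; cloudpickle would ship its bytecode instead.
try:
    pickle.dumps(lambda x: x + 1)
except (pickle.PicklingError, AttributeError, TypeError) as err:
    print("not picklable:", type(err).__name__)
```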

I actually wrote that comment while waiting for the test to finish; I was overly optimistic. Sadly, my implementation is way more broken than I imagined.

I got around the imports by using the %px import magic. The issue now is that traceback formatting is breaking on the engines.

Traceback (most recent call last):
  File "/tmp/ipykernel_99099/659900893.py", line 11, in calculate_solutions_fpga
TypeError: 'enumerate' object is not callable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/mlabadm/.local/lib/python3.9/site-packages/ipyparallel/engine/kernel.py", line 199, in do_apply
    exec(code, shell.user_global_ns, shell.user_ns)
  File "<string>", line 1, in <module>
  File "/opt/tljh/user/lib/python3.10/site-packages/ipyparallel/client/remotefunction.py", line 148, in <lambda>
    _map = lambda f, *sequences: list(map(f, *sequences))
  File "/tmp/ipykernel_99099/659900893.py", line 24, in calculate_solutions_fpga
TypeError: 'traceback' object is not callable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/mlabadm/.local/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 2057, in showtraceback
    stb = self.InteractiveTB.structured_traceback(
  File "/home/mlabadm/.local/lib/python3.9/site-packages/IPython/core/ultratb.py", line 1288, in structured_traceback
    return FormattedTB.structured_traceback(
  File "/home/mlabadm/.local/lib/python3.9/site-packages/IPython/core/ultratb.py", line 1177, in structured_traceback
    return VerboseTB.structured_traceback(
  File "/home/mlabadm/.local/lib/python3.9/site-packages/IPython/core/ultratb.py", line 1049, in structured_traceback
    formatted_exceptions += self.format_exception_as_a_whole(etype, evalue, etb, lines_of_context,
  File "/home/mlabadm/.local/lib/python3.9/site-packages/IPython/core/ultratb.py", line 935, in format_exception_as_a_whole
    self.get_records(etb, number_of_lines_of_context, tb_offset) if etb else []
  File "/home/mlabadm/.local/lib/python3.9/site-packages/IPython/core/ultratb.py", line 1003, in get_records
    lines, first = inspect.getsourcelines(etb.tb_frame)
  File "/usr/lib/python3.9/inspect.py", line 1006, in getsourcelines
    lines, lnum = findsource(object)
  File "/usr/lib/python3.9/inspect.py", line 827, in findsource
    raise OSError('source code not available')
OSError: source code not available
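As an aside on those first two errors: a message of the form 'X' object is not callable means the name has been rebound to an instance of X, either directly in the user code or as a side effect of globals getting mangled in transit. A minimal local reproduction of the message (unrelated to ipyparallel itself):

```python
# "'enumerate' object is not callable" means the *name* enumerate no
# longer refers to the builtin: something rebound it to an instance.
enumerate = enumerate([1, 2, 3])  # shadows the builtin with an instance
try:
    enumerate([4, 5, 6])          # now calls the instance -> TypeError
except TypeError as err:
    print(err)  # 'enumerate' object is not callable
```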

At this point, I suspect it may be better to start from scratch. Nonetheless, any suggestion is welcome.

Please, please, please: do yourself a favor and just use the same Python version on your remote engines and your host.

Yeah, in general cross-version setups aren't supported, though they might work. We could probably add more visible warnings when that situation is detected.