Tasks fail when running an agent inside Notebook 2: Remote Agent
sobek1886 opened this issue · comments
Describe the bug
Hi. I'm going through the getting started Colab notebooks. I ran into an issue when trying to execute tasks using an agent set up through the 2nd tutorial notebook.
I start the agent using
!clearml-agent daemon --queue "default" --foreground
Then, every time I enqueue a task, it fails. Example terminal output:
Executing task id [bfd4a833c8ba497794ecda6b4332b3aa]:
repository =
branch =
version_num =
tag =
docker_cmd =
entry_point = colab_kernel_launcher.py
working_dir = .
::: Using Cached environment /root/.clearml/venvs-cache/6323bc2138a003ed039770fdc7da9483.b0b672ff6952198853ffbbe536f4e324 :::
Adding venv into cache: /root/.clearml/venvs-builds/3.10
Running task id [bfd4a833c8ba497794ecda6b4332b3aa]:
[.]$ /root/.clearml/venvs-builds/3.10/bin/python -u /root/.clearml/venvs-builds/3.10/code/colab_kernel_launcher.py
Summary - installed python packages:
pip:
- asttokens==2.4.1
- attrs==23.2.0
- cachetools==5.3.2
- certifi==2024.2.2
- charset-normalizer==3.3.2
- clearml==1.14.3
- Cython==3.0.8
- decorator==5.1.1
- exceptiongroup==1.2.0
- executing==2.0.1
- furl==2.1.3
- google-api-core==2.17.0
- google-auth==2.27.0
- google-cloud-core==2.4.1
- google-cloud-storage==2.8.0
- google-crc32c==1.5.0
- google-resumable-media==2.7.0
- googleapis-common-protos==1.62.0
- idna==3.6
- ipykernel==5.5.6
- ipython==8.21.0
- ipython-genutils==0.2.0
- jedi==0.19.1
- jsonschema==4.21.1
- jsonschema-specifications==2023.12.1
- jupyter_client==8.6.0
- jupyter_core==5.7.1
- matplotlib-inline==0.1.6
- numpy==1.26.4
- orderedmultidict==1.0.1
- parso==0.8.3
- pathlib2==2.3.7.post1
- pexpect==4.9.0
- pillow==10.2.0
- platformdirs==4.2.0
- prompt-toolkit==3.0.43
- protobuf==4.25.2
- psutil==5.9.8
- ptyprocess==0.7.0
- pure-eval==0.2.2
- pyasn1==0.5.1
- pyasn1-modules==0.3.0
- Pygments==2.17.2
- PyJWT==2.8.0
- pyparsing==3.1.1
- python-dateutil==2.8.2
- PyYAML==6.0.1
- pyzmq==23.2.1
- referencing==0.33.0
- requests==2.31.0
- rpds-py==0.18.0
- rsa==4.9
- six==1.16.0
- stack-data==0.6.3
- tornado==6.4
- traitlets==5.14.1
- urllib3==2.2.0
- wcwidth==0.2.13
Environment setup completed successfully
Starting Task Execution:
[ColabKernelApp] CRITICAL | Bad config encountered during initialization: The 'kernel_class' trait of <__main__.ColabKernelApp object at 0x7c5aa3978160> instance must be a type, but 'google.colab._kernel.Kernel' could not be imported
Leaving process id 3941
DONE: Running task 'bfd4a833c8ba497794ecda6b4332b3aa', exit status 1
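The CRITICAL line above indicates that `google.colab._kernel` cannot be imported from the agent's virtual environment, which is consistent with the package list containing no Colab package. A minimal sketch for confirming this from a given interpreter follows; `module_available` is a hypothetical helper name, not part of ClearML or Colab.

```python
import importlib.util


def module_available(dotted_name: str) -> bool:
    """Return True if the import system can locate `dotted_name`."""
    try:
        # Note: for a dotted name, find_spec imports the parent packages first.
        return importlib.util.find_spec(dotted_name) is not None
    except (ImportError, ValueError):
        # A missing parent package raises ModuleNotFoundError (an ImportError).
        return False


if __name__ == "__main__":
    name = "google.colab._kernel"
    print(f"{name} importable: {module_available(name)}")
```

Running it with the interpreter from the log (`/root/.clearml/venvs-builds/3.10/bin/python`) would show whether that environment can see the module.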
Expected behaviour
I expected the tasks to execute successfully.
Environment
- Server type (app.clear.ml)
- Colab
Having the same issue here following the tutorial for remote Colab agents.
The problem is probably the diff stored in your task. I sent #1220 to hopefully fix it.
This is the diff ("uncommitted changes") you'll probably see if you open the task in your ClearML project:
# Copyright 2023 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Custom kernel launcher app to customize socket options."""
from ipykernel import kernelapp
import zmq
# We want to set the high water mark on *all* sockets to 0, as we don't want
# the backend dropping any messages. We want to set this before any calls to
# bind or connect.
#
# In principle we should override `init_sockets`, but it's hard to set options
# on the `zmq.Context` there without rewriting the entire method. Instead we
# settle for only setting this on `iopub`, as that's the most important for our
# use case.
class ColabKernelApp(kernelapp.IPKernelApp):
    def init_iopub(self, context):
        context.setsockopt(zmq.RCVHWM, 0)
        context.setsockopt(zmq.SNDHWM, 0)
        return super().init_iopub(context)

if __name__ == '__main__':
    ColabKernelApp.launch_instance()
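If, as suggested above, the task recorded Colab's kernel launcher instead of the notebook's own code, this can be spotted before enqueueing by inspecting the task's stored entry point and diff. The sketch below works on plain strings, so it makes no assumption about how those values are fetched. `looks_like_colab_launcher` is a hypothetical helper, and the two marker strings are simply taken from the log and diff in this thread.

```python
def looks_like_colab_launcher(entry_point: str, diff: str) -> bool:
    """Heuristic: did the task capture Colab's kernel launcher rather than user code?"""
    launcher_name = "colab_kernel_launcher.py"  # entry_point shown in the agent log
    launcher_marker = "ColabKernelApp"          # class defined in the captured diff
    return entry_point.strip().endswith(launcher_name) or launcher_marker in diff


if __name__ == "__main__":
    # Values copied from the failing task in this issue.
    print(looks_like_colab_launcher("colab_kernel_launcher.py", ""))
```

A task flagged this way would be expected to fail on an agent in the same manner as the log above, since the launcher depends on the Colab runtime.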
On my side the issue also occurred with normal projects; it was just publicly reproducible with the tutorial too.
Will try again tomorrow.
Hey @sobek1886! Just letting you know that this issue has been resolved in v1.15.0. Let us know if there are any issues :)