allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution

Home Page: https://clear.ml/docs

ClearML does not find all packages

terbed opened this issue · comments

Describe the bug

I cannot reproduce experiments remotely, because the environment is improperly constructed.
The recognized packages:

# Python 3.11.8 (main, Feb 12 2024, 14:50:05) [GCC 13.2.1 20230801]

clearml == 1.15.1
kiwisolver == 1.4.5
lightning == 2.2.0.post0
torch == 2.2.0+cu118

Actual packages in the environment:

numpy==1.26.3
PyYAML==6.0.1
torch==2.2.0+cu118
torchmetrics==1.3.1
torchvision==0.17.0+cu118
tqdm==4.66.2
lightning==2.2.0.post0
lightning[pytorch-extra]
matplotlib
pandas

So the remotely reproduced training fails because torchvision is not installed in the env:


- nvidia-nccl-cu12==2.19.3
- nvidia-nvjitlink-cu12==12.4.127
- nvidia-nvtx-cu12==12.1.105
- orderedmultidict==1.0.1
- packaging==24.0
- pathlib2==2.3.7.post1
- pillow==10.3.0
- platformdirs==4.2.0
- psutil==5.9.8
- PyJWT==2.8.0
- pyparsing==3.1.2
- python-dateutil==2.8.2
- pytorch-lightning==2.2.1
- PyYAML==6.0.1
- referencing==0.34.0
- requests==2.31.0
- rpds-py==0.18.0
- six==1.16.0
- sympy==1.12
- torch==2.2.0+cu121
- torchmetrics==1.3.2
- tqdm==4.66.2
- triton==2.2.0
- typing_extensions==4.11.0
- urllib3==1.26.18
- virtualenv==20.25.1
- yarl==1.9.4
Environment setup completed successfully
Starting Task Execution:
2024-04-11 22:42:01
Traceback (most recent call last):
  File "/root/.clearml/venvs-builds/3.10/task_repository/PhaseReconstruction.git/main.py", line 2, in <module>
    from src.data import PRDataModule
  File "/root/.clearml/venvs-builds/3.10/task_repository/PhaseReconstruction.git/src/data.py", line 3, in <module>
    import torchvision.transforms.functional as tvf
ModuleNotFoundError: No module named 'torchvision'
2024-04-11 22:42:01
Process failed, exit code 1

Environment

  • Self-hosted
  • WebApp: 1.15.0-472 • Server: 1.15.0-472 • API: 2.29
  • Python Version 3.11
  • Linux Ubuntu 22

Related Discussion

I have a similar issue, #1198, which has also not been resolved yet. In my case, though, I found a very strange workaround: adding the line import tmp to the main.py that I execute, where tmp.py is an empty file (no code inside it) in the same folder.

Hi @wxdrizzle,
Thanks for linking in your similar issue.

Some updates:

  • I realized that the __init__.py file was missing from my src folder. After I added this file, clearml successfully recognized the torchvision package and some additional missing packages.
  • It still could not recognize the lightning[pytorch-extra] package.
  • @wxdrizzle's workaround does not help with this (that could be related to the missing __init__.py).

Hi @terbed @wxdrizzle! We are using a pigar fork to auto-fetch the requirements. Only top-level imports will be fetched (see the FAQ: https://github.com/damnever/pigar#faq). Also, note that only packages will be inspected, so the __init__.py is mandatory if you wish local files to be inspected.
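To make the point about import-based detection concrete, here is a minimal sketch using only the standard library. It is not pigar's or ClearML's actual code; the function name imported_roots is hypothetical, and the two source strings are taken from the traceback above. It only illustrates why a package imported solely from a local file that is not inspected (here src/data.py) never shows up in the detected list.

```python
import ast


def imported_roots(source: str) -> set:
    """Collect the root module names that a piece of source code imports."""
    roots = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a.b.c as x" -> "a"
            roots.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # "from a.b import c" -> "a" (relative imports are skipped)
            roots.add(node.module.split(".")[0])
    return roots


# The two import lines shown in the traceback of this issue.
main_py = "from src.data import PRDataModule\n"
data_py = "import torchvision.transforms.functional as tvf\n"

# If only main.py is scanned, torchvision is invisible.
only_main = imported_roots(main_py)
# Once src/data.py is scanned as well, torchvision appears.
with_data = only_main | imported_roots(data_py)

print(sorted(only_main))
print(sorted(with_data))
```

Under the same reasoning, an install-time extra such as lightning[pytorch-extra] is never imported under that name, so a scan of import statements alone has nothing to match it against.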

There are a few ways to specify other packages or to use another auto-detection mechanism:

  1. Use one of these functions to explicitly mention packages:
  2. Use https://clear.ml/docs/latest/docs/references/sdk/task#taskforce_requirements_env_freeze to use pip freeze or conda list for package detection (or set sdk.development.detect_with_pip_freeze or sdk.development.detect_with_conda_freeze to true in clearml.conf to achieve the same thing)
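A rough sketch of how the two options might look in a training script. Assumptions: Task.add_requirements is my guess at the kind of function the first item refers to (the original links are not shown here); the helper names split_requirement and register_requirements, the MISSING list, and the project and task names are hypothetical. Both calls need to happen before Task.init().

```python
from typing import Iterable, Optional, Tuple


def split_requirement(line: str) -> Tuple[str, Optional[str]]:
    """Split 'name==version' into (name, version); version is None if unpinned."""
    name, sep, version = line.strip().partition("==")
    return name, (version if sep else None)


def register_requirements(lines: Iterable[str]) -> None:
    """Explicitly add packages that auto-detection missed (option 1)."""
    # Imported lazily so split_requirement works without clearml installed.
    from clearml import Task

    for line in lines:
        name, version = split_requirement(line)
        # With version=None, ClearML records the currently installed version.
        Task.add_requirements(name, version)


# Packages from this issue that were not detected (hypothetical selection).
MISSING = ["torchvision==0.17.0+cu118", "matplotlib", "pandas"]


def main() -> None:
    from clearml import Task

    register_requirements(MISSING)
    # Option 2 (alternative): record the full `pip freeze` output instead.
    # Task.force_requirements_env_freeze()
    Task.init(project_name="PhaseReconstruction", task_name="train")  # hypothetical names
```

Calling main() at the top of the entry script, before any training code runs, should be enough for the agent to install the listed packages remotely. Whether an extra such as lightning[pytorch-extra] is accepted as a package name is something I have not verified.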

Hi @eugen-ajechiloae-clearml,
Thank you for the information, this explains everything! :)

Best wishes,
Daniel

Thank you very much for the detailed explanation!