allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution

Home Page: https://clear.ml/docs

Training takes 2x longer since 1.13.0 with FastAI

mhtrinh opened this issue

Training our model, which is based on FastAI, takes 2x longer with ClearML 1.13.0 compared to 1.12.2.

There are no errors or warnings.

I cannot share our code. Here is the requirements.txt of the virtualenv:

absl-py==2.1.0
adal==1.2.7
adlfs==2023.8.0
aiobotocore==2.5.4
aiohttp==3.9.3
aioitertools==0.11.0
aiosignal==1.3.1
annotated-types==0.6.0
antlr4-python3-runtime==4.13.1
applicationinsights==0.11.10
argcomplete==3.1.6
async-timeout==4.0.3
attrs==23.2.0
azure-appconfiguration==1.1.1
azure-batch==14.1.0
azure-cli==2.58.0
azure-cli-core==2.58.0
azure-cli-telemetry==1.1.0
azure-common==1.1.28
azure-core==1.30.1
azure-cosmos==3.2.0
azure-data-tables==12.4.0
azure-datalake-store==0.0.53
azure-graphrbac==0.60.0
azure-identity==1.15.0
azure-keyvault-administration==4.4.0b2
azure-keyvault-certificates==4.7.0
azure-keyvault-keys==4.9.0b3
azure-keyvault-secrets==4.7.0
azure-mgmt-advisor==9.0.0
azure-mgmt-apimanagement==4.0.0
azure-mgmt-appconfiguration==3.0.0
azure-mgmt-appcontainers==2.0.0
azure-mgmt-applicationinsights==1.0.0
azure-mgmt-authorization==4.0.0
azure-mgmt-batch==17.2.0
azure-mgmt-batchai==7.0.0b1
azure-mgmt-billing==6.0.0
azure-mgmt-botservice==2.0.0
azure-mgmt-cdn==12.0.0
azure-mgmt-cognitiveservices==13.5.0
azure-mgmt-compute==30.4.0
azure-mgmt-containerinstance==10.1.0
azure-mgmt-containerregistry==10.3.0
azure-mgmt-containerservice==29.1.0
azure-mgmt-core==1.4.0
azure-mgmt-cosmosdb==9.4.0
azure-mgmt-databoxedge==1.0.0
azure-mgmt-datalake-nspkg==3.0.1
azure-mgmt-datalake-store==0.5.0
azure-mgmt-datamigration==10.0.0
azure-mgmt-devtestlabs==4.0.0
azure-mgmt-dns==8.0.0
azure-mgmt-eventgrid==10.2.0b2
azure-mgmt-eventhub==10.1.0
azure-mgmt-extendedlocation==1.0.0b2
azure-mgmt-hdinsight==9.0.0
azure-mgmt-imagebuilder==1.3.0
azure-mgmt-iotcentral==10.0.0b2
azure-mgmt-iothub==3.0.0
azure-mgmt-iothubprovisioningservices==1.1.0
azure-mgmt-keyvault==10.3.0
azure-mgmt-kusto==0.3.0
azure-mgmt-loganalytics==13.0.0b4
azure-mgmt-managedservices==1.0.0
azure-mgmt-managementgroups==1.0.0
azure-mgmt-maps==2.0.0
azure-mgmt-marketplaceordering==1.1.0
azure-mgmt-media==9.0.0
azure-mgmt-monitor==5.0.1
azure-mgmt-msi==7.0.0
azure-mgmt-netapp==10.1.0
azure-mgmt-nspkg==3.0.2
azure-mgmt-policyinsights==1.1.0b4
azure-mgmt-privatedns==1.0.0
azure-mgmt-rdbms==10.2.0b15
azure-mgmt-recoveryservices==2.5.0
azure-mgmt-recoveryservicesbackup==8.0.0
azure-mgmt-redhatopenshift==1.4.0
azure-mgmt-redis==14.3.0
azure-mgmt-resource==23.1.0b2
azure-mgmt-search==9.1.0
azure-mgmt-security==5.0.0
azure-mgmt-servicebus==8.2.0
azure-mgmt-servicefabric==1.0.0
azure-mgmt-servicefabricmanagedclusters==1.0.0
azure-mgmt-servicelinker==1.2.0b1
azure-mgmt-signalr==2.0.0b1
azure-mgmt-sql==4.0.0b15
azure-mgmt-sqlvirtualmachine==1.0.0b5
azure-mgmt-storage==21.1.0
azure-mgmt-synapse==2.1.0b5
azure-mgmt-trafficmanager==1.0.0
azure-mgmt-web==7.2.0
azure-monitor-query==1.2.0
azure-multiapi-storage==1.2.0
azure-nspkg==3.0.2
azure-storage-blob==12.19.1
azure-storage-common==1.4.2
azure-synapse-accesscontrol==0.5.0
azure-synapse-artifacts==0.18.0
azure-synapse-managedprivateendpoints==0.4.0
azure-synapse-spark==0.2.0
bcrypt==4.1.2
blis==0.7.11
botocore==1.31.17
cachetools==5.3.3
catalogue==2.0.10
certifi==2024.2.2
cffi==1.16.0
chardet==5.2.0
charset-normalizer==3.3.2
clearml==1.14.2
click==8.1.7
cloudpathlib==0.16.0
colorama==0.4.6
confection==0.1.4
contextlib2==21.6.0
contourpy==1.2.0
cryptography==42.0.5
cycler==0.12.1
cymem==2.0.8
decorator==5.1.1
Deprecated==1.2.14
distro==1.9.0
fabric==3.2.2
fastai==2.7.14
fastcore==1.5.29
fastdownload==0.0.7
fastprogress==1.0.3
filelock==3.13.1
fonttools==4.49.0
frozenlist==1.4.1
fsspec==2023.6.0
furl==2.1.3
gitdb==4.0.11
GitPython==3.1.42
google-auth==2.28.1
google-auth-oauthlib==1.0.0
grpcio==1.62.0
huggingface-hub==0.21.4
humanfriendly==10.0
idna==3.6
iniconfig==2.0.0
invoke==2.2.0
isodate==0.6.1
javaproperties==0.5.2
Jinja2==3.1.3
jmespath==1.0.1
joblib==1.3.2
jsondiff==2.0.0
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
kiwisolver==1.4.5
knack==0.11.0
kornia==0.7.1
lakefs-client==0.107.0
langcodes==3.3.0
Markdown==3.5.2
MarkupSafe==2.1.5
matplotlib==3.8.3
ml-collections==0.1.1
mpmath==1.3.0
msal==1.26.0
msal-extensions==1.0.0
msrest==0.7.1
msrestazure==0.6.4
multidict==6.0.5
murmurhash==1.0.10
networkx==3.2.1
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.19.3
nvidia-nvjitlink-cu12==12.4.99
nvidia-nvtx-cu12==12.1.105
oauthlib==3.2.2
onnx==1.14.0
opencv-python-headless==4.8.1.78
orderedmultidict==1.0.1
packaging==23.2
pandas==2.2.1
paramiko==3.4.0
pathlib2==2.3.7.post1
Pillow==9.5.0
pipdeptree==2.13.0
pkginfo==1.10.0
portalocker==2.8.2
preshed==3.0.9
protobuf==4.25.3
psutil==5.9.8
pyarrow==13.0.0
pyasn1==0.5.1
pyasn1-modules==0.3.0
pycomposefile==0.0.30
pycparser==2.21
pydantic==2.6.3
pydantic_core==2.16.3
PyGithub==1.59.1
Pygments==2.17.2
PyJWT==2.4.0
PyNaCl==1.5.0
pyOpenSSL==24.0.0
pyparsing==3.1.2
PySocks==1.7.1
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.1
referencing==0.33.0
requests==2.31.0
requests-oauthlib==1.3.1
rpds-py==0.18.0
rsa==4.9
s3fs==2023.6.0
safetensors==0.4.2
scikit-learn==1.4.1.post1
scipy==1.12.0
scp==0.13.6
seaborn==0.13.2
self-supervised==1.0.4
semver==2.13.0
six==1.16.0
smart-open==6.4.0
smmap==5.0.1
spacy==3.7.4
spacy-legacy==3.0.12
spacy-loggers==1.0.5
srsly==2.4.8
sshtunnel==0.1.5
sympy==1.12
tabulate==0.9.0
tensorboard==2.14.0
tensorboard-data-server==0.7.2
thinc==8.2.3
threadpoolctl==3.3.0
timm==0.9.16
torch==2.2.1
torchvision==0.17.1
tqdm==4.66.2
triton==2.2.0
typer==0.9.0
typing_extensions==4.10.0
tzdata==2024.1
urllib3==1.26.18
wasabi==1.1.2
weasel==0.3.4
websocket-client==1.3.3
Werkzeug==3.0.1
wrapt==1.16.0
xmltodict==0.13.0
yarl==1.9.4

To reproduce, pip install clearml==1.12.2, run the training, then pip install clearml==1.13.0 and re-run the same code.
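One way to make the comparison concrete is to time the same training call under each installed version. The sketch below is only an illustration: `wall_clock` and `results` are hypothetical names (not part of ClearML or fastai), and the `time.sleep` call is a placeholder for the real training code, which is not shared here.

```python
import time
from contextlib import contextmanager


@contextmanager
def wall_clock(label, results):
    """Store the wall-clock duration (seconds) of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


results = {}
with wall_clock("training", results):
    # Placeholder: replace with the actual training call.
    time.sleep(0.01)

print(f"training took {results['training']:.3f}s")
```

Running this once per installed clearml version, with everything else in the virtualenv unchanged, gives two numbers that can be compared directly.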

OS: openSUSE Leap 15.4

OS  Linux-5.14.21-150400.24.60-default-x86_64-with-glibc2.31
cpu_cores 20
gpu_count 1
gpu_driver_cuda_version 12.4
gpu_driver_version 550.54.14
gpu_memory 48GB
gpu_type NVIDIA RTX A6000

Hey @mhtrinh, have you observed the same slowdown with the newer versions of ClearML? The most recent one is 1.14.4

Yes, this also happens with the current version 1.14.4: training is still 2x slower.

Note: this may be specific to fastai, as we have another network based on yolov5 and the slowdown does not happen there.

Hi @mhtrinh ! It looks like calculating the metrics that ClearML reports may take a long time. We will try to improve performance.
In the meantime, you could disable fastai bindings using auto_connect_frameworks={"fastai": False} in Task.init
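A minimal sketch of that workaround is shown below. The `auto_connect_frameworks` argument is the one named above; the helper name `init_task_without_fastai_binding` and the project and task names in the usage comment are placeholders chosen for illustration.

```python
# Disable only the fastai binding; other framework bindings keep their defaults.
FRAMEWORK_BINDINGS = {"fastai": False}


def init_task_without_fastai_binding(project_name, task_name):
    """Create a ClearML task with automatic fastai logging turned off."""
    # Imported lazily so this module can be loaded without clearml installed.
    from clearml import Task

    return Task.init(
        project_name=project_name,
        task_name=task_name,
        auto_connect_frameworks=FRAMEWORK_BINDINGS,
    )


# Usage (placeholder names), called once before training starts:
# task = init_task_without_fastai_binding("my-project", "fastai-training")
```

With the binding disabled, ClearML no longer reports fastai metrics automatically, so any metrics that are still needed would have to be reported manually.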

Hi @mhtrinh ! We will release a fix for this issue in the next clearml release clearml==1.16.0

Hey @mhtrinh! Just letting you know that this issue has been resolved in the recently released v1.16.0. Let us know if there are any issues :)
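To confirm that an environment already has the release containing the fix, the installed version can be compared against 1.16.0. This is a rough sketch: `parse_release` and `has_fix` are hypothetical helpers, and the parsing ignores pre-release suffixes such as `rc1`.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# First clearml release reported in this thread to contain the fix.
FIXED_IN = (1, 16, 0)


def parse_release(version_string):
    """Return (major, minor, patch) from the leading numeric part of a version string."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version_string)
    if match is None:
        raise ValueError(f"unrecognised version: {version_string!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def has_fix(version_string):
    """True if the given clearml version is at least 1.16.0."""
    return parse_release(version_string) >= FIXED_IN


try:
    installed = version("clearml")
    print(f"clearml {installed}: fix included = {has_fix(installed)}")
except PackageNotFoundError:
    print("clearml is not installed in this environment")
```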