Training takes 2x longer since 1.13.0 with FastAI
mhtrinh opened this issue · comments
Training our model, which is based on FastAI, takes 2x longer with ClearML 1.13.0 compared to 1.12.2.
There are no errors or warnings.
I cannot share our code. Here is the requirements.txt of the virtualenv:
absl-py==2.1.0
adal==1.2.7
adlfs==2023.8.0
aiobotocore==2.5.4
aiohttp==3.9.3
aioitertools==0.11.0
aiosignal==1.3.1
annotated-types==0.6.0
antlr4-python3-runtime==4.13.1
applicationinsights==0.11.10
argcomplete==3.1.6
async-timeout==4.0.3
attrs==23.2.0
azure-appconfiguration==1.1.1
azure-batch==14.1.0
azure-cli==2.58.0
azure-cli-core==2.58.0
azure-cli-telemetry==1.1.0
azure-common==1.1.28
azure-core==1.30.1
azure-cosmos==3.2.0
azure-data-tables==12.4.0
azure-datalake-store==0.0.53
azure-graphrbac==0.60.0
azure-identity==1.15.0
azure-keyvault-administration==4.4.0b2
azure-keyvault-certificates==4.7.0
azure-keyvault-keys==4.9.0b3
azure-keyvault-secrets==4.7.0
azure-mgmt-advisor==9.0.0
azure-mgmt-apimanagement==4.0.0
azure-mgmt-appconfiguration==3.0.0
azure-mgmt-appcontainers==2.0.0
azure-mgmt-applicationinsights==1.0.0
azure-mgmt-authorization==4.0.0
azure-mgmt-batch==17.2.0
azure-mgmt-batchai==7.0.0b1
azure-mgmt-billing==6.0.0
azure-mgmt-botservice==2.0.0
azure-mgmt-cdn==12.0.0
azure-mgmt-cognitiveservices==13.5.0
azure-mgmt-compute==30.4.0
azure-mgmt-containerinstance==10.1.0
azure-mgmt-containerregistry==10.3.0
azure-mgmt-containerservice==29.1.0
azure-mgmt-core==1.4.0
azure-mgmt-cosmosdb==9.4.0
azure-mgmt-databoxedge==1.0.0
azure-mgmt-datalake-nspkg==3.0.1
azure-mgmt-datalake-store==0.5.0
azure-mgmt-datamigration==10.0.0
azure-mgmt-devtestlabs==4.0.0
azure-mgmt-dns==8.0.0
azure-mgmt-eventgrid==10.2.0b2
azure-mgmt-eventhub==10.1.0
azure-mgmt-extendedlocation==1.0.0b2
azure-mgmt-hdinsight==9.0.0
azure-mgmt-imagebuilder==1.3.0
azure-mgmt-iotcentral==10.0.0b2
azure-mgmt-iothub==3.0.0
azure-mgmt-iothubprovisioningservices==1.1.0
azure-mgmt-keyvault==10.3.0
azure-mgmt-kusto==0.3.0
azure-mgmt-loganalytics==13.0.0b4
azure-mgmt-managedservices==1.0.0
azure-mgmt-managementgroups==1.0.0
azure-mgmt-maps==2.0.0
azure-mgmt-marketplaceordering==1.1.0
azure-mgmt-media==9.0.0
azure-mgmt-monitor==5.0.1
azure-mgmt-msi==7.0.0
azure-mgmt-netapp==10.1.0
azure-mgmt-nspkg==3.0.2
azure-mgmt-policyinsights==1.1.0b4
azure-mgmt-privatedns==1.0.0
azure-mgmt-rdbms==10.2.0b15
azure-mgmt-recoveryservices==2.5.0
azure-mgmt-recoveryservicesbackup==8.0.0
azure-mgmt-redhatopenshift==1.4.0
azure-mgmt-redis==14.3.0
azure-mgmt-resource==23.1.0b2
azure-mgmt-search==9.1.0
azure-mgmt-security==5.0.0
azure-mgmt-servicebus==8.2.0
azure-mgmt-servicefabric==1.0.0
azure-mgmt-servicefabricmanagedclusters==1.0.0
azure-mgmt-servicelinker==1.2.0b1
azure-mgmt-signalr==2.0.0b1
azure-mgmt-sql==4.0.0b15
azure-mgmt-sqlvirtualmachine==1.0.0b5
azure-mgmt-storage==21.1.0
azure-mgmt-synapse==2.1.0b5
azure-mgmt-trafficmanager==1.0.0
azure-mgmt-web==7.2.0
azure-monitor-query==1.2.0
azure-multiapi-storage==1.2.0
azure-nspkg==3.0.2
azure-storage-blob==12.19.1
azure-storage-common==1.4.2
azure-synapse-accesscontrol==0.5.0
azure-synapse-artifacts==0.18.0
azure-synapse-managedprivateendpoints==0.4.0
azure-synapse-spark==0.2.0
bcrypt==4.1.2
blis==0.7.11
botocore==1.31.17
cachetools==5.3.3
catalogue==2.0.10
certifi==2024.2.2
cffi==1.16.0
chardet==5.2.0
charset-normalizer==3.3.2
clearml==1.14.2
click==8.1.7
cloudpathlib==0.16.0
colorama==0.4.6
confection==0.1.4
contextlib2==21.6.0
contourpy==1.2.0
cryptography==42.0.5
cycler==0.12.1
cymem==2.0.8
decorator==5.1.1
Deprecated==1.2.14
distro==1.9.0
fabric==3.2.2
fastai==2.7.14
fastcore==1.5.29
fastdownload==0.0.7
fastprogress==1.0.3
filelock==3.13.1
fonttools==4.49.0
frozenlist==1.4.1
fsspec==2023.6.0
furl==2.1.3
gitdb==4.0.11
GitPython==3.1.42
google-auth==2.28.1
google-auth-oauthlib==1.0.0
grpcio==1.62.0
huggingface-hub==0.21.4
humanfriendly==10.0
idna==3.6
iniconfig==2.0.0
invoke==2.2.0
isodate==0.6.1
javaproperties==0.5.2
Jinja2==3.1.3
jmespath==1.0.1
joblib==1.3.2
jsondiff==2.0.0
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
kiwisolver==1.4.5
knack==0.11.0
kornia==0.7.1
lakefs-client==0.107.0
langcodes==3.3.0
Markdown==3.5.2
MarkupSafe==2.1.5
matplotlib==3.8.3
ml-collections==0.1.1
mpmath==1.3.0
msal==1.26.0
msal-extensions==1.0.0
msrest==0.7.1
msrestazure==0.6.4
multidict==6.0.5
murmurhash==1.0.10
networkx==3.2.1
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.19.3
nvidia-nvjitlink-cu12==12.4.99
nvidia-nvtx-cu12==12.1.105
oauthlib==3.2.2
onnx==1.14.0
opencv-python-headless==4.8.1.78
orderedmultidict==1.0.1
packaging==23.2
pandas==2.2.1
paramiko==3.4.0
pathlib2==2.3.7.post1
Pillow==9.5.0
pipdeptree==2.13.0
pkginfo==1.10.0
portalocker==2.8.2
preshed==3.0.9
protobuf==4.25.3
psutil==5.9.8
pyarrow==13.0.0
pyasn1==0.5.1
pyasn1-modules==0.3.0
pycomposefile==0.0.30
pycparser==2.21
pydantic==2.6.3
pydantic_core==2.16.3
PyGithub==1.59.1
Pygments==2.17.2
PyJWT==2.4.0
PyNaCl==1.5.0
pyOpenSSL==24.0.0
pyparsing==3.1.2
PySocks==1.7.1
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.1
referencing==0.33.0
requests==2.31.0
requests-oauthlib==1.3.1
rpds-py==0.18.0
rsa==4.9
s3fs==2023.6.0
safetensors==0.4.2
scikit-learn==1.4.1.post1
scipy==1.12.0
scp==0.13.6
seaborn==0.13.2
self-supervised==1.0.4
semver==2.13.0
six==1.16.0
smart-open==6.4.0
smmap==5.0.1
spacy==3.7.4
spacy-legacy==3.0.12
spacy-loggers==1.0.5
srsly==2.4.8
sshtunnel==0.1.5
sympy==1.12
tabulate==0.9.0
tensorboard==2.14.0
tensorboard-data-server==0.7.2
thinc==8.2.3
threadpoolctl==3.3.0
timm==0.9.16
torch==2.2.1
torchvision==0.17.1
tqdm==4.66.2
triton==2.2.0
typer==0.9.0
typing_extensions==4.10.0
tzdata==2024.1
urllib3==1.26.18
wasabi==1.1.2
weasel==0.3.4
websocket-client==1.3.3
Werkzeug==3.0.1
wrapt==1.16.0
xmltodict==0.13.0
yarl==1.9.4
To reproduce: pip install clearml==1.12.2 and run the training, then pip install clearml==1.13.0 and re-run the same code.
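For anyone trying to quantify the slowdown between the two installs, a minimal timing sketch could look like the following. The `train` function is a hypothetical stand-in for the real training entry point (the original code is not shared), and `time_call` / `installed_version` are illustrative helper names, not part of ClearML.

```python
import time
from importlib import metadata


def installed_version(package="clearml"):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    def train():
        # Hypothetical placeholder: replace with the real training call.
        time.sleep(0.1)

    _, seconds = time_call(train)
    print(f"clearml=={installed_version()} training took {seconds:.2f}s")
```

Running the same script once per installed clearml version gives two wall-clock numbers that can be compared directly.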
OS: openSUSE Leap 15.4
OS Linux-5.14.21-150400.24.60-default-x86_64-with-glibc2.31
cpu_cores 20
gpu_count 1
gpu_driver_cuda_version 12.4
gpu_driver_version 550.54.14
gpu_memory 48GB
gpu_type NVIDIA RTX A6000
Hey @mhtrinh, have you observed the same slowdown with the newer versions of ClearML? The most recent one is 1.14.4
Yes, this also happens with the current version 1.14.4; training is still 2x slower.
Note: this may be specific to fastai, as we have another network based on yolov5 where this does not happen.
Hi @mhtrinh ! It looks like calculating the metrics that ClearML reports may take a long time. We will try to improve performance.
In the meantime, you could disable the fastai bindings using auto_connect_frameworks={"fastai": False} in Task.init.
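A minimal sketch of that workaround, assuming a reachable ClearML server and placeholder project/task names (both names below are made up for illustration). The helper functions `task_init_kwargs` and `start_task` are not ClearML APIs; only `Task.init` and its `auto_connect_frameworks` argument are.

```python
def task_init_kwargs(project_name, task_name):
    """Build Task.init keyword arguments with only the fastai binding disabled."""
    return {
        "project_name": project_name,
        "task_name": task_name,
        # Turn off the automatic fastai integration; other frameworks are not listed here.
        "auto_connect_frameworks": {"fastai": False},
    }


def start_task(project_name="examples", task_name="fastai training"):
    """Create the ClearML task; the placeholder names are assumptions."""
    from clearml import Task  # imported lazily so the module loads without clearml

    return Task.init(**task_init_kwargs(project_name, task_name))


if __name__ == "__main__":
    task = start_task()
    # ... build the fastai Learner and train as usual ...
```

With the binding disabled, fastai metrics are no longer captured automatically, so anything needed in the ClearML UI would have to be reported manually.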
Hi @mhtrinh ! We will release a fix for this issue in the next clearml release clearml==1.16.0