neptune-ai / neptune-client

📘 The MLOps stack component for experiment tracking

Home Page: https://neptune.ai

BUG: Neptune v1.3.2 is slow when using forked data loaders in Linux

issamemari opened this issue · comments

Describe the bug

I have a training script that uses the Trainer class from the HuggingFace Transformers library to train a text classification model. The Trainer class uses a NeptuneCallback to log training metrics to Neptune.

I noticed that my training script is inexplicably slow, so I profiled it using py-spy and found that neptune/internal/threading/daemon.py is taking more than 90% of the execution time of my script. I tried disabling the Neptune callback and saw a dramatic improvement in the speed of the training script.

Reproduction

Here's my training code. I have the Neptune client installed, and the project name and API key are in the environment variables, so the Trainer class automatically sends logs to Neptune using the NeptuneCallback class.
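
(For context, a minimal sketch of the environment variables the Neptune client reads; the values below are placeholders, not the ones used here:)

import os

os.environ["NEPTUNE_PROJECT"] = "my-workspace/my-project"  # placeholder project name
os.environ["NEPTUNE_API_TOKEN"] = "<your-api-token>"       # placeholder token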

import logging

import hydra
from transformers import Trainer, TrainingArguments

from dataset import LukalficationDataset
from dataset.classes import LEVEL1, LEVEL2, LEVEL3
from metrics import MetricsLogger
from model import LukalificationModel
from preprocessing import MultiLabelOneHotEncoder, Processor

logger = logging.getLogger(__name__)


@hydra.main(config_path=".", config_name="config.yaml", version_base="1.2")
def main(config):
    logger.info("Creating dataset")
    train_dataset = LukalficationDataset(**config.dataset.train)
    eval_dataset = LukalficationDataset(**config.dataset.eval)

    logger.info("Creating preprocessor")
    processor = Processor(
        tokenizer_config=config.tokenizer,
        label_encoders=[
            MultiLabelOneHotEncoder(LEVEL1),
            MultiLabelOneHotEncoder(LEVEL2),
            MultiLabelOneHotEncoder(LEVEL3),
        ],
    )

    logger.info("Creating model")
    model = LukalificationModel([len(LEVEL1), len(LEVEL2), len(LEVEL3)], **config.model)

    training_args = TrainingArguments(**config.training)

    logger.info("Training model")
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=processor.data_collator,
        compute_metrics=MetricsLogger(
            label_encoders=processor.label_encoders
        ).compute_metrics,
    )

    trainer.train()

    logger.info("Training complete")


if __name__ == "__main__":
    main()

Expected behavior

I expect logging to Neptune to not impact the speed of my training script.

Environment

The output of pip list:

Package                    Version
-------------------------- ---------
accelerate                 0.20.3
aiobotocore                2.5.2
aiohttp                    3.8.4
aiohttp-retry              2.8.3
aioitertools               0.11.0
aiosignal                  1.3.1
amqp                       5.1.1
annotated-types            0.5.0
antlr4-python3-runtime     4.9.3
appdirs                    1.4.4
appnope                    0.1.3
arrow                      1.2.3
asn1crypto                 1.5.1
asttokens                  2.2.1
async-timeout              4.0.2
asyncssh                   2.13.2
atpublic                   4.0
attrs                      23.1.0
backcall                   0.2.0
backoff                    2.2.1
beautifulsoup4             4.12.2
billiard                   4.1.0
boto3                      1.26.161
botocore                   1.29.161
bravado                    11.0.3
bravado-core               5.17.1
celery                     5.3.1
certifi                    2023.5.7
cffi                       1.15.1
charset-normalizer         3.2.0
click                      8.1.4
click-didyoumean           0.3.0
click-plugins              1.1.1
click-repl                 0.3.0
colorama                   0.4.6
comm                       0.1.3
configobj                  5.0.8
cryptography               40.0.2
debugpy                    1.6.7
decorator                  5.1.1
dictdiffer                 0.9.0
diskcache                  5.6.1
distro                     1.8.0
dpath                      2.1.6
dulwich                    0.21.5
dvc                        3.5.1
dvc-data                   2.5.0
dvc-http                   2.30.2
dvc-objects                0.23.0
dvc-render                 0.5.3
dvc-s3                     2.23.0
dvc-studio-client          0.11.0
dvc-task                   0.3.0
executing                  1.2.0
filelock                   3.12.2
flatten-dict               0.4.2
flufl.lock                 7.1.1
fqdn                       1.5.1
frozenlist                 1.3.3
fsspec                     2023.6.0
funcy                      2.0
future                     0.18.3
gitdb                      4.0.10
GitPython                  3.1.32
grandalf                   0.8
huggingface-hub            0.16.4
hydra-core                 1.3.2
idna                       3.4
ipykernel                  6.24.0
ipython                    8.14.0
isoduration                20.11.0
iterative-telemetry        0.0.8
jedi                       0.18.2
Jinja2                     3.1.2
jmespath                   1.0.1
joblib                     1.3.1
jsonpointer                2.4
jsonref                    1.1.0
jsonschema                 4.18.0
jsonschema-specifications  2023.6.1
jupyter_client             8.3.0
jupyter_core               5.3.1
kombu                      5.3.1
markdown-it-py             3.0.0
MarkupSafe                 2.1.3
matplotlib-inline          0.1.6
mdurl                      0.1.2
monotonic                  1.6
mpmath                     1.3.0
msgpack                    1.0.5
multidict                  6.0.4
nanotime                   0.5.2
neptune                    1.3.2
nest-asyncio               1.5.6
networkx                   3.1
numpy                      1.25.1
oauthlib                   3.2.2
omegaconf                  2.3.0
orjson                     3.9.2
oscrypto                   1.3.0
packaging                  23.1
pandas                     2.0.3
parso                      0.8.3
pathspec                   0.11.1
pexpect                    4.8.0
pickleshare                0.7.5
Pillow                     10.0.0
pip                        23.1.2
platformdirs               3.8.1
prompt-toolkit             3.0.39
protobuf                   3.19.6
psutil                     5.9.5
ptyprocess                 0.7.0
pure-eval                  0.2.2
pyarrow                    10.0.1
pycparser                  2.21
pycryptodomex              3.18.0
pydantic                   2.0.2
pydantic_core              2.1.2
pydot                      1.4.2
pygit2                     1.12.2
Pygments                   2.15.1
pygtrie                    2.5.0
PyJWT                      2.7.0
pyOpenSSL                  23.2.0
pyparsing                  3.1.0
python-dateutil            2.8.2
pytz                       2023.3
PyYAML                     6.0
pyzmq                      25.1.0
referencing                0.29.1
regex                      2023.6.3
requests                   2.31.0
requests-oauthlib          1.3.1
rfc3339-validator          0.1.4
rfc3987                    1.3.8
rich                       13.4.2
rpds-py                    0.8.10
ruamel.yaml                0.17.32
ruamel.yaml.clib           0.2.7
s3fs                       2023.6.0
s3transfer                 0.6.1
safetensors                0.3.1
scikit-learn               1.3.0
scipy                      1.11.1
scmrepo                    1.0.4
sentencepiece              0.1.99
setuptools                 68.0.0
shortuuid                  1.0.11
shtab                      1.6.2
simplejson                 3.19.1
six                        1.16.0
smmap                      5.0.0
snowflake-connector-python 3.0.4
sortedcontainers           2.4.0
soupsieve                  2.4.1
sqltrie                    0.7.0
stack-data                 0.6.2
swagger-spec-validator     3.0.3
sympy                      1.12
tabulate                   0.9.0
threadpoolctl              3.1.0
tokenizers                 0.13.3
tomlkit                    0.11.8
torch                      2.0.1
tornado                    6.3.2
tqdm                       4.65.0
traitlets                  5.9.0
transformers               4.30.2
typing_extensions          4.7.1
tzdata                     2023.3
uri-template               1.3.0
urllib3                    1.26.16
vine                       5.0.0
voluptuous                 0.13.1
wcwidth                    0.2.6
webcolors                  1.13
websocket-client           1.6.1
wheel                      0.40.0
wrapt                      1.15.0
yarl                       1.9.2
zc.lockfile                3.0.post1

The operating system you're using:
Linux

The output of python --version:
Python 3.11.3

Hello @issamemari 👋

Are you logging model checkpoints to Neptune? Syncing can take a while if large checkpoints are uploaded frequently.

To check whether this is really the case, can you use the callback method of logging with log_checkpoints="best" instead of relying on Transformers' default report_to? This will upload only the best model checkpoint instead of a checkpoint every 500 steps.

Be sure to update report_to to "none" in TrainingArguments() to avoid creating multiple callbacks.

Using the callback method: https://docs.neptune.ai/integrations/transformers/#__tabbed_2_2
NeptuneCallback() API reference: https://docs.neptune.ai/api/integrations/transformers/#neptunecallback
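
For reference, a minimal sketch of the suggested setup (assuming the model and datasets from the script above; not verbatim code from this thread):

from transformers import Trainer, TrainingArguments
from transformers.integrations import NeptuneCallback

training_args = TrainingArguments(
    output_dir="checkpoints",  # placeholder output directory
    report_to="none",          # avoid the auto-created NeptuneCallback
)
trainer = Trainer(
    model=model,                  # assumed to be defined as in the script above
    args=training_args,
    train_dataset=train_dataset,  # assumed
    eval_dataset=eval_dataset,    # assumed
    callbacks=[NeptuneCallback(log_checkpoints="best")],  # upload only the best checkpoint
)
trainer.train()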

Please let me know if this works better for you 🙏

I've noticed something similar after upgrading Neptune (not sure what version I was using before but it's from the last month or so) but without using Trainer from transformers (just PyTorch).

Hello @bryant1410 ,
Could you please create a new issue with your case's details so we can have a closer look?

I am not logging model checkpoints to Neptune, I'm just saving them to disk. I tried what you suggested, here's the code:

import logging

import hydra
from transformers import Trainer, TrainingArguments
from transformers.integrations import NeptuneCallback

from dataset import LukalficationDataset
from dataset.classes import LEVEL1, LEVEL2, LEVEL3
from metrics import MetricsLogger
from model import LukalificationModel
from preprocessing import MultiLabelOneHotEncoder, Processor

logger = logging.getLogger(__name__)


@hydra.main(config_path=".", config_name="config.yaml", version_base="1.2")
def main(config):
    logger.info("Creating dataset")
    train_dataset = LukalficationDataset(**config.dataset.train)
    eval_dataset = LukalficationDataset(**config.dataset.eval)

    logger.info("Creating preprocessor")
    processor = Processor(
        tokenizer_config=config.tokenizer,
        label_encoders=[
            MultiLabelOneHotEncoder(LEVEL1),
            MultiLabelOneHotEncoder(LEVEL2),
            MultiLabelOneHotEncoder(LEVEL3),
        ],
    )

    logger.info("Creating model")
    model = LukalificationModel([len(LEVEL1), len(LEVEL2), len(LEVEL3)], **config.model)

    training_args = TrainingArguments(**config.training)

    logger.info("Training model")
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=processor.data_collator,
        compute_metrics=MetricsLogger(
            label_encoders=processor.label_encoders
        ).compute_metrics,
        callbacks=[NeptuneCallback(log_checkpoints="best")],
    )

    trainer.train()

    logger.info("Training complete")


if __name__ == "__main__":
    main()

And here's the config file:

model:
  freeze_backbone: false
  checkpoint_path: null


dataset:
  train:
    csv_file: ${hydra:runtime.cwd}/data/train.csv
  eval:
    csv_file: ${hydra:runtime.cwd}/data/eval.csv


tokenizer:
  init:
    pretrained_model_name_or_path: "Unbabel/xlm-roberta-comet-small"
    truncation_side: "left"
  call:
    max_length: 512
    padding: longest
    truncation: true
    return_tensors: "pt"


training:
  output_dir: ${hydra:runtime.cwd}/checkpoints/${now:%Y-%m-%d}/${now:%H-%M}
  num_train_epochs: 50
  per_device_train_batch_size: 48
  per_device_eval_batch_size: 48
  learning_rate: 5e-5
  warmup_ratio: 0.1
  dataloader_num_workers: 8
  lr_scheduler_type: linear
  greater_is_better: true
  metric_for_best_model: "eval_global/f1-score"
  full_determinism: false
  no_cuda: false
  save_strategy: steps
  save_steps: 3000
  evaluation_strategy: steps
  eval_steps: 3000
  logging_strategy: steps
  logging_steps: 100
  optim: adamw_torch
  remove_unused_columns: false
  report_to: null
  load_best_model_at_end: true

Here's the output of my profiler with the Neptune callback:
[profiler output screenshot]

And here's the output without the Neptune callback (I explicitly called trainer.remove_callback(NeptuneCallback)):
[profiler output screenshot]

Hey @issamemari ,

I see that you have used report_to: null in your config file. It should be none instead of null. Can you update this and check if the issue is resolved?

Here's the relevant API reference: https://huggingface.co/docs/transformers/v4.30.0/en/main_classes/trainer#transformers.TrainingArguments.report_to
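
For clarity, a minimal sketch of the distinction (my understanding of transformers 4.30: a YAML null loads as Python None, which TrainingArguments treats as "all" integrations, whereas the string "none" disables them):

from transformers import TrainingArguments

# report_to=None (from YAML `null`) enables all installed integrations,
# which would add a second NeptuneCallback on top of the explicit one.
# The string "none" disables the automatic integrations.
args = TrainingArguments(output_dir="out", report_to="none")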

Thanks @SiddhantSadangi, I tried setting report_to to none as you suggested. Here's the code and config:

import logging

import hydra
from transformers import Trainer, TrainingArguments
from transformers.integrations import NeptuneCallback

from dataset import LukalficationDataset
from dataset.classes import LEVEL1, LEVEL2, LEVEL3
from metrics import MetricsLogger
from model import LukalificationModel
from preprocessing import MultiLabelOneHotEncoder, Processor

logger = logging.getLogger(__name__)


@hydra.main(config_path=".", config_name="config.yaml", version_base="1.2")
def main(config):
    logger.info("Creating dataset")
    train_dataset = LukalficationDataset(**config.dataset.train)
    eval_dataset = LukalficationDataset(**config.dataset.eval)

    logger.info("Creating preprocessor")
    processor = Processor(
        tokenizer_config=config.tokenizer,
        label_encoders=[
            MultiLabelOneHotEncoder(LEVEL1),
            MultiLabelOneHotEncoder(LEVEL2),
            MultiLabelOneHotEncoder(LEVEL3),
        ],
    )

    logger.info("Creating model")
    model = LukalificationModel([len(LEVEL1), len(LEVEL2), len(LEVEL3)], **config.model)

    training_args = TrainingArguments(**config.training)

    logger.info("Training model")
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=processor.data_collator,
        compute_metrics=MetricsLogger(
            label_encoders=processor.label_encoders
        ).compute_metrics,
        callbacks=[NeptuneCallback(log_checkpoints="best")],
    )

    trainer.train()

    logger.info("Training complete")


if __name__ == "__main__":
    main()

Config:

model:
  freeze_backbone: false
  checkpoint_path: null


dataset:
  train:
    csv_file: ${hydra:runtime.cwd}/data/train.csv
  eval:
    csv_file: ${hydra:runtime.cwd}/data/eval.csv


tokenizer:
  init:
    pretrained_model_name_or_path: "Unbabel/xlm-roberta-comet-small"
    truncation_side: "left"
  call:
    max_length: 512
    padding: longest
    truncation: true
    return_tensors: "pt"


training:
  output_dir: ${hydra:runtime.cwd}/checkpoints/${now:%Y-%m-%d}/${now:%H-%M}
  num_train_epochs: 50
  per_device_train_batch_size: 48
  per_device_eval_batch_size: 48
  learning_rate: 5e-5
  warmup_ratio: 0.1
  dataloader_num_workers: 8
  lr_scheduler_type: linear
  greater_is_better: true
  metric_for_best_model: "eval_global/f1-score"
  full_determinism: false
  no_cuda: false
  save_strategy: steps
  save_steps: 3000
  evaluation_strategy: steps
  eval_steps: 3000
  logging_strategy: steps
  logging_steps: 100
  optim: adamw_torch
  remove_unused_columns: false
  report_to: none
  load_best_model_at_end: true

The training script still seems to be slow, here's the output from the profiler:
[profiler output screenshot]

I just recently updated to the newest Neptune version with PyTorch Lightning 2.0.x and torch 1.12.1, and I see extremely slow training. The GPU usage doesn't go above 5%. When doing a debug pass with logging disabled, the GPU is used 100%.

@issamemari , this should not be happening 🤔

Can you share the size of the logged checkpoint? Or, preferably, a link to the logged run so that we can have a look?
If you are uncomfortable sharing it here, you can send it to support@neptune.ai or send it to us through the in-app chat (at the bottom right).

Hello @dscarmo ,
Could you please create a new issue with your case's details so we can have a closer look?

I don't think any checkpoints are being logged because I'm always turning the training script off before it reaches 3000 steps (it's too slow). When I look at the Artifacts tab it doesn't show me any artifacts.

Here's the run.

Hello @bryant1410 ,
Could you please create a new issue with your case's details so we can have a closer look?

I don't have time for it, unfortunately. I'm just not using Neptune for now.

Hello @dscarmo , Could you please create a new issue with your case's details so we can have a closer look?

Is there a safe way to remove all cache files and unsynced information without breaking Neptune?

I suspect this is an issue of interaction with .neptune files from older versions. I had a very old version of Neptune and had to update due to upgrading PyTorch Lightning. I also had to change from neptune-client to neptune.

If the problem persists after removing .neptune, I will make an issue.

Also, the slowdown happens before any checkpoint logging would happen.

Hello @dscarmo ,

You can use neptune clear to delete all unsynced information in the .neptune folder.

Since you also upgraded from neptune-client to neptune, I am sharing an upgrade guide that might be useful.

Please let me know if this helps.

@issamemari , I cannot reproduce the issue, but I've asked the team to take a look. I'll keep you updated 🙏

@bryant1410 - Is it possible to downgrade to the previous version of Neptune?

Hello @dscarmo ,

You can use neptune clear to delete all unsynced information in the .neptune folder.

Since you also upgraded from neptune-client to neptune, I am sharing an upgrade guide that might be useful.

Please let me know if this helps.

It seems like I fixed my issue by clearing with neptune clear, uninstalling Neptune, and installing neptune-client 0.16.16.

Note that neptune clear did not fix anything while still using 1.0.

I think others on this thread might fix their issues by using 0.16.16 too.

As additional information, here are screenshots of GPU usage for the exact same debug run (1 step in training and validation) with 1.0 and with 0.16.16:

[Screenshot: GPU usage with neptune 1.0]

[Screenshot: GPU usage with neptune-client 0.16.16]

Clearly something is heavily bottlenecking the main training loop, maybe intensive disk operations interfering with I/O? It might be the same bottleneck the issue OP found in their profiling.

@bryant1410 - Is it possible to downgrade to the previous version of Neptune?

Yes, thanks

Thanks @dscarmo , the charts would be very helpful in debugging what is happening here.
Having to downgrade is definitely not what we would want you to do, so we will get this checked and fixed ASAP!

@bryant1410 - did downgrading fix your issues as well? Which version did you downgrade to?

Thanks @dscarmo , the charts would be very helpful in debugging what is happening here. Having to downgrade is definitely not what we would want you to do, so we will get this checked and fixed ASAP!

@bryant1410 - did downgrading fix your issues as well? Which version did you downgrade to?

Haven't tried but if I do, I'll let you know

Hello @issamemari , @dscarmo , @bryant1410

We believe that changes introduced to better support forking processes in neptune==1.3.2 might be responsible for this.
Can you please try downgrading to neptune==1.3.1 and let us know if it solves the issue?

@issamemari
Do these profiling results represent real time or CPU time?

We believe that changes introduced to better support forking processes in neptune==1.3.2 might be responsible for this.
Can you please try downgrading to neptune==1.3.1 and let us know if it solves the issue?

Yeah, I can confirm this occurs to me with 1.3.2 but not with 1.3.1.

Thanks for confirming @bryant1410 🙏

The fix, for now, would be to downgrade to neptune==1.3.1.
I will update the thread once we release a fix.

Hello all,

We are having some trouble reproducing this issue.
Would it be possible for anyone to share a reproducible example? Code, data sample, hardware details, etc. would help a lot 🙏

You can send these to support@neptune.ai.

Hello all,

We are having some trouble reproducing this issue. Would it be possible for anyone to share a reproducible example? Code, data sample, hardware details, etc. would help a lot 🙏

You can send these to support@neptune.ai.

I have sent a reproduction of the issue in Colab to this e-mail.

Thanks @dscarmo 🙏

I'll follow up via mail if we need more details.

@SiddhantSadangi I confirm that the issue does not happen when using neptune==1.3.1

Just chiming in, I'm encountering the same issue while trying to train a YOLOv5 network. I'm using this commit, with custom data and a simple neptune.init_run() added to train.py to log the progress.

Training is fine with neptune-client==1.3.1 but slows to a crawl with 1.3.2. The training script uses forked data loaders.

Is there any workaround for this aside from downgrading? 1.3.2 fixed the Optuna integration for me, but this issue massively slows down training.

Hey @wouterzwerink ,
We are actively investigating this, but unfortunately, we cannot offer any workaround aside from downgrading to 1.3.1 😞

Regarding the optuna issue, was it the StringifyValue has no len() issue?
If so, could you try this workaround after downgrading to 1.3.1?

I observe the same problem! On a tiny debugging script we go from 20 seconds to 9 minutes when upgrading from 1.3.1 to 1.3.2 with num_workers>0. The massive slow-down is also present in 1.4.0rc0. Just instantiating the run, without logging anything, is enough to produce this effect. The problem is also present when disabling most tracking features:

import neptune

run = neptune.init_run(
    source_files=[],
    capture_stdout=False,
    capture_stderr=False,
    capture_traceback=False,
    fail_on_exception=False,
)

Hey all,
We have found a workaround that should work for most cases.

Instead of using fork for multiprocessing, we recommend using spawn or forkserver.
spawn is already the default on macOS and Windows. forkserver is not supported on Windows.

To use spawn/forkserver, you will need to have a single entry point for your training script:

import multiprocessing

if __name__ == "__main__":
    multiprocessing.set_start_method("spawn")
    train_fn()  # or trainer.train(), etc.

Here is a minimal example of how you can do this --> Neptune slowdown issue workaround

Note that using spawn/forkserver comes with some limitations. You can check all the limitations here.
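
For plain-PyTorch setups like some mentioned above (an assumption on my part, since the Transformers Trainer builds its own loaders), the start method can also be set per DataLoader rather than globally:

import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    dataset = TensorDataset(torch.arange(100, dtype=torch.float32))
    loader = DataLoader(
        dataset,
        batch_size=10,
        num_workers=2,
        multiprocessing_context="spawn",  # workers start with spawn instead of Linux's default fork
    )
    for _batch in loader:
        pass  # training step would go here

if __name__ == "__main__":
    main()  # the guard is required when worker processes are spawned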

If you do not want to make these changes, you can update the Neptune connection mode to "offline", and then upload the data to Neptune servers using neptune sync once training is complete.
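
A minimal sketch of the offline option (assuming a plain neptune.init_run; the logged data can be uploaded later with the neptune sync CLI command):

import neptune

# Nothing is sent to the servers during training; everything is stored locally
# under the .neptune folder until `neptune sync` is run after training.
run = neptune.init_run(mode="offline")
run["parameters/learning_rate"] = 5e-5  # example of metadata logged locally
run.stop()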

While we work on a more permanent solution, please let me know if either of these suggestions seems like a viable workaround for now 🙏

Hello all 👋

We have identified the root cause of this issue and have pushed a fix, v1.4.1rc1 🎉

With this fix, you will not need to resort to any of the above-mentioned workarounds. ✅

We will be publishing the stable version soon, but if you want to test the pre-release version, you can update neptune by running pip install neptune==1.4.1rc1.

Please let us know if this works 🙏

We just released v1.4.1, which fixes this 🥳

I'll be closing this issue now, but please feel free to let us know if you are still facing issues.