neptune-ai / neptune-client

πŸ“˜ The MLOps stack component for experiment tracking

Home Page: https://neptune.ai

NPT-14719: Offline mode messes up plots

wouterzwerink opened this issue

Describe the bug

Here's a comparison of the same plot from two different runs, where the only difference is that for the latter I used offline mode and ran neptune sync afterwards. Both come from LearningRateMonitor over 4 epochs:
[screenshots: the same learning-rate plot from the async run and from the offline + synced run]

This also happens in other plots like loss. It seems to duplicate values at the start and end, but sometimes also messes up in between:
[screenshot: another affected plot]

It does not matter whether I use NeptuneLogger or log to the Neptune run directly (for loss or metrics; the callback here always uses the NeptuneLogger): the offline version is always messed up.

Reproduction

Use offline mode!
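Roughly, the workflow that triggers it: create the run in offline mode, log as usual, then sync afterwards. A minimal sketch (the metric name is only illustrative; the actual training setup is described further down):

import neptune

# Create the run locally only; nothing is sent to the server yet
run = neptune.init_run(mode="offline")

# Log a series of values (illustrative metric name)
for step in range(100):
    run["metrics/example"].append(float(step))

run.stop()

# Afterwards, sync the locally stored data from the terminal:
#   neptune sync --offline-only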

Expected behavior

Same plots in neptune regardless of mode

Environment

The output of pip list:
Tried neptune 1.8.6 and 1.9.1, same results

The operating system you're using:
Linux

The output of python --version:
3.9

Hey @wouterzwerink πŸ‘‹

I am not able to reproduce this. Comparing the plots for async and offline runs gives me perfectly overlapping plots:
[screenshot: perfectly overlapping async and offline plots]

Could you share a minimal code sample that would help me reproduce this?

Thanks for looking into this. I can try to create a minimal example later.
Off the top of my head, there are a couple of things we do that may be needed to reproduce:

  • Changing the .neptune folder location by temporarily changing directory when initializing the run
  • Using run["some_prefix"] instead of the run object. So we'd assign neptune_run = run["prefix"] and then log like neptune_run["value"].append(score)
  • Using a high flush period (900), though that should not affect offline runs

Still no luck, unfortunately 😞
[screenshot: overlapping plots from the reproduction attempt]

Here's the code I used:

import os

import neptune
import numpy as np

np.random.seed(42)

# Changing `.neptune` folder
original_cwd = os.getcwd()
os.chdir("temp_folder")

run = neptune.init_run(mode="offline", flush_period=900)

os.chdir(original_cwd)

# Logging to namespace_handler
neptune_run = run["prefix"]

for _ in range(100):
    neptune_run["values"].append(np.random.rand())

Oh strange! How are you syncing the offline run?
We call run.stop() followed by a subprocess call to neptune sync --path {path} --project {project} --offline-only

path points to the changed directory, so temp_folder in your case
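For reference, that call sketched with an argument list instead of a single command string (the path and project values here are placeholders):

import subprocess

# Equivalent to the command above; passing a list avoids shell quoting issues
subprocess.run(
    ["neptune", "sync", "--path", "temp_folder", "--project", "my-workspace/my-project", "--offline-only"],
    check=True,
)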

I was doing it manually from the terminal, but let me try your approach as well

I'll take some time tomorrow to try to isolate the issue. Thanks again for looking into this.

Same results:

import os
import subprocess

import neptune
import numpy as np

np.random.seed(42)

# Changing `.neptune` folder
path = "temp_folder"
original_cwd = os.getcwd()
os.chdir(path)

run = neptune.init_run(mode="offline", flush_period=900)

os.chdir(original_cwd)

# Logging to namespace_handler
neptune_run = run["prefix"]

for _ in range(100):
    neptune_run["values"].append(np.random.rand())

# Stop and sync manually
run.stop()

subprocess.call(f"neptune sync --path {path} --offline-only", shell=True)  # shell=True so the command string is parsed by the shell

Hi @SiddhantSadangi! I have a script for you that reproduces the bug on my end:

import os

import neptune
import torch
import torch.nn.functional as F
from pytorch_lightning import LightningModule, Trainer, seed_everything
from pytorch_lightning.loggers import NeptuneLogger
from torch.utils.data import DataLoader, TensorDataset

PROJECT = "project-name"


class LitModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.layer(x)
        loss = F.cross_entropy(y_hat, y)
        self.log("train/loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)


seed_everything(42)


NUM_SAMPLES = 10_000  # Bug does not happen if this is too small, e.g. 1000
x = torch.randn(NUM_SAMPLES, 32)
y = torch.randint(0, 2, (NUM_SAMPLES,))
dataset = TensorDataset(x, y)


def get_dataloader():
    # Bug does not happen if num_workers=0
    return DataLoader(dataset, batch_size=16, num_workers=4)


run = neptune.Run(
    project=PROJECT,
    mode="offline",  # Bug only happens in offline mode, not with sync or async
    flush_period=900,
    capture_hardware_metrics=False,  # Bug does not happen if these are enabled
    capture_stdout=False,
    capture_stderr=False,
    capture_traceback=False,
)


for prefix in ("prefix1", "prefix2"):
    logger = NeptuneLogger(
        run=run[prefix],
        log_model_checkpoints=False,
    )

    model = LitModel()
    dataloader = get_dataloader()
    trainer = Trainer(
        logger=logger,
        max_epochs=4,
    )
    trainer.fit(model, dataloader)

# Stop and sync manually
run.stop()
os.system(f"neptune sync --project {PROJECT} --offline-only")

Hey @wouterzwerink ,
Thanks for sharing the script!

I am, however, still not able to reproduce the issue. I ran your script as-is in offline mode, and once more with the default async mode, and got perfectly overlapping charts:
[screenshot: overlapping offline and async charts]

Is anyone else in your team also facing the same issue?

@SiddhantSadangi Interesting, I can't seem to find what's causing this! What Python version are you using? I'm on 3.9.18 with the latest neptune and lightning.
I don't think anyone else is trying to use offline mode right now.

I switched to WSL to use multiple dataloaders and forgot it was on Python 3.11.5. Let me try 3.9.18 too

Same results with Python 3.9.18, neptune 1.9.1, and lightning 2.2.1

[screenshot: overlapping charts]

Would it be possible for someone else in your team to try running in offline mode? It'll help us know if it's something specific to your setup, or something to do with the client in general

@SiddhantSadangi I'll ask someone else to run it too.

I found something interesting. Adding the following fixes the issue for me:

    # Added to the LightningModule: before each epoch, wait until all queued
    # Neptune operations have been written to local disk
    def on_train_epoch_start(self) -> None:
        root_obj = self.logger.experiment.get_root_object()
        root_obj.wait(disk_only=True)

This fix works not only in the script, but also in my actual code.
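In case it helps, the same workaround sketched as a standalone Lightning callback instead of a method on the LightningModule (the class name is just illustrative, and this assumes the logger wraps a namespace handler, as in the script above):

from pytorch_lightning import Callback


class WaitForNeptuneDiskFlush(Callback):
    # Before each training epoch, block until all queued Neptune operations
    # have been written to local disk (same call as the fix above)
    def on_train_epoch_start(self, trainer, pl_module):
        logger = trainer.logger
        if logger is not None:
            logger.experiment.get_root_object().wait(disk_only=True)

It can then be passed to the Trainer via callbacks=[WaitForNeptuneDiskFlush()].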

Perhaps this is some race condition.
My .neptune folder is on AWS EFS, a network filesystem, so writing to disk may be slower on my side, which could explain why it's not reproducing on yours.
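For anyone trying to emulate the slow-filesystem setup, a minimal sketch of redirecting Neptune's local storage to a network-mounted directory via the NEPTUNE_DATA_DIRECTORY environment variable (the mount path is just an example):

import os

# Point Neptune's local .neptune storage at a slow/network-mounted directory.
# Set the variable before importing neptune so it is definitely picked up.
os.environ["NEPTUNE_DATA_DIRECTORY"] = "/mnt/efs/neptune-data"

import neptune

run = neptune.init_run(mode="offline")
run["values"].append(0.0)
run.stop()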

So, after struggling with AWS security groups, I tried running your code on EC2 with a mounted EFS volume serving as the NEPTUNE_DATA_DIRECTORY, but I was still unable to reproduce the issue πŸ˜”

[screenshot: overlapping charts from the EC2 + EFS run]

I will still have the engineers take a look, in case the lag between writing to memory and flushing to disk is causing some weird issues.

@wouterzwerink - could you mail us the contents of the .neptune folder as a ZIP archive to support@neptune.ai?

Also, would it be possible for you to run neptune sync after the script has terminated? Maybe from the terminal or something?

Sure thing! Just did neptune clear, then ran the script with the os.system call removed, then synced manually with neptune sync. Results are the same:
[screenshot: plots still showing the issue]

@wouterzwerink - Could you also share the contents of the .neptune folder as a ZIP archive to support@neptune.ai?