neptune-ai / neptune-client

πŸ“˜ The MLOps stack component for experiment tracking

Home Page: https://neptune.ai

NPT-14719: Offline mode messes up plots

wouterzwerink opened this issue

Describe the bug

Here's a comparison of the same plot from two different runs, where the only difference is that for the latter I used offline mode and ran neptune sync afterwards. Both come from LearningRateMonitor over 4 epochs:
[screenshots: the same learning-rate plot from the async run and from the offline + synced run]

This also happens in other plots like loss. It seems to duplicate values at the start and end, but sometimes also messes up in between:
[screenshot: another affected plot]

It does not matter whether I use NeptuneLogger or log to the Neptune run directly (for loss or metrics; the callback here always uses the NeptuneLogger): the offline version is always messed up.

Reproduction

Use offline mode!
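Roughly, the workflow that triggers it: create the run in offline mode, log as usual, then sync afterwards. A minimal sketch (the metric name is only illustrative; the actual training setup is described further down):

import neptune

# Create the run locally only; nothing is sent to the server yet
run = neptune.init_run(mode="offline")

# Log a series of values (illustrative metric name)
for step in range(100):
    run["metrics/example"].append(float(step))

run.stop()

# Afterwards, sync the locally stored data from the terminal:
#   neptune sync --offline-only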

Expected behavior

Same plots in neptune regardless of mode

Environment

The output of pip list:
Tried neptune 1.8.6 and 1.9.1, same results

The operating system you're using:
Linux

The output of python --version:
3.9

Hey @wouterzwerink πŸ‘‹

I am not able to reproduce this. Comparing the plots for async and offline runs gives me perfectly overlapping plots:
[screenshot: perfectly overlapping async and offline plots]

Could you share a minimal code sample that would help me reproduce this?

Thanks for looking into this. I can try to create a minimal example later.
Off the top of my head, there are a couple of things we do that may be needed to reproduce:

  • Changing the .neptune folder location by temporarily changing directory when initializing the run
  • Using run["some_prefix"] instead of the run object. So we'd assign neptune_run = run["prefix"] and then log like neptune_run["value"].append(score)
  • Using a high flush period (900), though that should not affect offline runs

Still no luck, unfortunately 😞
[screenshot: overlapping plots from the reproduction attempt]

Here's the code I used:

import os

import neptune
import numpy as np

np.random.seed(42)

# Changing `.neptune` folder
original_cwd = os.getcwd()
os.chdir("temp_folder")

run = neptune.init_run(mode="offline", flush_period=900)

os.chdir(original_cwd)

# Logging to namespace_handler
neptune_run = run["prefix"]

for _ in range(100):
    neptune_run["values"].append(np.random.rand())

Oh strange! How are you syncing the offline run?
We call run.stop() followed by a subprocess call to neptune sync --path {path} --project {project} --offline-only

path points to the changed directory, so temp_folder in your case
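For reference, that call sketched with an argument list instead of a single command string (the path and project values here are placeholders):

import subprocess

# Equivalent to the command above; passing a list avoids shell quoting issues
subprocess.run(
    ["neptune", "sync", "--path", "temp_folder", "--project", "my-workspace/my-project", "--offline-only"],
    check=True,
)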

I was doing it manually from the terminal, but let me try your approach as well

I'll take some time tomorrow to try to isolate the issue. Thanks again for looking into this.

Same results:

import os
import subprocess

import neptune
import numpy as np

np.random.seed(42)

# Changing `.neptune` folder
path = "temp_folder"
original_cwd = os.getcwd()
os.chdir(path)

run = neptune.init_run(mode="offline", flush_period=900)

os.chdir(original_cwd)

# Logging to namespace_handler
neptune_run = run["prefix"]

for _ in range(100):
    neptune_run["values"].append(np.random.rand())

# Stop and sync manually
run.stop()

subprocess.call(f"neptune sync --path {path} --offline-only", shell=True)  # shell=True so the command string is parsed by the shell

Hi @SiddhantSadangi! I have a script for you that reproduces the bug on my end:

import os

import neptune
import torch
import torch.nn.functional as F
from pytorch_lightning import LightningModule, Trainer, seed_everything
from pytorch_lightning.loggers import NeptuneLogger
from torch.utils.data import DataLoader, TensorDataset

PROJECT = "project-name"


class LitModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.layer(x)
        loss = F.cross_entropy(y_hat, y)
        self.log("train/loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)


seed_everything(42)


NUM_SAMPLES = 10_000  # Bug does not happen if this is too small, e.g. 1000
x = torch.randn(NUM_SAMPLES, 32)
y = torch.randint(0, 2, (NUM_SAMPLES,))
dataset = TensorDataset(x, y)


def get_dataloader():
    # Bug does not happen if num_workers=0
    return DataLoader(dataset, batch_size=16, num_workers=4)


run = neptune.Run(
    project=PROJECT,
    mode="offline",  # Bug only happens in offline mode, not with sync or async
    flush_period=900,
    capture_hardware_metrics=False,  # Bug does not happen if these are enabled
    capture_stdout=False,
    capture_stderr=False,
    capture_traceback=False,
)


for prefix in ("prefix1", "prefix2"):
    logger = NeptuneLogger(
        run=run[prefix],
        log_model_checkpoints=False,
    )

    model = LitModel()
    dataloader = get_dataloader()
    trainer = Trainer(
        logger=logger,
        max_epochs=4,
    )
    trainer.fit(model, dataloader)

# Stop and sync manually
run.stop()
os.system(f"neptune sync --project {PROJECT} --offline-only")

Hey @wouterzwerink ,
Thanks for sharing the script!

I am, however, still not able to reproduce the issue. I ran your script as-is in offline mode, and once more with the default async mode, and got perfectly overlapping charts:
[screenshot: overlapping offline and async charts]

Is anyone else in your team also facing the same issue?

@SiddhantSadangi Interesting, I can't seem to find what's causing this! What Python version are you using? I'm on 3.9.18 with the latest neptune and lightning.
I don't think anyone else is trying to use offline mode right now.

I switched to WSL to use multiple dataloaders and forgot it was on Python 3.11.5. Let me try 3.9.18 too

Same results with Python 3.9.18, neptune 1.9.1, and lightning 2.2.1

[screenshot: overlapping charts]

Would it be possible for someone else in your team to try running in offline mode? It'll help us know if it's something specific to your setup, or something to do with the client in general

@SiddhantSadangi I'll ask someone else to run it too.

I found something interesting. Adding the following fixes the issue for me:

    # Added to the LightningModule: before each epoch, wait until all queued
    # Neptune operations have been written to local disk
    def on_train_epoch_start(self) -> None:
        root_obj = self.logger.experiment.get_root_object()
        root_obj.wait(disk_only=True)

This fix works not only in the script, but also in my actual code.
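In case it helps, the same workaround sketched as a standalone Lightning callback instead of a method on the LightningModule (the class name is just illustrative, and this assumes the logger wraps a namespace handler, as in the script above):

from pytorch_lightning import Callback


class WaitForNeptuneDiskFlush(Callback):
    # Before each training epoch, block until all queued Neptune operations
    # have been written to local disk (same call as the fix above)
    def on_train_epoch_start(self, trainer, pl_module):
        logger = trainer.logger
        if logger is not None:
            logger.experiment.get_root_object().wait(disk_only=True)

It can then be passed to the Trainer via callbacks=[WaitForNeptuneDiskFlush()].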

Perhaps this is some race condition.
My .neptune folder is on AWS EFS, a network filesystem, so writing to disk may be slower on my side, which could explain why it's not reproducing on yours.
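For anyone trying to emulate the slow-filesystem setup, a minimal sketch of redirecting Neptune's local storage to a network-mounted directory via the NEPTUNE_DATA_DIRECTORY environment variable (the mount path is just an example):

import os

# Point Neptune's local .neptune storage at a slow/network-mounted directory.
# Set the variable before importing neptune so it is definitely picked up.
os.environ["NEPTUNE_DATA_DIRECTORY"] = "/mnt/efs/neptune-data"

import neptune

run = neptune.init_run(mode="offline")
run["values"].append(0.0)
run.stop()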

So, after struggling with AWS security groups, I tried running your code on EC2 with a mounted EFS volume serving as the NEPTUNE_DATA_DIRECTORY, but I was still unable to reproduce the issue πŸ˜”

[screenshot: overlapping charts from the EC2 + EFS run]

I will still have the engineers take a look, in case the lag between writing to memory and flushing to disk is causing some weird issues.

@wouterzwerink - could you mail us the contents of the .neptune folder as a ZIP archive to support@neptune.ai?

Also, would it be possible for you to run neptune sync after the script has terminated? Maybe from the terminal or something?

Sure thing! Just did neptune clear, then ran the script with the os.system call removed, then synced manually with neptune sync. Results are the same:
[screenshot: plots still showing the issue]

@wouterzwerink - Could you also share the contents of the .neptune folder as a ZIP archive to support@neptune.ai?