NPT-14719: Offline mode messes up plots
wouterzwerink opened this issue · comments
Describe the bug
Here's a comparison of the same plot from two different runs, where the only difference is that for the latter I used offline mode followed by `neptune sync`. Both are the result of `LearningRateMonitor` for 4 epochs.
This also happens in other plots, such as loss. It seems to duplicate values at the start and end, but sometimes also messes up in between.
It does not matter whether I use `NeptuneLogger` or log to the Neptune run directly (for loss and metrics; the callback here always uses the `NeptuneLogger`): the offline version is always messed up.
Reproduction
Use offline mode!
Expected behavior
Same plots in neptune regardless of mode
Traceback
Environment
The output of `pip list`: tried neptune 1.8.6 and 1.9.1, same results
The operating system you're using: Linux
The output of `python --version`: 3.9
Hey @wouterzwerink,
I am not able to reproduce this. Comparing the plots for async and offline runs gives me perfectly overlapping plots:
Could you share a minimal code sample that would help me reproduce this?
Thanks for looking into this. I can try to create a minimal example later.
Off the top of my head, there are a couple of things we do that may be needed to reproduce it:
- Changing the `.neptune` folder location by temporarily changing the working directory when initializing the run
- Using `run["some_prefix"]` instead of the run object directly. We'd assign `neptune_run = run["prefix"]` and then log like `neptune_run["value"].append(score)`
- Using a high flush period (900), though that should not affect offline runs
Still no luck, unfortunately.
Here's the code I used:
```python
import os

import neptune
import numpy as np

np.random.seed(42)

# Changing `.neptune` folder
original_cwd = os.getcwd()
os.chdir("temp_folder")
run = neptune.init_run(mode="offline", flush_period=900)
os.chdir(original_cwd)

# Logging to namespace_handler
neptune_run = run["prefix"]
for _ in range(100):
    neptune_run["values"].append(np.random.rand())
```
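As a side note, one way to check what actually reached disk before syncing is to list the contents of the offline data directory. This is a generic helper sketch, not part of the Neptune API; `list_offline_files` is a hypothetical name, and the on-disk layout is an internal detail of the client that may change between versions:

```python
from pathlib import Path


def list_offline_files(neptune_dir: str) -> list[str]:
    """List every file under the given `.neptune` directory, relative to it,
    together with its size in bytes."""
    root = Path(neptune_dir)
    return [
        f"{p.relative_to(root)} ({p.stat().st_size} bytes)"
        for p in sorted(root.rglob("*"))
        if p.is_file()
    ]


# e.g. list_offline_files("temp_folder/.neptune")
```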
Oh strange! How are you syncing the offline run?
We call `run.stop()` followed by a subprocess call to `neptune sync --path {path} --project {project} --offline-only`. `path` points to the changed directory, so `temp_folder` in your case.
I was doing it manually from the terminal, but let me try your approach as well
I'll take some time tomorrow to try to isolate the issue, thanks again for looking into this
Same results:

```python
import os
import subprocess

import neptune
import numpy as np

np.random.seed(42)

# Changing `.neptune` folder
path = "temp_folder"
original_cwd = os.getcwd()
os.chdir(path)
run = neptune.init_run(mode="offline", flush_period=900)
os.chdir(original_cwd)

# Logging to namespace_handler
neptune_run = run["prefix"]
for _ in range(100):
    neptune_run["values"].append(np.random.rand())

# Stop and sync manually
# (pass the command as a list so it works without shell=True)
run.stop()
subprocess.call(["neptune", "sync", "--path", path, "--offline-only"])
```
Hi @SiddhantSadangi! I have a script for you that reproduces the bug on my end:
```python
import os

import neptune
import torch
import torch.nn.functional as F
from pytorch_lightning import LightningModule, Trainer, seed_everything
from pytorch_lightning.loggers import NeptuneLogger
from torch.utils.data import DataLoader, TensorDataset

PROJECT = "project-name"


class LitModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.layer(x)
        loss = F.cross_entropy(y_hat, y)
        self.log("train/loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)


seed_everything(42)

NUM_SAMPLES = 10_000  # Bug does not happen if this is too small, e.g. 1000
x = torch.randn(NUM_SAMPLES, 32)
y = torch.randint(0, 2, (NUM_SAMPLES,))
dataset = TensorDataset(x, y)


def get_dataloader():
    # Bug does not happen if num_workers=0
    return DataLoader(dataset, batch_size=16, num_workers=4)


run = neptune.Run(
    project=PROJECT,
    mode="offline",  # Bug only happens in offline mode, not with sync or async
    flush_period=900,
    capture_hardware_metrics=False,  # Bug does not happen if these are enabled
    capture_stdout=False,
    capture_stderr=False,
    capture_traceback=False,
)

for prefix in ("prefix1", "prefix2"):
    logger = NeptuneLogger(
        run=run[prefix],
        log_model_checkpoints=False,
    )
    model = LitModel()
    dataloader = get_dataloader()
    trainer = Trainer(
        logger=logger,
        max_epochs=4,
    )
    trainer.fit(model, dataloader)

# Stop and sync manually
run.stop()
os.system(f"neptune sync --project {PROJECT} --offline-only")
```
Hey @wouterzwerink,
Thanks for sharing the script!
I am, however, still not able to reproduce the issue. I ran your script as is in offline mode, and once more with the default `async` mode, and got perfectly overlapping charts.
Is anyone else on your team facing the same issue?
@SiddhantSadangi Interesting, I can't seem to find what's causing this! What Python version are you using? I'm on 3.9.18 with the latest neptune and lightning.
I don't think anyone else is trying to use offline mode right now
I switched to WSL to use multiple dataloaders and forgot it was on Python 3.11.5. Let me try 3.9.18 too
Would it be possible for someone else in your team to try running in offline mode? It'll help us know if it's something specific to your setup, or something to do with the client in general
@SiddhantSadangi I'll ask someone else to run it too.
I found something interesting. Adding the following fixes the issue for me:

```python
def on_train_epoch_start(self) -> None:
    root_obj = self.logger.experiment.get_root_object()
    root_obj.wait(disk_only=True)
```

This fix works not only in the script, but also in my actual code.
Perhaps this is some race condition.
My `.neptune` is on AWS EFS, a network filesystem, so writing to disk may be slower on my side, which could explain why it's not reproducing on yours.
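To make the race-condition hypothesis concrete, here is a toy model of an offline log; this is an illustration, not Neptune's actual implementation, and `OfflineLog`, its batching, and `sync` are all made up. If a flush is interrupted (or raced) between writing the buffer to disk and clearing it, the next flush writes the same values again, producing exactly the kind of duplicated points described above:

```python
import json
import tempfile
from pathlib import Path


class OfflineLog:
    """Toy model of an offline metric log: values are buffered in memory
    and appended to a file on disk in batches."""

    def __init__(self, path):
        self.path = Path(path)
        self.buffer = []

    def append(self, value):
        self.buffer.append(value)

    def write_buffer(self):
        # Step 1 of a flush: persist the buffered values.
        with self.path.open("a") as f:
            for v in self.buffer:
                f.write(json.dumps(v) + "\n")

    def flush(self):
        # A correct flush writes the buffer and then clears it.
        self.write_buffer()
        self.buffer.clear()


def sync(path):
    """What syncing conceptually does here: replay the on-disk log."""
    with open(path) as f:
        return [json.loads(line) for line in f]


log_path = Path(tempfile.mkdtemp()) / "data.log"
log = OfflineLog(log_path)
for i in range(3):
    log.append(i)

# Simulate a flush interrupted after writing but before clearing the
# buffer (e.g. by slow I/O and an unsynchronized second write):
log.write_buffer()

# The next regular flush re-writes the same values...
log.flush()

# ...so replaying the log yields duplicates at the batch boundary.
print(sync(log_path))  # [0, 1, 2, 0, 1, 2]
```

This is consistent with `wait(disk_only=True)` helping: forcing each flush to complete before the next write removes the window in which the same batch can be written twice.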
So after struggling with AWS security groups, I tried running your code on EC2 with a mounted EFS volume that served as the `NEPTUNE_DATA_DIRECTORY`... but I was still unable to reproduce the issue.
I will still have the engineers take a look in case the lag between writing to memory and flushing to disk might be causing some weird issues.
@wouterzwerink - could you mail us the contents of the `.neptune` folder as a ZIP archive to support@neptune.ai?
Also, would it be possible for you to run `neptune sync` after the script has terminated? Maybe from the terminal or something?
> @wouterzwerink - could you mail us the contents of the `.neptune` folder as a ZIP archive to support@neptune.ai?

@wouterzwerink - Could you also share this?