How can I set up Visdom on a remote server using slurm?
neuronphysics opened this issue · comments
I want to use visdom to visualize the results of a deep learning model that has been training on a remote cluster server. First, I am wondering whether I need a special ssh command when connecting to the cluster in order to see the visdom plots.
In my slurm script I used the following command line:
python -u script.py --visdom_server "http://ncc1.clients.dur.ac.uk" --visdom_port 8098
and in my python script
# Plotting on remote server
import visdom
cfg = {"server": "ncc1.clients.dur.ac.uk",
       "port": 8098}
vis = visdom.Visdom('http://' + cfg["server"], port = cfg["port"])
win = None
def update_viz(epoch, loss, title):
    global win
    if win is None:
        title = title
        win = viz.line(
            X=np.array([epoch]),
            Y=np.array([loss]),
            win=title,
            opts=dict(
                title=title,
                fillarea=True
            )
        )
    else:
        viz.line(
            X=np.array([epoch]),
            Y=np.array([loss]),
            win=win,
            update='append'
        )
I got this error:
requests.exceptions.InvalidURL: Failed to parse: http://http::8098/env/main
Port could not be cast to integer value as ':8098'
on_close() takes 1 positional argument but 3 were given
Port could not be cast to integer value as ':8098'
on_close() takes 1 positional argument but 3 were given
Port could not be cast to integer value as ':8098'
on_close() takes 1 positional argument but 3 were given
Visdom python client failed to establish socket to get messages from the server. This feature is optional and can be disabled by initializing Visdom with `use_incoming_socket=False`, which will prevent waiting for this request to timeout.
Port could not be cast to integer value as ':8098'
on_close() takes 1 positional argument but 3 were given
Port could not be cast to integer value as ':8098'
on_close() takes 1 positional argument but 3 were given
Port could not be cast to integer value as ':8098'
on_close() takes 1 positional argument but 3 were given
Port could not be cast to integer value as ':8098'
on_close() takes 1 positional argument but 3 were given
script.py:41: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
params['w'].append(nn.Parameter(torch.tensor(Normal(torch.zeros(n_in, n_out), std * torch.ones(n_in, n_out)).rsample(), requires_grad=True, device=device)))
Port could not be cast to integer value as ':8098'
on_close() takes 1 positional argument but 3 were given
script.py:42: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
params['b'].append(nn.Parameter(torch.tensor(torch.mul(bias_init, torch.ones([n_out,])), requires_grad=True, device=device)))
script.py:292: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
return torch.exp(torch.lgamma(torch.tensor(a, dtype=torch.float, requires_grad=True).to(device=local_device)) + torch.lgamma(torch.tensor(b, dtype=torch.float, requires_grad=True).to(device=local_device)) - torch.lgamma(torch.tensor(a+b, dtype=torch.float, requires_grad=True).to(device=local_device)))
script.py:679: UserWarning: This overload of add_ is deprecated:
add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
add_(Tensor other, *, Number alpha) (Triggered internally at /opt/conda/conda-bld/pytorch_1631630815121/work/torch/csrc/utils/python_arg_parser.cpp:1025.)
exp_avg.mul_(beta1).add_(1 - beta1, grad)
Port could not be cast to integer value as ':8098'
on_close() takes 1 positional argument but 3 were given
Traceback (most recent call last):
File "script.py", line 871, in <module>
update_viz(epoch, elbo2.item(),' Loss by Epoch')
File "script.py", line 736, in update_viz
win = viz.line(
NameError: name 'viz' is not defined
How can I run my plotting script on a remote server? Is there any way to do this? Thanks.
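Separately from the address problem, the traceback ends in a NameError because the client is bound to the name `vis` while `update_viz` calls `viz.line`, and `np` is used without an import. A minimal sketch of a consistent version is below. The class name `LossPlot` is hypothetical, and it assumes the client is a `visdom.Visdom` instance (or anything with a compatible `line` method) passed in by the caller.

```python
import numpy as np


class LossPlot:
    """Hypothetical helper: appends (epoch, loss) points to one line window."""

    def __init__(self, vis, title):
        self.vis = vis      # the visdom client, e.g. visdom.Visdom(...)
        self.title = title
        self.win = None     # window id returned by the first line() call

    def update(self, epoch, loss):
        x = np.array([epoch])
        y = np.array([loss])
        if self.win is None:
            # First point: create the window and remember its id.
            self.win = self.vis.line(
                X=x, Y=y, opts=dict(title=self.title, fillarea=True)
            )
        else:
            # Later points: append to the existing window.
            self.vis.line(X=x, Y=y, win=self.win, update="append")
```

In the training loop this would be used as `plot = LossPlot(vis, "Loss by Epoch")` once, then `plot.update(epoch, elbo2.item())` per epoch, which also removes the need for a `global` variable.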
Hi @neuronphysics, one way to manage this kind of setup is with an ssh tunnel, so that you can still log to localhost at the port you tunnel. This isn't required to get a remote server working, but it does make the semantics equivalent to running the server and the plotting script on the same machine.
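As a rough sketch of the tunnel idea, the snippet below builds an `ssh -L` port-forwarding command. The helper name `tunnel_command` is hypothetical, and it assumes the visdom server listens on port 8098 on the host you ssh into (here the hostname from the question); adjust the host and port if the server runs elsewhere.

```python
def tunnel_command(ssh_host, port=8098, remote_host="localhost", user=None):
    """Hypothetical helper: build an ssh command that forwards a local port.

    Traffic to localhost:<port> on the machine running ssh is forwarded to
    <remote_host>:<port> as seen from ssh_host.
    """
    target = f"{user}@{ssh_host}" if user else ssh_host
    # -N: do not run a remote command, only forward ports
    # -L: local port forwarding, local_port:remote_host:remote_port
    return ["ssh", "-N", "-L", f"{port}:{remote_host}:{port}", target]


if __name__ == "__main__":
    # Assumed host, taken from the question; run the printed command in a shell.
    print(" ".join(tunnel_command("ncc1.clients.dur.ac.uk", port=8098)))
```

With the tunnel open, pointing the client at `http://localhost` with the same port should reach the remote server.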
That being said, it seems something isn't quite right with your underlying setup:
Failed to parse: http://http::8098/env/main
You can see here how we parse the incoming domain and configuration details:
Lines 392 to 405 in 026958a
It might be worthwhile to add some print statements to understand why we're parsing out http://http::8098/env/main as the final address, rather than the http://ncc1.clients.dur.ac.uk:8098/env/main you may expect.
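One way to do that check outside of visdom is to build the address yourself and parse it with the standard library before creating the client. This is only a sketch: `check_visdom_address` is a hypothetical helper, not part of visdom, and it does not reproduce visdom's own parsing. It simply fails early if the server and port do not combine into a sensible URL.

```python
from urllib.parse import urlsplit


def check_visdom_address(server, port):
    """Hypothetical helper: return the base URL if it parses as expected."""
    if "://" not in server:
        server = "http://" + server
    url = f"{server.rstrip('/')}:{port}"
    parts = urlsplit(url)
    # Accessing .port raises ValueError when the text after the host
    # is not a plain integer, e.g. for a netloc like "http::8098".
    host, parsed_port = parts.hostname, parts.port
    if not host or parsed_port != int(port):
        raise ValueError(
            f"unexpected address {url!r}: host={host!r}, port={parsed_port!r}"
        )
    return url


if __name__ == "__main__":
    print(check_visdom_address("http://ncc1.clients.dur.ac.uk", 8098))
```

Calling this with the exact values the script receives from `--visdom_server` and `--visdom_port` should show whether the malformed address originates in the arguments or somewhere later.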