google-deepmind / reverb

Reverb is an efficient and easy-to-use data storage and transport system designed for machine learning research

How to set up reverb server and client on different machines

JCMiles opened this issue

Hello team. I have a simple cluster of 3 machines under a proprietary VLAN configuration:
the main one (address 10.10.20.1) and 2 workers (10.10.20.5 and 10.10.20.6).
On the first machine I run 2 separate processes:

  1. The Reverb server, started with 8080 as its port.
  2. The Learner class, which connects its Reverb replay buffer to localhost:8080.

Each worker performs data collection and creates a reverb.Client() with the address of the first machine, 10.10.20.1:8080. Data collection runs smoothly, and I was expecting the Reverb checkpoint to be saved on the first machine so that the Learner class could pick it up and start the training loop. Instead, checkpoints are saved on each worker separately, so the Learner idles forever waiting for data to consume. I did some research online but found only examples of how to set up Reverb on a single machine. Am I doing something wrong in the configuration?
Any help is really appreciated.
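
For reference, the setup described above can be sketched roughly as follows. This is only an illustration under assumptions, not the code from the original post: the table name, table settings, and the `checkpoint_dir` argument are hypothetical, and `reverb` is imported inside the functions so the address helper can be used on its own.

```python
# Sketch of the topology in the question. Table name, table settings and
# the checkpoint directory are illustrative assumptions.
SERVER_HOST = "10.10.20.1"  # main machine, as given in the question
SERVER_PORT = 8080
TABLE_NAME = "replay"       # hypothetical table name


def server_address(on_server_machine: bool) -> str:
    """Address a client should dial, depending on where it runs."""
    host = "localhost" if on_server_machine else SERVER_HOST
    return f"{host}:{SERVER_PORT}"


def start_server(checkpoint_dir: str):
    """Start the Reverb server; intended to run on the main machine only."""
    import reverb  # imported lazily so the helper above needs no Reverb

    return reverb.Server(
        tables=[
            reverb.Table(
                name=TABLE_NAME,
                sampler=reverb.selectors.Uniform(),
                remover=reverb.selectors.Fifo(),
                max_size=100_000,
                rate_limiter=reverb.rate_limiters.MinSize(1),
            )
        ],
        port=SERVER_PORT,
        # The directory is on the machine that runs this server process.
        checkpointer=reverb.checkpointers.DefaultCheckpointer(path=checkpoint_dir),
    )


def make_client(on_server_machine: bool):
    """Create a client: the Learner dials localhost, workers dial the main machine."""
    import reverb

    return reverb.Client(server_address(on_server_machine))
```

With this layout, only the main machine calls `start_server(...)`; the Learner would use `make_client(True)` and each worker `make_client(False)`.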

Is there any update on this, please?

Checkpointing happens on the machine running the Reverb server. The checkpointing API returns the location of the checkpoint to the client. What is the exact path your checkpoints are created under? Please provide the ls -la output for that directory.
Also, what is running at localhost:8008? As I understand it, the Reverb server listens on port 8080, so what is port 8008?
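
One way to gather this information is sketched below. It is a minimal example under assumptions: `port_is_open` is a hypothetical helper built on the standard `socket` module, and `checkpoint_path` assumes a Reverb server is reachable at the given address. `reverb.Client.checkpoint()` asks the server to write a checkpoint and returns the path the server reports.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections at host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def checkpoint_path(server_address: str) -> str:
    """Trigger a checkpoint on the server and return the path it reports."""
    import reverb  # imported lazily; only needed for this call

    client = reverb.Client(server_address)
    return client.checkpoint()
```

Running `port_is_open("localhost", 8080)` and `port_is_open("localhost", 8008)` on the main machine shows which of the two ports has a listener, and `checkpoint_path("10.10.20.1:8080")` called from a worker prints the directory to inspect with ls -la.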