bytedance / byteps

A high performance and generic framework for distributed DNN training

The byteps in K8S Pod doesn't have DMLC_WORKER_ID configured.

jackjinj opened this issue · comments

Describe the bug
The BytePS worker Pod in Kubernetes does not have DMLC_WORKER_ID configured, so bpslaunch complains that it cannot find the DMLC_WORKER_ID variable and errors out.

To Reproduce
Steps to reproduce the behavior:

  1. Prepare Kubernetes 1.19
  2. Install Kubeflow 1.2, which provides the MXJob operator
  3. Download the yaml from https://github.com/kubeflow/mxnet-operator/blob/master/examples/train/byteps_dist_gpu_v1.yaml
  4. kubectl apply -f byteps_dist_gpu_v1.yaml
  5. kubectl get pod:
    byteps-mxnet-job-scheduler-0 1/1 Running 0 8s
    byteps-mxnet-job-server-0 1/1 Running 0 8s
    byteps-mxnet-job-server-1 1/1 Running 0 8s
    byteps-mxnet-job-worker-0 0/1 Completed 0 8s
    byteps-mxnet-job-worker-1 0/1 Completed 0 7s

$ kubectl describe pod byteps-mxnet-job-worker-0
You can see that DMLC_WORKER_ID is missing from the environment:
DMLC_PS_ROOT_PORT: 9091
DMLC_PS_ROOT_URI: byteps-mxnet-job-scheduler-0
DMLC_NUM_SERVER: 2
DMLC_NUM_WORKER: 2
DMLC_ROLE: worker
DMLC_USE_KUBERNETES: 1
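A possible workaround (a sketch, not verified against the mxnet-operator code) is to add DMLC_WORKER_ID explicitly to each worker container's env in the yaml. The value must be a distinct index per worker, so it cannot be a single shared value unless the operator injects it:

```yaml
# Hypothetical addition to the worker container spec in
# byteps_dist_gpu_v1.yaml. Each worker needs its own index:
# "0" for byteps-mxnet-job-worker-0, "1" for byteps-mxnet-job-worker-1.
env:
  - name: DMLC_WORKER_ID
    value: "0"
```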

To reproduce this inside the Pod, you can modify the yaml so the Pod keeps running without invoking bpslaunch. Replace:

command: ["bpslaunch"]
args: ["python3", "/usr/local/byteps/example/mxnet/train_imagenet_byteps.py", "--benchmark", "1", "--batch-size=32"]

with:

command: ["/bin/bash", "-c"]
args: ["sleep 3600"]

Then apply the yaml to let the Pod run:
byteps-mxnet-job-server-0 1/1 Running 0 15s
byteps-mxnet-job-server-1 1/1 Running 0 15s
byteps-mxnet-job-worker-0 1/1 Running 0 15s
byteps-mxnet-job-worker-1 1/1 Running 0 14s

Then login as below:
$ kubectl exec -it byteps-mxnet-job-worker-0 -- bash
root@byteps-mxnet-job-worker-0:/#
root@byteps-mxnet-job-worker-0:/# env |grep DMLC_WORKER_ID
root@byteps-mxnet-job-worker-0:/# bpslaunch
BytePS launching worker
The env DMLC_WORKER_ID is missing
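The check above can be done for all variables at once; a minimal sketch, assuming the required list is the variables shown in the Pod description plus DMLC_WORKER_ID (an assumption from the error output, not the authoritative BytePS requirement list):

```shell
#!/usr/bin/env bash
# Report which of the DMLC_* variables (assumed from the Pod description
# and the bpslaunch error above) are unset in the current environment.
required="DMLC_PS_ROOT_URI DMLC_PS_ROOT_PORT DMLC_NUM_WORKER DMLC_NUM_SERVER DMLC_ROLE DMLC_WORKER_ID"
for v in $required; do
  # ${!v} is bash indirect expansion: the value of the variable named by $v
  if [ -z "${!v}" ]; then
    echo "missing: $v"
  fi
done
```

Run inside the worker Pod, this prints `missing: DMLC_WORKER_ID` for the environment shown above.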

Expected behavior
Expect to see the worker Pods running.


Environment (please complete the following information):

  • OS:
  • GCC version:
  • CUDA and NCCL version:
  • Framework (TF, PyTorch, MXNet):

Additional context

If I need to run PyTorch DDP with BytePS on a Kubernetes platform, do I still have to use the MXJob operator, or can I use the PyTorchJob operator?

Thanks

Jack