vwxyzjn / cleanba

CleanRL's implementation of DeepMind's Podracer Sebulba Architecture for Distributed DRL

Segfault

vwxyzjn opened this issue

python -m cleanrl_utils.benchmark \
    --env-ids Breakout-v5 \
    --command "poetry run python cleanba/cleanba_ppo_envpool_impala_atari_wrapper.py --exp-name cleanba_ppo_envpool_impala_atari_wrapper_a0_l1+2+3_d32 --distributed --total-timesteps 100000000 --anneal-lr False --learner-device-ids 1 2 3 --track --wandb-project-name cleanba" \
    --num-seeds 1 \
    --workers 3 \
    --slurm-gpus-per-task 4 \
    --slurm-ntasks 32 \
    --slurm-total-cpus 960 \
    --slurm-template-path cleanba.slurm_template

Running the command above produces the following error, although it does not reproduce on every run.

Running task 0 with env_id: Breakout-v5 and seed: 1
wandb: Currently logged in as: costa-huang (openrlbenchmark). Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.13.10
wandb: Run data is saved locally in /admin/home-costa/cleanba/wandb/run-20230307_160455-ir6iswuf
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run Breakout-v5__cleanba_ppo_envpool_impala_atari_wrapper_a0_l1+2+3_d32__1__455adc4b-68f2-49b5-b08b-85cead7657d8
wandb: ⭐️ View project at https://wandb.ai/openrlbenchmark/cleanba
wandb: 🚀 View run at https://wandb.ai/openrlbenchmark/cleanba/runs/ir6iswuf
srun: error: ip-26-0-141-178: task 15: Segmentation fault
srun: error: ip-26-0-141-217: task 16: Segmentation fault
srun: error: ip-26-0-141-217: task 17: Segmentation fault
2023-03-07 16:06:30.684022: E external/org_tensorflow/tensorflow/tsl/distributed_runtime/coordination/coordination_service.cc:956] /job:jax_worker/replica:0/task:15 has been set to ERROR in coordination service: UNAVAILABLE: Task /job:jax_worker/replica:0/task:15 heartbeat timeout. This indicates that the remote task has failed, got preempted, or crashed unexpectedly. [type.googleapis.com/tensorflow.CoordinationServiceError='']
2023-03-07 16:06:30.684105: E external/org_tensorflow/tensorflow/tsl/distributed_runtime/coordination/coordination_service.cc:956] /job:jax_worker/replica:0/task:16 has been set to ERROR in coordination service: UNAVAILABLE: Task /job:jax_worker/replica:0/task:16 heartbeat timeout. This indicates that the remote task has failed, got preempted, or crashed unexpectedly. [type.googleapis.com/tensorflow.CoordinationServiceError='']
2023-03-07 16:06:30.684126: E external/org_tensorflow/tensorflow/tsl/distributed_runtime/coordination/coordination_service.cc:956] /job:jax_worker/replica:0/task:17 has been set to ERROR in coordination service: UNAVAILABLE: Task /job:jax_worker/replica:0/task:17 heartbeat timeout. This indicates that the remote task has failed, got preempted, or crashed unexpectedly. [type.googleapis.com/tensorflow.CoordinationServiceError='']
2023-03-07 16:06:30.684140: E external/org_tensorflow/tensorflow/tsl/distributed_runtime/coordination/coordination_service.cc:411] Stopping coordination service as heartbeat has timed out for /job:jax_worker/replica:0/task:15 and there is no service-to-client connection
2023-03-07 16:07:19.683537: E external/org_tensorflow/tensorflow/tsl/distributed_runtime/coordination/coordination_service_agent.cc:711] Coordination agent is in ERROR: INVALID_ARGUMENT: Unexpected task request with task_name=/job:jax_worker/replica:0/task:0
Additional GRPC error information from remote target unknown_target_for_coordination_leader:
:{"created":"@1678205239.683296695","description":"Error received from peer ipv4:26.0.141.128:61939","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unexpected task request with task_name=/job:jax_worker/replica:0/task:0","grpc_status":3} [type.googleapis.com/tensorflow.CoordinationServiceError='']
2023-03-07 16:07:19.683623: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/distributed/client.cc:452] Coordination service agent in error status: INVALID_ARGUMENT: Unexpected task request with task_name=/job:jax_worker/replica:0/task:0
Additional GRPC error information from remote target unknown_target_for_coordination_leader:
:{"created":"@1678205239.683296695","description":"Error received from peer ipv4:26.0.141.128:61939","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unexpected task request with task_name=/job:jax_worker/replica:0/task:0","grpc_status":3} [type.googleapis.com/tensorflow.CoordinationServiceError='']
2023-03-07 16:07:19.683670: F external/org_tensorflow/tensorflow/compiler/xla/pjrt/distributed/client.h:75] Terminating process because the coordinator detected missing heartbeats. This most likely indicates that another task died; see the other task logs for more details. Status: INVALID_ARGUMENT: Unexpected task request with task_name=/job:jax_worker/replica:0/task:0
Additional GRPC error information from remote target unknown_target_for_coordination_leader:
:{"created":"@1678205239.683296695","description":"Error received from peer ipv4:26.0.141.128:61939","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unexpected task request with task_name=/job:jax_worker/replica:0/task:0","grpc_status":3} [type.googleapis.com/tensorflow.CoordinationServiceError='']
2023-03-07 16:07:20.194646: E external/org_tensorflow/tensorflow/tsl/distributed_runtime/coordination/coordination_service_agent.cc:711] Coordination agent is in ERROR: UNAVAILABLE: failed to connect to all addresses
Additional GRPC error information from remote target unknown_target_for_coordination_leader:
:{"created":"@1678205240.194404984","description":"Failed to pick subchannel","file":"external/com_github_grpc_grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3940,"referenced_errors":[{"created":"@1678205240.190710821","description":"failed to connect to all addresses","file":"external/com_github_grpc_grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":392,"grpc_status":14}]}
2023-03-07 16:07:20.194713: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/distributed/client.cc:452] Coordination service agent in error status: UNAVAILABLE: failed to connect to all addresses
2023-03-07 16:07:20.193225: E external/org_tensorflow/tensorflow/tsl/distributed_runtime/coordination/coordination_service_agent.cc:711] Coordination agent is in ERROR: UNAVAILABLE: failed to connect to all addresses
Additional GRPC error information from remote target unknown_target_for_coordination_leader:
:{"created":"@1678205240.193003968","description":"Failed to pick subchannel","file":"external/com_github_grpc_grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3940,"referenced_errors":[{"created":"@1678205240.189222724","description":"failed to connect to all addresses","file":"external/com_github_grpc_grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":392,"grpc_status":14}]}
2023-03-07 16:07:20.193273: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/distributed/client.cc:452] Coordination service agent in error status: UNAVAILABLE: failed to connect to all addresses
srun: error: ip-26-0-141-247: task 21: Aborted
srun: error: ip-26-0-142-13: task 24: Aborted
srun: error: ip-26-0-141-178: task 14: Aborted
srun: error: ip-26-0-141-146: task 6: Aborted
srun: error: ip-26-0-142-24: task 29: Aborted
srun: error: ip-26-0-142-29: task 30: Aborted
srun: error: ip-26-0-141-132: task 3: Aborted
srun: error: ip-26-0-141-166: task 13: Aborted
srun: error: ip-26-0-141-140: task 5: Aborted
srun: error: ip-26-0-142-3: task 22: Aborted
srun: error: ip-26-0-142-21: task 27: Aborted
srun: error: ip-26-0-141-247: task 20: Aborted
srun: error: ip-26-0-141-161: task 10: Aborted
srun: error: ip-26-0-142-29: task 31: Aborted
srun: error: ip-26-0-141-157: task 9: Aborted
srun: error: ip-26-0-141-166: task 12: Aborted
srun: error: ip-26-0-142-13: task 25: Aborted
srun: error: ip-26-0-142-24: task 28: Aborted
srun: error: ip-26-0-141-146: task 7: Aborted
srun: error: ip-26-0-141-132: task 2: Aborted
srun: error: ip-26-0-141-140: task 4: Aborted
srun: error: ip-26-0-142-3: task 23: Aborted
srun: error: ip-26-0-141-228: task 19: Aborted
srun: error: ip-26-0-142-21: task 26: Aborted
srun: error: ip-26-0-141-161: task 11: Aborted
srun: error: ip-26-0-141-157: task 8: Aborted
srun: error: ip-26-0-141-228: task 18: Aborted
srun: error: ip-26-0-141-128: task 1: Aborted
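
In the log above, only tasks 15, 16 and 17 report a segmentation fault. The remaining tasks abort afterwards, which matches the coordinator's own message that a missing heartbeat "most likely indicates that another task died". To separate the tasks that crashed from the ones that were merely torn down, the `srun` error lines can be grouped by failure reason. The following is a minimal sketch rather than part of cleanba: the helper name `group_srun_failures` is hypothetical, and the regular expression assumes the `srun: error: <host>: task <id>: <reason>` layout seen in this log.

```python
import re
from collections import defaultdict

# Assumed line layout, taken from the log above:
#   srun: error: ip-26-0-141-178: task 15: Segmentation fault
SRUN_ERROR = re.compile(
    r"^srun: error: (?P<host>\S+): task (?P<task>\d+): (?P<reason>.+)$"
)


def group_srun_failures(log_text):
    """Group srun error lines by reason.

    Returns a dict mapping each reason (e.g. "Segmentation fault") to a
    sorted list of (task_id, host) tuples. Hypothetical helper for triage.
    """
    failures = defaultdict(list)
    for line in log_text.splitlines():
        match = SRUN_ERROR.match(line.strip())
        if match:
            failures[match["reason"]].append((int(match["task"]), match["host"]))
    return {reason: sorted(tasks) for reason, tasks in failures.items()}


# A few lines copied from the log above, used as sample input.
sample = """\
srun: error: ip-26-0-141-178: task 15: Segmentation fault
srun: error: ip-26-0-141-217: task 16: Segmentation fault
srun: error: ip-26-0-141-217: task 17: Segmentation fault
srun: error: ip-26-0-141-247: task 21: Aborted
srun: error: ip-26-0-141-128: task 1: Aborted
"""

summary = group_srun_failures(sample)
for reason, tasks in summary.items():
    print(reason, tasks)
```

Run against the full job output, the "Segmentation fault" group narrows the search to two hosts, ip-26-0-141-178 and ip-26-0-141-217, whose per-task logs are the natural place to look next.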