google / paxml

Pax is a JAX-based machine learning framework for training large-scale models. Pax allows for advanced and fully configurable experimentation and parallelization, and has demonstrated industry-leading model FLOP utilization rates.

DEADLINE_EXCEEDED on 1024 GPUs.

mhugues opened this issue

Additional GRPC error information from remote target unknown_target_for_coordination_leader while calling /tensorflow.CoordinationService/RegisterTask:
:{"created":"@1712965181.656280441","description":"Deadline Exceeded","file":"external/com_github_grpc_grpc/src/core/ext/filters/deadline/deadline_filter.cc","file_line":69,"grpc_status":4}
2024-04-12 23:39:41.656900: E external/xla/xla/pjrt/distributed/client.cc:96] Coordination service agent in error status: DEADLINE_EXCEEDED: Deadline Exceeded

Has anyone else seen this issue?
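
For anyone triaging the log above, here is a minimal sketch that decodes the JSON blob attached to the gRPC error. It only uses the Python standard library. The helper name `decode_grpc_error` and the small status table are illustrative assumptions, not part of Pax, JAX, or gRPC tooling. The mapping of status 4 to DEADLINE_EXCEEDED comes from the standard gRPC status codes.

```python
import json
from datetime import datetime, timezone

# Subset of the standard gRPC status codes (illustrative, not exhaustive).
GRPC_STATUS_NAMES = {0: "OK", 4: "DEADLINE_EXCEEDED", 14: "UNAVAILABLE"}

# Payload copied from the log above, including its leading ':'.
RAW = (
    ':{"created":"@1712965181.656280441","description":"Deadline Exceeded",'
    '"file":"external/com_github_grpc_grpc/src/core/ext/filters/deadline/deadline_filter.cc",'
    '"file_line":69,"grpc_status":4}'
)


def decode_grpc_error(payload: str) -> dict:
    """Hypothetical helper: parse the gRPC error JSON into readable fields."""
    info = json.loads(payload.lstrip(":"))
    # "created" is '@' followed by seconds since the Unix epoch.
    created = float(info["created"].lstrip("@"))
    created_utc = datetime.fromtimestamp(created, tz=timezone.utc)
    status = info["grpc_status"]
    return {
        "status_code": status,
        "status_name": GRPC_STATUS_NAMES.get(status, "UNKNOWN_TO_THIS_TABLE"),
        "description": info["description"],
        "created_utc": created_utc.isoformat(timespec="seconds"),
        "file_line": info["file_line"],
    }


decoded = decode_grpc_error(RAW)
for key, value in decoded.items():
    print(f"{key}: {value}")
```

The decoded `created` field falls on 2024-04-12 at 23:39:41 UTC, which lines up with the timestamp on the `client.cc` log line if that host logs in UTC. In other words, the deadline error and the coordination agent error were reported at the same moment, during the `RegisterTask` call.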