microsoft / pai

Resource scheduling and cluster management for AI

Home Page: https://openpai.readthedocs.io

Support for different hardware configurations for different task roles of one distributed job.

siaimes opened this issue

What would you like to be added:
Support for different hardware configurations for different task roles of one distributed job.

Why is this needed:
For complex learning tasks, the programs that run on each machine can be very different, and so can their requirements for CPU, GPU, RAM, and GPU memory. At the same time, these machines need to communicate with each other to train jointly. Reinforcement learning is one example: the overall algorithm is made up of separate modules. The actor uses the GPU to generate data, the learner uses the GPU to train on that data, and the environment and MCTS use the CPU to generate data in parallel. These modules exchange data with each other in complex ways.

Without this feature, how does the current module work:
Reinforcement learning tasks like this cannot be run jointly across multiple machines.

Components that may involve changes:
Job protocol and related.

Move the vc (virtual cluster) setting down from the job level to the taskrole level:
image

Allow each taskrole to have a different skutype:
image
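
To make the request concrete, here is a minimal Python sketch of what a job with per-taskrole hardware could look like, using the reinforcement learning roles described above. It is only an illustration: the `TaskRole` class, its field names, the SKU labels (`gpu-small`, `gpu-large`, `cpu-only`), and the instance counts are all hypothetical and are not taken from the OpenPAI job protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRole:
    """Hypothetical per-taskrole settings; not the real OpenPAI schema."""
    name: str
    instances: int
    vc: str        # virtual cluster chosen per taskrole instead of per job
    sku_type: str  # hardware SKU label chosen per taskrole


def total_instances_by_sku(roles):
    """Sum the requested instances for each SKU type across all taskroles."""
    totals = {}
    for role in roles:
        totals[role.sku_type] = totals.get(role.sku_type, 0) + role.instances
    return totals


# One distributed job whose taskroles ask for different hardware.
# All values below are made up for illustration.
roles = [
    TaskRole("actor", 4, "default", "gpu-small"),
    TaskRole("learner", 1, "default", "gpu-large"),
    TaskRole("environment", 8, "default", "cpu-only"),
    TaskRole("mcts", 8, "default", "cpu-only"),
]

print(total_instances_by_sku(roles))
```

With something like this, a scheduler could place the actor and learner on GPU nodes and the environment and MCTS workers on CPU-only nodes while still treating them as one job.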