microsoft / pai

Resource scheduling and cluster management for AI

Home Page:https://openpai.readthedocs.io

GPU fairness usage

scarlett2018 opened this issue

Scenario
Low-utilization jobs can block other users' jobs from being submitted. We'd like a service plugin that can:

  • detect the utilization of all jobs
  • notify users whose jobs have had low utilization over the past few days (default: below 20% over 5 days, customizable)
  • if the user can justify the job's usage, an admin can extend the job's lifetime; otherwise, the low-utilization job is killed automatically after 1 day.
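
The notify / extend / kill flow above could be sketched roughly as follows. This is only an illustration, not the plugin's actual code: the `JobUsage` record, its field names, and `decide_action` are hypothetical, and the 20% threshold and 1-day grace period are the defaults mentioned in the list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class JobUsage:
    """Hypothetical per-job record; field names are assumptions."""
    name: str
    user: str
    avg_gpu_util: float                         # average GPU utilization (%) over the window
    notified_at: Optional[datetime] = None      # when the user was warned, if ever
    extended_until: Optional[datetime] = None   # admin-approved extension, if any


def decide_action(job: JobUsage, now: datetime,
                  threshold: float = 20.0,
                  grace: timedelta = timedelta(days=1)) -> str:
    """Return one of "ok", "notify", "wait", "kill" for a single job."""
    if job.avg_gpu_util >= threshold:
        return "ok"        # utilization is acceptable
    if job.extended_until is not None and now < job.extended_until:
        return "ok"        # admin accepted the user's justification
    if job.notified_at is None:
        return "notify"    # first detection: warn the user
    if now - job.notified_at >= grace:
        return "kill"      # grace period expired without an extension
    return "wait"          # already warned, still inside the grace period


now = datetime(2020, 3, 17, 12, 0)
idle = JobUsage("train-a", "alice", avg_gpu_util=5.0)
print(decide_action(idle, now))  # notify
```

Keeping the decision a pure function of the job record and the current time makes it straightforward to unit-test the policy separately from whatever service collects the utilization metrics.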

An alternative implementation: provide an incentive model that gives users bonus tokens, and let each user decide how to spend their GPU hours.

  • Add job start time and GPU hours to the job utilization report. Currently, rest-server returns only the job submission time and completion time, not the time the job started running. Refer to #4295
  • Change user GPU utilization to a weighted average. The REST API currently returns job duration as completion time minus submission time rather than completion time minus start-running time, so the weighted average might not be correct. Refer to #4295
  • Add the date to the email notification's title, e.g. change "pai cluster utilization" to "pai cluster utilization - 3.17"
  • Add a status column showing the job status at the time the report is generated
  • Add a GPU count column for the GPU used by the job
  • Enable a debugging mode for the debug VC.
    (debugging mode: users can SSH into the node and use it for debugging for 1~2 hours; the system automatically disconnects the node when the time is up.)

  • Prototype for enabling a cluster-level policy for job management
    (the prototype: disable the SSH port; jobs are automatically killed if their utilization stays below 20% for 1~2 hours)
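
As a rough illustration of the weighted-average item above, the sketch below weights each job's average GPU utilization by its GPU hours. It assumes the job's start-running time is available (which, per the issue, rest-server did not yet return); the function names and the `(utilization, gpu_hours)` tuple layout are hypothetical.

```python
from datetime import datetime
from typing import Iterable, Tuple


def gpu_hours(start_running: datetime, completed: datetime, gpu_count: int) -> float:
    """GPU hours consumed, measured from start-running time (not submission time)."""
    return gpu_count * (completed - start_running).total_seconds() / 3600.0


def weighted_utilization(jobs: Iterable[Tuple[float, float]]) -> float:
    """Average utilization (%) weighted by GPU hours.

    `jobs` yields (avg_util_percent, gpu_hours) pairs, one per job.
    """
    jobs = list(jobs)
    total_hours = sum(hours for _, hours in jobs)
    if total_hours == 0:
        return 0.0  # no GPU time consumed, so nothing to average
    return sum(util * hours for util, hours in jobs) / total_hours


# Job A: 2 GPUs for 3 hours at 10%; job B: 1 GPU for 2 hours at 50%.
a = gpu_hours(datetime(2020, 3, 17, 9), datetime(2020, 3, 17, 12), gpu_count=2)
b = gpu_hours(datetime(2020, 3, 17, 9), datetime(2020, 3, 17, 11), gpu_count=1)
print(weighted_utilization([(10.0, a), (50.0, b)]))  # 20.0
```

If the duration were computed from submission time instead, time spent waiting in the queue would inflate a job's weight, which is the inaccuracy the issue points out.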