GPU fairness usage
scarlett2018 opened this issue · comments
Scenario
Low-utilization jobs can occupy GPUs and block other users' jobs from being submitted. We'd like a service plugin that can:
- detect the utilization of all jobs
- notify users whose jobs have had low utilization over the past few days (default: below 20% over 5 days, customizable)
- if the user has a justification for the job's usage, the admin can extend the job's lifetime; otherwise, the low-utilization job will be killed automatically in 1 day.
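The detection step above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `JobUsage` record, its field names, and the choice to compare the mean of daily utilization against the threshold are all hypothetical and are not part of the existing rest-server or any plugin API.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class JobUsage:
    """Hypothetical record: one job's daily average GPU utilization in percent."""
    job_name: str
    user: str
    daily_gpu_util: List[float]  # oldest first, most recent last


def find_low_utilization_jobs(jobs, threshold_pct=20.0, window_days=5):
    """Return jobs whose mean utilization over the last `window_days` is below the threshold.

    Assumption: "20% in 5 days" is read as "mean of the last 5 daily averages < 20%".
    Jobs with less than `window_days` of history are skipped rather than flagged.
    """
    flagged = []
    for job in jobs:
        recent = job.daily_gpu_util[-window_days:]
        if len(recent) < window_days:
            continue  # not enough history to judge this job yet
        if mean(recent) < threshold_pct:
            flagged.append(job)
    return flagged
```

The flagged jobs would then feed the notification step; both defaults are plain parameters so that an admin can customize them.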
An alternative implementation: provide an incentive model with bonus tokens for users, and let each user decide how to spend their GPU hours.
- Add job start time and GPU hours to job utilization. Currently, rest-server only returns the job submission time and job completion time; it does not return the time the job started running. Refer to #4295
- Change user GPU utilization to a weighted average. Currently the REST API returns job duration based on "completion-time - submission-time", not "completion-time - start-running-time", so the weighted average might not be correct. Refer to #4295
- Add date info to the email notification's title, e.g. from "pai cluster utilization" to "pai cluster utilization - 3.17"
- Add a status column showing the job status at the time the report is generated
- Add a GPU count column for the number of GPUs used by the job
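To make the weighted-average item concrete, here is a minimal sketch that weights each job's utilization by its GPU hours, with duration measured from the job's start-running time rather than its submission time. The function names and the dictionary keys (`util`, `gpu_hours`) are hypothetical, chosen only for illustration; they do not reflect the rest-server response format.

```python
from datetime import datetime


def gpu_hours(start_running: datetime, completed: datetime, gpu_count: int) -> float:
    """GPU hours for one job, using start-running time (not submission time)."""
    return gpu_count * (completed - start_running).total_seconds() / 3600.0


def weighted_user_utilization(jobs) -> float:
    """Average utilization (percent) across a user's jobs, weighted by GPU hours.

    `jobs` is a list of dicts with hypothetical keys "util" and "gpu_hours".
    Returns 0.0 when the user has consumed no GPU hours.
    """
    total_hours = sum(job["gpu_hours"] for job in jobs)
    if total_hours == 0:
        return 0.0
    return sum(job["util"] * job["gpu_hours"] for job in jobs) / total_hours
```

For example, a 4-GPU job at 10% and a 1-GPU job at 90%, each running 10 hours, give (10 x 40 + 90 x 10) / 50 = 26%, whereas an unweighted mean would report 50%.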
- Enable debugging mode for the debug VC.
  (debugging mode: users can SSH into the node and use it for debugging for 1~2 hours; the system will automatically disconnect the node when time is up.)
- Prototype for enabling cluster-level policy for job management
  (the prototype: disable the SSH port. Jobs will be automatically killed if their utilization is continuously lower than 20% for 1~2 hours)
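The "continuously lower than 20%" rule in the prototype could be checked along these lines. This is a sketch under assumptions: utilization samples are taken to be `(timestamp_seconds, utilization_percent)` pairs sorted by time, and the function name and the one-hour default window are hypothetical.

```python
def is_continuously_low(samples, threshold_pct=20.0, window_seconds=3600):
    """Return True if every sample in the most recent window is below the threshold.

    `samples` is a list of (timestamp_seconds, utilization_percent) tuples,
    sorted by timestamp ascending. A job with less history than the window
    is never reported as continuously low.
    """
    if not samples:
        return False
    latest = samples[-1][0]
    if latest - samples[0][0] < window_seconds:
        return False  # not enough history to cover the whole window
    window = [util for ts, util in samples if ts >= latest - window_seconds]
    return all(util < threshold_pct for util in window)
```

Requiring every sample in the window to be below the threshold means a single burst of real work resets the check, which avoids killing jobs that are merely between phases.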