GPU fairness usage

Question

scarlett2018 opened this issue 4 years ago · comments

Scenario
Low-utilization jobs can hold GPUs and block other users' jobs from being submitted. We'd like a service plugin that can:

detect the utilization of all jobs
notify users whose jobs have had low utilization over the last few days (default: below 20% over 5 days; customizable)
if the user has a justification for the job's usage, the admin can extend the job's lifetime; otherwise, low-utilization jobs are killed automatically after 1 day.
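
The detection step above could be sketched roughly as follows. This is a minimal illustration, not the plugin's actual implementation: the `UtilizationSample` record, the `find_low_utilization_jobs` function, and the idea that utilization arrives as periodic percentage samples are all assumptions made for the example. Only the defaults (20% over 5 days) come from the scenario.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class UtilizationSample:
    """Hypothetical record: one GPU utilization reading for a job."""
    job_id: str
    user: str
    timestamp: datetime
    gpu_utilization: float  # percent, 0-100


def find_low_utilization_jobs(samples, now, threshold=20.0,
                              window=timedelta(days=5)):
    """Return {job_id: (user, mean_utilization)} for jobs whose mean
    utilization over the window ending at `now` is below `threshold`."""
    cutoff = now - window
    totals = {}  # job_id -> (user, sum of readings, number of readings)
    for s in samples:
        if s.timestamp < cutoff or s.timestamp > now:
            continue  # ignore readings outside the reporting window
        user, total, count = totals.get(s.job_id, (s.user, 0.0, 0))
        totals[s.job_id] = (user, total + s.gpu_utilization, count + 1)

    low = {}
    for job_id, (user, total, count) in totals.items():
        mean = total / count
        if mean < threshold:
            low[job_id] = (user, mean)
    return low
```

The returned mapping groups naturally by user, so a notifier could send one message per user listing that user's flagged jobs. Both the threshold and the window are parameters, matching the "customizable" requirement.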

An alternative implementation: provide an incentive model that grants users bonus tokens, and let users decide how to spend their GPU hours.

Binyang Li · Answer 1 · Tue Mar 17 2020 15:59:12 GMT+0800 (China Standard Time)

Add job start time and GPU hours to job utilization. Currently, rest-server only returns the job submission time and the job completion time; it doesn't return the time the job started running. Refer to #4295
Change user GPU utilization to a weighted average. Because the REST API currently reports job duration as completion time minus submission time, rather than completion time minus start-running time, the weighted average might not be correct. Refer to #4295
Add the date to the email notification's title, e.g. change "pai cluster utilization" to "pai cluster utilization - 3.17"
Add a status column showing the job status at the moment the report is generated
Add a GPU count column showing the number of GPUs used by the job
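
To make the weighted-average item concrete, here is a rough sketch that weights each job's utilization by its GPU hours, with duration measured from the start-running time rather than the submission time. The function names and the tuple layout are hypothetical, and weighting by GPU hours is one plausible reading of "weighted average", not something the thread spells out.

```python
from datetime import datetime


def gpu_hours(start_time, completion_time, gpu_count):
    """GPU hours consumed by a job, measured from when it started running."""
    hours = (completion_time - start_time).total_seconds() / 3600.0
    return max(hours, 0.0) * gpu_count


def weighted_user_utilization(jobs):
    """Average utilization (percent) across a user's jobs, weighted by GPU hours.

    `jobs` is an iterable of (start_time, completion_time, gpu_count,
    average_utilization_percent) tuples. Returns None if the jobs
    consumed no GPU hours.
    """
    total_hours = 0.0
    weighted_sum = 0.0
    for start_time, completion_time, gpu_count, utilization in jobs:
        hours = gpu_hours(start_time, completion_time, gpu_count)
        total_hours += hours
        weighted_sum += hours * utilization
    if total_hours == 0.0:
        return None
    return weighted_sum / total_hours
```

For example, a 10-hour single-GPU job at 10% and a 5-hour two-GPU job at 50% each consume 10 GPU hours, so the weighted average is 30%. If the duration were taken from the submission time instead, time spent waiting in the queue would inflate a job's weight, which is the inaccuracy the item above points out.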

Scarlett Li · Answer 2 · Tue Mar 17 2020 16:18:10 GMT+0800 (China Standard Time)

Enable debugging mode for the debug VC.
(Debugging mode: users can SSH into the node and use it for debugging for 1~2 hours; the system automatically disconnects the node when the time is up.)
Prototype for enabling a cluster-level policy for job management
(The prototype: disable the SSH port. Jobs are automatically killed if their utilization stays below 20% for 1~2 hours.)
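
The kill rule in the prototype could be checked along these lines. This is only a sketch under stated assumptions: `should_kill` is a hypothetical helper, utilization is assumed to arrive as timestamped percentage samples, and "continuously lower than 20%" is read as "every sample in the window is below 20%, and the job has been observed for at least the full window".

```python
from datetime import datetime, timedelta


def should_kill(samples, now, threshold=20.0, window=timedelta(hours=1)):
    """Decide whether a job has been continuously under-utilized.

    `samples` is a list of (timestamp, utilization_percent) pairs in any
    order. Returns True only if the job has been observed for at least
    `window` and every sample inside the window is below `threshold`.
    """
    window_start = now - window
    in_window = [(t, u) for t, u in samples if window_start <= t <= now]
    if not in_window:
        return False  # no recent data: do not kill on missing metrics
    oldest = min(t for t, _ in samples)
    if oldest > window_start:
        return False  # job has not been observed for a full window yet
    return all(u < threshold for _, u in in_window)
```

Requiring a full window of observations avoids killing a job that has only just started and has not yet ramped up. A single sample at or above the threshold inside the window resets the decision.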