neilmunday / slurm-mail

Slurm-Mail is a drop-in replacement for Slurm's e-mails that gives users much more information about their jobs than the standard Slurm e-mails.

Using scontrol for jobs that were canceled yields "invalid job id"

jcklie opened this issue · comments

Versions

OS version: CentOS 7
Slurm version: 22.05.02
Slurm Mail version: 4.1 Snapshot

Describe the bug

Using scontrol for jobs that were canceled yields "invalid job id"

Logs

2022/11/02 12:51:03:INFO: processing: /var/spool/slurm-mail/31178_1667381825.054817.mail
2022/11/02 12:51:03:WARNING: job 31178: could not parse 'None' for job start timestamp
2022/11/02 12:51:03:ERROR: Failed to run: /usr/bin/scontrol -o show job=31178
2022/11/02 12:51:03:ERROR:
2022/11/02 12:51:03:ERROR: slurm_load_jobs error: Invalid job id specified

The output of the commands is

$ /usr/bin/scontrol -o show job=31178
slurm_load_jobs error: Invalid job id specified

$ sacct -j 31178 -P -n --fields=JobId,User,Group,Partition,Start,End,State,ReqMem,MaxRSS,NCPUS,TotalCPU,NNodes,WorkDir,Elapsed,ExitCode,Comment,Cluster,NodeList,TimeLimit,TimelimitRaw,JobIdRaw,JobName

31178|censored_user_name|domänen-benutzer|ukp|None|2022-11-02T10:37:04|CANCELLED by 1060117793|36G||4|00:00:00|1|censored_path|00:00:00|0:0||ukp-cluster|None assigned|3-00:00:00|4320|31178|vada

Thanks for the info. Are you able to tell me the sequence of events that led to this issue?

It's odd again that sacct is showing an end time for the job but not a start time.

slurm-send-mail should be invoked by cron once per minute so any jobs processed from the Slurm-Mail spool files should still be present in the slurmctld job cache.

What have you set MinJobAge to in your slurm.conf? E.g.

scontrol show config | grep MinJobAge
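For reference, the check could be scripted along these lines. This is only a rough sketch: `parse_min_job_age` and `CRON_INTERVAL_SECS` are hypothetical names, and the sample line is an assumed example of what `scontrol show config` prints, not output taken from this cluster.

```python
import re


def parse_min_job_age(config_text):
    """Return MinJobAge in seconds from `scontrol show config` text, or None."""
    match = re.search(r"^MinJobAge\s*=\s*(\d+)", config_text, re.MULTILINE)
    return int(match.group(1)) if match else None


# Assumed sample line; real input would be the output of `scontrol show config`.
sample_config = "MinJobAge               = 300 sec\n"

# slurm-send-mail is expected to be run by cron once per minute.
CRON_INTERVAL_SECS = 60

min_job_age = parse_min_job_age(sample_config)
if min_job_age is not None and min_job_age <= CRON_INTERVAL_SECS:
    # Jobs could be purged from the slurmctld cache before the next cron run.
    print("MinJobAge may be too short for the cron interval")
else:
    print("MinJobAge:", min_job_age)
```

If MinJobAge is comfortably larger than the cron interval, a purged job record is unlikely to be the cause of the scontrol error.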

Is it also possible to update to Slurm 22.05.5?

I think I reproduced it:

  1. Queue a job that cannot start immediately (I requested so many GPUs that our Slurm does not have them ready atm to make sure it will not run)
  2. Cancel the job before it runs
  3. See the result in the bug description

I've just pushed some new commits that do the following:

  • Added new integration test for testing jobs that are cancelled before they are despatched
  • Added new never-ran.tpl template to be used for jobs that are never despatched
  • Added logic to detect jobs that are cancelled before they are despatched
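
To illustrate the detection idea in the last bullet, here is a minimal sketch. It is not the code from the commits: `parse_sacct_line` and `never_ran` are hypothetical helpers, and the rule (state begins with CANCELLED and Start is "None") is an assumption based on the sacct output in the bug report.

```python
# Illustrative sketch only; not the code from the Slurm-Mail commits.
# Field order follows the sacct --fields list shown in the bug report.
SACCT_FIELDS = [
    "JobId", "User", "Group", "Partition", "Start", "End", "State",
    "ReqMem", "MaxRSS", "NCPUS", "TotalCPU", "NNodes", "WorkDir",
    "Elapsed", "ExitCode", "Comment", "Cluster", "NodeList",
    "TimeLimit", "TimelimitRaw", "JobIdRaw", "JobName",
]


def parse_sacct_line(line):
    """Split one line of `sacct -P -n` output into a field dictionary."""
    values = line.rstrip("\n").split("|")
    return dict(zip(SACCT_FIELDS, values))


def never_ran(job):
    """Return True if the job was cancelled without ever getting a start time."""
    # Assumption: sacct reports Start as "None" for such jobs, as seen in the
    # report above; other Slurm versions may set a start time instead.
    return job["State"].startswith("CANCELLED") and job["Start"] == "None"


# The sacct line from the bug report.
sample = (
    "31178|censored_user_name|domänen-benutzer|ukp|None|2022-11-02T10:37:04|"
    "CANCELLED by 1060117793|36G||4|00:00:00|1|censored_path|00:00:00|0:0||"
    "ukp-cluster|None assigned|3-00:00:00|4320|31178|vada"
)
job = parse_sacct_line(sample)
print(job["JobId"], never_ran(job))
```

A job flagged this way would skip the scontrol lookup and use the never-ran.tpl template instead.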

I'd be interested to know if these modifications work for you please?

Note: I have observed slightly different behaviour under Slurm 21.08.8-2 - here it looks like the start time is set for jobs that are cancelled whilst pending. I will investigate this further as the integration tests for this version of Slurm are failing my new test.

Fix committed to the 4.1 branch. Are you able to test if this works for you please?

I see no errors in our log anymore and a lot of mails suddenly got unstuck! I think this is now fixed, thank you very very much for this!

Excellent - thanks for the quick response.

Closing ticket.