eth-cscs / firecrest

`/compute/jobs` returns an error if an incorrect or old job id is included in `jobs`

simonbray opened this issue

I've been trying to track jobs submitted by a workflow with:

job_ids = [33322827, 33322828, 33322830, ...]
poll_results = self.firecrest_client.poll_active(machine, jobs=job_ids)

The problem is that if just one of the job ids is invalid or just stale, the response is something like:

firecrest.FirecrestException.FirecrestException: last request: 200 {'task': {'created_at': '2022-12-06T10:07:01', 'data': 'Could not chdir to home directory /users/xyz: No such file or directoryslurm_load_jobs error: Invalid job id specified', 'description': 'Finished with errors', 'hash_id': 'xyz', 'last_modify': '2022-12-06T10:07:01', 'service': 'compute', 'status': '400', 'task_id': 'xyz', 'task_url': 'https://xyz/tasks/000', 'updated_at': '2022-12-06T10:07:01', 'user': 'xyz'}}

So I get no information about the remaining valid jobs that I need to poll, and I don't even know which of the ids is at fault, so I can't exclude it. :(

Actually, what I realised I can do is to make the requests without the jobs parameter and then filter the result myself. But if a user does want to specify jobs for whatever reason, it would be nice if the errors could somehow be returned specific to the job id.
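A minimal sketch of that workaround, assuming the unfiltered `poll_active` call returns a list of dicts that each carry a `jobid` key. `split_tracked_jobs` is a hypothetical helper, not part of pyfirecrest, and the response below is stubbed:

```python
def split_tracked_jobs(all_jobs, wanted_ids):
    """Split an unfiltered poll result into tracked jobs and missing ids.

    all_jobs:   list of job dicts (assumed shape: each has a "jobid" key)
    wanted_ids: iterable of job ids, given as int or str
    """
    wanted = {str(j) for j in wanted_ids}
    # Keep only the jobs we are tracking, keyed by job id as a string.
    found = {
        str(job["jobid"]): job
        for job in all_jobs
        if str(job["jobid"]) in wanted
    }
    # Ids that are no longer in the queue (stale or invalid).
    missing = sorted(wanted - set(found))
    return found, missing


# Stubbed response; with a real client this would be something like
# all_jobs = client.poll_active(machine)
all_jobs = [
    {"jobid": "33322827", "state": "RUNNING"},
    {"jobid": "33322830", "state": "PENDING"},
    {"jobid": "40000000", "state": "RUNNING"},  # someone else's workflow
]
found, missing = split_tracked_jobs(all_jobs, [33322827, 33322828, 33322830])
print(found)
print(missing)
```

The ids that come back in `missing` are the ones that are stale or invalid, which is the information the error response does not provide.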

> But if a user does want to specify jobs for whatever reason, it would be nice if the errors could somehow be returned specific to the job id.

I agree that this error is not very helpful, but it stems from a limitation in the output of squeue. For the GET /compute/jobs endpoint (the one used by poll_active), firecrest runs the squeue command on the cluster and parses its output. When squeue fails outright with an error (in this case, slurm_load_jobs error: Invalid job id specified), it gives no details about the valid jobs, so firecrest would have to run (possibly) many more commands to get the status of each job individually.
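On the client side, the same per-job approach could be sketched as below. It costs one request per id, which is the overhead described above. `poll_individually` and `fake_fetch` are hypothetical names. With a real client, `fetch` would be something like `lambda ids: client.poll_active(machine, jobs=ids)`, and the broad `except` would be narrowed to the library's exception type:

```python
def poll_individually(fetch, job_ids):
    """Query each id on its own so one bad id cannot hide the others.

    fetch:   callable taking a list of ids and returning a list of job dicts,
             raising an exception when the request fails
    job_ids: iterable of job ids
    Returns (results, failed), where failed maps a job id to its error text.
    """
    results, failed = [], {}
    for job_id in job_ids:
        try:
            results.extend(fetch([job_id]))
        except Exception as exc:  # narrow this for a real client
            failed[job_id] = str(exc)
    return results, failed


def fake_fetch(ids):
    """Stand-in for the real request: rejects any id it does not know."""
    known = {33322827: "RUNNING", 33322830: "PENDING"}
    for job_id in ids:
        if job_id not in known:
            raise RuntimeError("slurm_load_jobs error: Invalid job id specified")
    return [{"jobid": str(job_id), "state": known[job_id]} for job_id in ids]


results, failed = poll_individually(fake_fetch, [33322827, 33322828, 33322830])
print(results)
print(failed)
```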

It is a bit strange, though, that I couldn't reproduce your error. Is it possible that the failing request contained only one job id? Both through firecrest and directly on the cluster, squeue -j 1 or squeue -j 2 fails with an error similar to yours, but squeue -j 1,2 does not (it simply returns with no info). The same holds for more recent jobs.

> Actually, what I realised I can do is to make the requests without the jobs parameter and then filter the result myself.

I think your filtering is a valid solution, but it depends on what you want to achieve. You may also want to use the poll method, which uses Slurm's sacct command in the background. The accounting data covers older jobs as well, so you can see their state (failed, cancelled, etc.), nodelist, and so on.
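One way to combine the two methods is sketched below: take the state of queued jobs from `poll_active` and ask `poll` only for the ids that are no longer in the queue. This assumes both methods return a list of dicts with `jobid` and `state` keys and that `poll` accepts a `jobs` argument. `job_states` and `FakeClient` are hypothetical and only illustrate the flow:

```python
def job_states(client, machine, job_ids):
    """Return {jobid: state}, falling back to accounting data for old jobs."""
    wanted = {str(j) for j in job_ids}
    states = {}
    # Active jobs: unfiltered squeue-backed query, filtered locally.
    for job in client.poll_active(machine):
        job_id = str(job["jobid"])
        if job_id in wanted:
            states[job_id] = job["state"]
    # Ids not found in the queue: ask the sacct-backed query.
    remaining = sorted(wanted - set(states))
    if remaining:
        for job in client.poll(machine, jobs=remaining):
            states.setdefault(str(job["jobid"]), job["state"])
    return states


class FakeClient:
    """Stub with the two methods used above, returning canned data."""

    def poll_active(self, machine, jobs=None):
        return [{"jobid": "33322827", "state": "RUNNING"}]

    def poll(self, machine, jobs=None):
        history = {"33322828": "COMPLETED"}
        return [
            {"jobid": j, "state": history[j]}
            for j in (jobs or [])
            if j in history
        ]


states = job_states(FakeClient(), "cluster", [33322827, 33322828, 33322830])
print(states)
```

Ids that appear in neither result are simply absent from the returned dict, so the caller can tell them apart from jobs that finished.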

Hi @simonbray, I will close for now but feel free to reopen if you still have doubts about the functionality or need a modification in firecrest's code.