Snakemake-Profiles / slurm

Cookiecutter for snakemake slurm profile

Jobs submitted to cluster fail without explanation

jonahcullen opened this issue

Hello, thank you for providing these profiles. I have been using the slurm one for years and it has enabled all of our pipelines to run really well.

This past week I noticed an issue that I am unsure how to diagnose. From the error log, jobs are submitted to the cluster as expected and singularity is activated (Activating singularity image...), and then the job errors. There is no other information in the err/out files to indicate what happened. I have contacted our HPC admins and they confirmed singularity was working on the given node and cannot see any particular reason on their end for the failure. If I allow restarts in the profile config (restart-times: 2), it resubmits and seems to work, although sometimes it requires more than 1 restart.

Have you seen this before? Is it likely something server-side, rather than an issue with the profile's submission losing the job or something like that?

Thanks for your time,
Jonah.

This past week I noticed an issue that I am unsure how to diagnose.

What has changed since your last successful run of the pipeline? Did you update Python, Snakemake, etc.? Did you update the slurm profile (e.g. by pulling the latest changes from this repo)? Did you update your pipeline? Did the HPC admins update the version of singularity? If you revert to the state in which it last ran successfully, does it run again?
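For comparison, a quick version check along these lines (just a sketch; adjust to however the tools are provided on your cluster) records the current state so it can be matched against the last known-good run:

# Record current tool versions to compare against the last successful run.
snakemake --version
python --version
singularity --version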

If I allow restarts in the profile config (restart-times: 2), it resubmits and seems to work, although sometimes it requires more than 1 restart.

This certainly makes it seem like a spurious error. What is the exit state of the job (sacct -j <job num>)? Can you isolate the error with a minimal, reproducible example pipeline?
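For reference, something like the following shows the state and exit code for each step of a job; the field list is only a suggestion, any sacct fields will do:

# Replace <job num> with the Slurm job id from the submission/err log.
sacct -j <job num> --format=JobID,JobName,Partition,State,ExitCode,DerivedExitCode,Elapsed,MaxRSS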

Hi thank you for your quick response!

The pipeline still runs just fine most of the time, but maybe 10% of submissions now require these random restarts, which was not the case before. There was a routine HPC maintenance day that did in fact cause a number of singularity issues, but those were straightforward to identify because the err logs said as much. This new issue just errors without explanation. Nothing else has been updated.

Here is an example of a job that failed without explanation

JobID           JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
70114370     snakejob.+   small   xxx255          4     FAILED      1:0 
70114370.ba+      batch              xxx255          4     FAILED      1:0 
70114370.ex+     extern              xxx255          4  COMPLETED      0:0 

Nothing else has been updated.

If you haven't updated your Snakefile, the slurm profile, or Snakemake itself, I doubt there is anything that can be done to the scripts in this repo to address your issue. I primarily use conda envs without singularity, so I can't provide any practical advice. Maybe another Slurm user will be able to provide some suggestions.

Okay, thank you for your help though, I appreciate it!

Not sure if this is related, but I have found in the past that if there are problems with a specific node on your cluster - e.g. mounting issues - singularity can occasionally fail silently in snakemake, as snakemake doesn't produce debug logging for singularity. My suggestion would be to see if you can identify whether a specific set of nodes is erroring out, and then try to execute the singularity container on one of those nodes with debug logging turned on and see what happens.
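For example, something along these lines (node name and image path are placeholders) starts the container on a suspect node with singularity's debug logging enabled:

# Request the suspect node explicitly, then run a trivial command in the
# container with debug output. Node name and image path are placeholders.
srun -w <failing-node> --pty \
    singularity --debug exec <image>.sif echo "container started OK"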

Has anyone found a solution to this? I've checked the nodes as suggested by @mbhall88 and unfortunately they seem to fail randomly. It would be good to at least increase the verbosity of the error, because currently it is impossible to pinpoint what might be happening. I've also increased latency-wait in case that would help, but the error persists.
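One possible way to surface more detail (a sketch; the exact profile invocation will differ from this) is to run the master process with Snakemake's own debugging flags:

# --verbose adds debug output from the master process;
# --show-failed-logs prints the log of any failed job into the main log.
snakemake --profile slurm --verbose --show-failed-logs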

Are you able to provide some log files for the snakemake master process and the job that failed?

Hi @mbhall88,

Thanks for replying. I am not sure what you mean by the master process, but here is the (edited) log for one of the jobs that failed:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=64000, mem_mib=61036, disk_mb=1000, disk_mib=954, threads=128
Select jobs to execute...

[Mon Apr  8 10:04:11 2024]
rule mdl_pando:
    input: ...
    output: ...
    jobid: 0
    benchmark: ...
    reason: Missing output files: ...
    resources: mem_mb=64000, mem_mib=61036, disk_mb=1000, disk_mib=954, tmpdir=/scratch/user/job_3048038_noden11, runtime=360, partition=single, threads=128

        Rscript ...
        
Activating singularity image workflow/envs/pando.sif
[Mon Apr  8 10:04:13 2024]
Error in rule mdl_pando:
    jobid: 0
    input: ...
    output: ...
    shell:
        
        Rscript ...
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

Basically it fails when activating/mounting the singularity image. The thing is that this error is quite random: it happens about 10% of the time, and it can happen to any of the jobs in my pipeline that use singularity containers. Since it is random, I have increased the restart-times parameter to improve the chances of a successful run, but that takes longer and sometimes is still not enough. Could it be that many jobs are trying to mount the same image at once and that's why it fails?
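One way to test that hypothesis (a sketch: the resource name sif_mount is made up, and the rules that use the image would also need resources: sif_mount=1 declared in the Snakefile) is to throttle those jobs with a user-defined global resource:

# Allow at most two concurrent jobs that claim the hypothetical sif_mount
# resource; the affected rules must declare `resources: sif_mount=1`.
snakemake --profile slurm --resources sif_mount=2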

Hmmm, I don't really know how else to get at the cause of this. The only other things I can think of are using a different (newer?) version of singularity/apptainer, or trying to replicate the failure on one of the failing nodes... Sorry I can't be of more help.
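If the cluster exposes container runtimes through environment modules (an assumption; module names and versions are placeholders), checking whether a newer apptainer build is available might look like:

# List available singularity/apptainer modules, load one, and confirm.
module avail singularity apptainer
module load apptainer
apptainer --version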