brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"

somalier ancestry and relate freeze on writing output on BeeGFS filesystem

edg1983 opened this issue · comments

Hi Brent,

I'm setting up a pipeline which includes somalier for QC and I'm experiencing a strange behaviour. Essentially, the pipeline is managed by Nextflow and I place the work directory on our HPC scratch space which uses a BeeGFS file system.

It seems that in this context the somalier ancestry and relate commands cannot write their output, and they stall after completing the computations. For example, if I run ancestry, the computation runs perfectly until I get:

[somalier] Epoch:9000. loss: 0.02332. accuracy on unseen data: 0.990.  total-time: 108.48
[somalier] Epoch:9500. loss: 0.03565. accuracy on unseen data: 0.990.  total-time: 114.20
[somalier] Epoch:10000. loss: 0.04361. accuracy on unseen data: 0.990.  total-time: 119.92
[somalier] reduced query set to: [3, 5]
[somalier] wrote text file to somalier-ancestry.somalier-ancestry.tsv

Then the program freezes and never exits. If I force it to quit, I can see the output files have been created. The TSV file contains the expected data, while the HTML file is empty. The same happens for relate, but in this case all output files are empty. On the other hand, somalier extract works fine.

I observe the same behaviour using the somalier Docker container. All commands run perfectly on the same HPC if I work in my /group folder, which uses a standard NFS filesystem, so I guess the problem originates from some strange incompatibility between somalier's writing method and the BeeGFS filesystem.

Any clue on why this may be happening?

Thanks!

Hi, I have not heard of this problem before, but based on what you have written, I am guessing BeeGFS doesn't like writing long lines. It's stalling when writing the HTML, which includes a single very long JSON line. Perhaps you could try writing to a local /tmp from somalier and then moving the file to the BeeGFS system when somalier is finished?

Thanks Brent.
I've updated the Nextflow pipeline to write somalier output to $TMPDIR and then move the files back to the working dir. This fixed the issue. A little annoying, but I'm glad I sorted this out...
Very strange bug indeed, it took me the whole morning to figure out why the pipeline was frozen.
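
For readers who want to apply the same work-around outside Nextflow, here is a minimal Python sketch of the idea: run the command with a node-local temporary directory as its working directory, then move whatever it wrote to the final destination. The function name `run_in_local_tmp` and its parameters are hypothetical and not part of somalier or Nextflow. The sketch assumes the command writes its outputs into the current working directory and that any input paths passed to it are absolute.

```python
import os
import shutil
import subprocess
import tempfile


def run_in_local_tmp(cmd, dest_dir, tmp_root=None):
    """Run cmd inside a local temp dir, then move its outputs to dest_dir.

    Hypothetical helper: cmd is an argument list, dest_dir is the final
    output directory (for example on BeeGFS), and tmp_root is the local
    scratch location ($TMPDIR by default).
    """
    tmp_root = tmp_root or os.environ.get("TMPDIR") or tempfile.gettempdir()
    os.makedirs(dest_dir, exist_ok=True)
    moved = []
    with tempfile.TemporaryDirectory(dir=tmp_root) as workdir:
        # The command writes to local storage only; check=True raises
        # CalledProcessError if it exits with a non-zero status.
        subprocess.run(cmd, cwd=workdir, check=True)
        # Only after the command has exited are the files moved to the
        # destination filesystem.
        for name in sorted(os.listdir(workdir)):
            shutil.move(os.path.join(workdir, name),
                        os.path.join(dest_dir, name))
            moved.append(name)
    return moved
```

In a pipeline, `cmd` would be the somalier ancestry or relate command line already in use, and `dest_dir` the task's work directory. Because the working directory changes, relative input paths would need to be made absolute first.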

Yeah, strange. But glad you have a work-around.

Hi there,

Some of our users on GenomeDK observed this problem as well when testing our new AlmaLinux 8.6-based setup. That is, the BeeGFS client has been updated and is running on AlmaLinux 8.6, but the meta/storage server has not been changed.

We straced somalier and saw that it was repeatedly calling writev(), which causes it to "hang". It seems that BeeGFS has a bug in how they handle writev in the client. After lots of debugging we found out that the call to writev() actually comes from musl's implementation of write().
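
To check how a given mount point behaves, a small probe along these lines may help. It is a sketch in Python using `os.writev`, not the code path somalier or musl actually takes, and the name `writev_probe` and the `max_calls` cap are assumptions made for illustration. It gathers several buffers into `writev()` calls and stops after a bounded number of attempts, so a filesystem on which `writev()` keeps returning without making progress would show up as an incomplete write rather than an endless loop. It would not help if the call blocks inside the kernel.

```python
import os


def writev_probe(path, chunks, max_calls=100):
    """Write chunks to path with writev(), giving up after max_calls calls.

    Returns (bytes_written, calls_made). Hypothetical diagnostic helper.
    """
    pending = [bytes(c) for c in chunks if len(c) > 0]
    written = 0
    calls = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        while pending and calls < max_calls:
            n = os.writev(fd, pending)  # one gather-write system call
            calls += 1
            written += n
            # Drop fully written buffers and trim a partially written one.
            while n > 0 and pending:
                if n >= len(pending[0]):
                    n -= len(pending[0])
                    pending.pop(0)
                else:
                    pending[0] = pending[0][n:]
                    n = 0
    finally:
        os.close(fd)
    return written, calls
```

Running the probe against a file on the suspect filesystem and comparing `bytes_written` with the total size of the chunks gives a quick indication of whether `writev()` completes there. Using one very long chunk mimics the single long JSON line in the HTML report.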

To work around the issue we built a Conda package (recipe here: https://gist.github.com/dansondergaard/a23ec36d3f784ae3c71ae907ee7beca6) that builds somalier, but links to glibc instead of musl. We had to do a bit of patching (I think that the version tag refers to a broken build?), but otherwise it was pretty smooth.

In short, it seems that the issue is caused by a combination of musl's use of writev, the kernel version and something in the way BeeGFS deals with writev(). It should probably be reported to the BeeGFS developers.

wow! @dansondergaard thanks for digging in and figuring this out!
Much appreciated.