brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"

Found Zero Sites

marcustutert opened this issue · comments

Hi,

I'm trying to run somalier to infer relatedness between a large sample of WES datasets (in VCF format).

This dataset has roughly 30k samples and 800 SNPs and I'm using the following command:

/somalier extract /lustre/scratch123/hgi/mdt2/projects/ibdgwas_bioresource/mt27/CSI/Scratch/WES/WES_QC_SNP.vcf.gz -s sites.hg38.vcf.gz -f /lustre/scratch118/humgen/resources/ref/Homo_sapiens/HS38DH/hs38DH.fa

However I'm getting an error:

somalier version: 0.2.15
[somalier] FORMAT field 'AD' not found for depth information. using genotype only
[somalier] found 0 sites
common.nim(98) write_counts
Error: unhandled exception: somalier: error opening file: ./EGAN00002052028.somalier [IOError]

Not sure exactly what I might be doing wrong here! Cheers.

Hi, thanks for reporting. Here are the things you can check:

  1. Is your VCF called on hg38?
  2. What are the chromosome names in your VCF? If they don't have "chr" prefix, then you should use https://github.com/brentp/somalier/files/3412454/sites.hg38.nochr.vcf.gz
  3. What caller was used for these variants? Most callers will have an AD field. somalier also looks for GQ and genotype fields.

Perhaps you could share the header and a couple of variants so I can see the problem.
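A quick way to check point 2 is to list the chromosome names that actually appear in the data lines of each file and compare them. This is only a sketch: `vcf_chroms` is a hypothetical helper name, and it assumes the files are gzip/bgzip-compressed and small enough to stream in full.

```shell
# List the distinct chromosome names used in the data lines of a gzip/bgzip-compressed VCF.
# vcf_chroms is a hypothetical helper, not part of somalier.
vcf_chroms() {
    gzip -dc "$1" | grep -v '^#' | cut -f1 | sort -u
}

# Example (paths are placeholders): run it on both files and compare the naming.
# vcf_chroms WES_QC_SNP.vcf.gz
# vcf_chroms sites.hg38.vcf.gz
```

If one file prints names like `chr1` and the other prints `1`, the two files use different naming and no sites will match.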

Finally, the error about opening the file is most likely because you don't have permission to write to the current directory. You can use somalier extract -d $DIR, where $DIR is somewhere you have write permissions. But you'll need to solve the problem with 0 sites found for the somalier files to be useful.
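Before pointing -d at a directory, it can help to confirm that a file can really be created there, since a permission bit alone does not tell you about quotas or other limits. A minimal sketch, where `can_write_dir` is a hypothetical helper name:

```shell
# Succeed only if the directory exists and a file can actually be created in it.
# can_write_dir is a hypothetical helper, not part of somalier.
can_write_dir() {
    dir=$1
    [ -d "$dir" ] || return 1
    # Try to create (and then remove) a throwaway probe file.
    probe=$(mktemp "$dir/.write_probe.XXXXXX" 2>/dev/null) || return 1
    rm -f "$probe"
}

# Example (DIR is a placeholder):
# can_write_dir "$DIR" && echo "ok to use with -d" || echo "cannot write to $DIR"
```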

Thanks Brent for the fast and detailed response!
2) solved the issue -- thought I had the chr prefix but I think an earlier file conversion with PLINK removed them without me double checking.

However, it appears that some sites were 'lost' during this extraction: only 230 sites were extracted, whereas my starting VCF has approximately 800. Is this a cause for concern? I'm wondering whether it is worth doing a deep dive on why/which sites were missed, and whether that might have an impact on my downstream relatedness calculations.

Will check the error re: the writing out permissions as well, thanks for pointing that out!

Hi, 230 out of 800 sites doesn't seem too bad unless the 800 were based on the sites file from somalier.
That sites file has only ~20K sites.
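To get a rough idea of how many records in the VCF fall on positions in the sites file, the two can be compared by chromosome and position. This is a sketch only: `shared_sites` is a hypothetical helper, it compares positions and nothing else, and somalier's own matching may apply further criteria, so the count need not equal what somalier reports.

```shell
# Count data lines in a cohort VCF whose CHROM:POS also appears in a sites VCF.
# shared_sites is a hypothetical helper; both inputs are assumed to be gzip/bgzip-compressed.
shared_sites() {
    sites_pos=$(mktemp)
    # Collect CHROM:POS keys from the sites file.
    gzip -dc "$1" | awk -F'\t' '!/^#/ { print $1 ":" $2 }' > "$sites_pos"
    # Count cohort records whose key is in that set.
    gzip -dc "$2" | awk -F'\t' -v sites="$sites_pos" '
        BEGIN { while ((getline line < sites) > 0) want[line] = 1 }
        !/^#/ && (($1 ":" $2) in want) { n++ }
        END { print n + 0 }'
    rm -f "$sites_pos"
}

# Example (paths are placeholders):
# shared_sites sites.hg38.nochr.vcf.gz WES_QC_SNP.vcf.gz
```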

Hi @brentp, hope you had a nice weekend. I'm just following up re: the error I highlighted above about permissions within a directory in somalier. I dug into this a bit more with my IT team, which controls the cluster I am running somalier on, and it turns out that a storage quota is being filled when I run somalier. The very strange part is that somalier, from what I can tell, is only creating a series of fairly small EGAN files of about 200KB each, and I have 10s of TBs of space available to work with in this directory. However, for some reason, after running somalier I get this error:

somalier version: 0.2.15
[somalier] FORMAT field 'AD' not found for depth information. using genotype only
[somalier] found 229 sites
common.nim(98)           write_counts
Error: unhandled exception: somalier: error opening file: /lustre/scratch123/hgi/mdt2/projects/ibdgwas_bioresource/mt27/CSI/Scratch/WES//EGAN00002052033.somalier [IOError]

This then results in my working directory which all the sudden becomes 100% full (can't even save a single .txt file), which suggests to me some sort of errant process was run when I executed somalier?

Do you think you could help me through this a bit more?

Thanks!

can you show the full command and let me know what was the working directory?

system("/lustre/scratch123/hgi/projects/ibdgwas_bioresource/mt27/somalier/somalier extract /lustre/scratch123/hgi/mdt2/projects/ibdgwas_bioresource/mt27/CSI/Scratch/WES/WES_QC_SNP.vcf.gz -s /lustre/scratch123/hgi/projects/ibdgwas_bioresource/mt27/somalier/sites.hg38.nochr.vcf.gz -f /lustre/scratch118/humgen/resources/ref/Homo_sapiens/HS38DH/hs38DH.fa -d /lustre/scratch123/hgi/mdt2/projects/ibdgwas_bioresource/mt27/CSI/Scratch/WES/")

The working directory was /lustre/scratch123/hgi/mdt2/projects/ibdgwas_bioresource/mt27/CSI/Scratch/

somalier should not be writing anything to your working directory (unless working directory == -d).
What is written to your working directory?

Does anything get written to: /lustre/scratch123/hgi/mdt2/projects/ibdgwas_bioresource/mt27/CSI/Scratch/WES/ ?
If so, what, and how many files are there?
Perhaps there is a limit on the number of files per directory in your file system?
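Counting the files in the output directory is a quick first check. The sketch below uses a hypothetical helper, `count_files`, and assumes a `find` that supports -maxdepth (GNU and BSD versions do). On a Lustre file system, `lfs quota` may also report file-count (inode) usage alongside block usage, depending on how the cluster is configured.

```shell
# Count regular files directly inside a directory (not in subdirectories).
# count_files is a hypothetical helper, not part of somalier.
count_files() {
    find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}

# Example (path is a placeholder):
# count_files /path/to/output_dir
# df -i /path/to/output_dir    # inode usage, where the file system reports it
```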

That is correct, it's not writing anything to my directory I'm running somalier from (/Scratch), but it is writing to the directory I have specified as output (-d /lustre/scratch123/hgi/mdt2/projects/ibdgwas_bioresource/mt27/CSI/Scratch/WES/)
The files I get are the following EGANS:

EGAN00001730357.somalier EGAN00001730398.somalier EGAN00001730432.somalier EGAN00001730515.somalier EGAN00002052027.somalier
EGAN00001730359.somalier EGAN00001730399.somalier EGAN00001730436.somalier EGAN00001730520.somalier EGAN00002052028.somalier
EGAN00001730361.somalier EGAN00001730400.somalier EGAN00001730438.somalier EGAN00001730522.somalier EGAN00002052029.somalier
EGAN00001730363.somalier EGAN00001730401.somalier EGAN00001730441.somalier EGAN00001730524.somalier EGAN00002052030.somalier
EGAN00001730368.somalier EGAN00001730402.somalier EGAN00001730445.somalier EGAN00001730527.somalier EGAN00002052031.somalier
EGAN00001730371.somalier EGAN00001730403.somalier EGAN00001730446.somalier EGAN00001730529.somalier EGAN00002052032.somalier
EGAN00001730373.somalier EGAN00001730404.somalier EGAN00001730447.somalier EGAN00002052013.somalier
EGAN00001730376.somalier EGAN00001730412.somalier EGAN00001730463.somalier EGAN00002052014.somalier
EGAN00001730379.somalier EGAN00001730414.somalier EGAN00001730465.somalier EGAN00002052015.somalier
EGAN00001730381.somalier EGAN00001730416.somalier EGAN00001730466.somalier EGAN00002052016.somalier
EGAN00001730382.somalier EGAN00001730418.somalier EGAN00001730467.somalier EGAN00002052017.somalier
EGAN00001730383.somalier EGAN00001730420.somalier EGAN00001730468.somalier EGAN00002052018.somalier
EGAN00001730386.somalier EGAN00001730421.somalier EGAN00001730469.somalier EGAN00002052019.somalier
EGAN00001730387.somalier EGAN00001730422.somalier EGAN00001730476.somalier EGAN00002052020.somalier
EGAN00001730389.somalier EGAN00001730423.somalier EGAN00001730478.somalier EGAN00002052021.somalier
EGAN00001730392.somalier EGAN00001730424.somalier EGAN00001730479.somalier EGAN00002052022.somalier
EGAN00001730393.somalier EGAN00001730427.somalier EGAN00001730480.somalier EGAN00002052023.somalier
EGAN00001730395.somalier EGAN00001730428.somalier EGAN00001730489.somalier EGAN00002052024.somalier
EGAN00001730396.somalier EGAN00001730430.somalier EGAN00001730497.somalier EGAN00002052025.somalier
EGAN00001730397.somalier EGAN00001730431.somalier EGAN00001730512.somalier EGAN00002052026.somalier

I would be shocked if ~30 files put me over my quota, but I have checked in with IT about it. Each of the files is only about 212KB, so nothing large either.

So you say this:

This then results in my working directory which all the sudden becomes 100% full (can't even save a single .txt file)

and this:

it's not writing anything to my directory I'm running somalier from

but also:

This then results in my working directory which all the sudden becomes 100% full (can't even save a single .txt file)

so you must be hitting some sort of limit. Somalier needs to write a somalier file for each sample in the VCF.

You could do a quick test like:

for i in $(seq 1 10000); do echo $i > $i.tmp; done

and see if this creates the same problems.
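A slightly more careful variant of that test writes the files into a throwaway subdirectory, stops at the first failed write, reports how many files were created, and cleans up afterwards. This is a sketch; `probe_file_limit` is a hypothetical helper name, and the directory and count are whatever you choose to pass in.

```shell
# Create up to N small files in a scratch subdirectory of DIR, stop at the
# first write that fails, print how many were created, then clean up.
# probe_file_limit is a hypothetical helper, not part of somalier.
probe_file_limit() {
    dir=$1
    n=$2
    probe="$dir/probe.$$"
    mkdir "$probe" || return 1
    i=1
    while [ "$i" -le "$n" ]; do
        # Silence the shell's error message first, then attempt the write.
        if ! echo "$i" 2>/dev/null > "$probe/$i.tmp"; then
            break
        fi
        i=$((i + 1))
    done
    echo $((i - 1))
    rm -rf "$probe"
}

# Example (path is a placeholder):
# probe_file_limit /path/to/output_dir 10000
```

If the printed number is lower than the number requested, the directory is hitting a limit on file creation rather than somalier doing anything unusual.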