brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"

[error] SIGSEGV: Illegal storage access. (Attempt to read from nil?) Segmentation fault (core dumped)

egenomics opened this issue

Hi,
I am getting an error in the last step of ancestry calling, after successfully generating all the .somalier query files and downloading the relevant ancestry-labels-1kg.tsv file and the 1kg.somalier/*.somalier files.

Here is the command, which we have tested on two different machines with the same error:

(base) jlvillanueva@EEP10709:~/Downloads/somalier_aina$ ll
total 117916
drwxrwxr-x  4 jlvillanueva jlvillanueva     4096 Dec 18 15:45 ./
drwxr-xr-x 54 jlvillanueva jlvillanueva    40960 Dec 18 16:06 ../
drwxrwxr-x  3 jlvillanueva jlvillanueva     4096 Dec 18 15:44 1kg.somalier/
-rw-rw-r--  1 jlvillanueva jlvillanueva 82856769 Dec 18 15:44 1kg.somalier.tar.gz
-rw-rw-r--  1 jlvillanueva jlvillanueva    56028 Dec 18 15:44 ancestry-labels-1kg.tsv
drwxrwxr-x  2 jlvillanueva jlvillanueva     4096 Dec 18 15:09 cohort/
-rw-rw-r--  1 jlvillanueva jlvillanueva   265818 Dec 18 15:44 sites.hg38.vcf.gz
-rwxrwxr-x  1 jlvillanueva jlvillanueva 37500280 Dec 18 15:44 somalier*
(base) jlvillanueva@EEP10709:~/Downloads/somalier_aina$ ./somalier ancestry --labels ancestry-labels-1kg.tsv 1kg.somalier/*.somalier ++ cohort/*.somalier
somalier version: 0.2.18
SIGSEGV: Illegal storage access. (Attempt to read from nil?)
Segmentation fault (core dumped)

Hi, can you run the same command with the binary attached here (after gunzip somalier_dbg.gz && chmod +x somalier_dbg) and show the output?
somalier_dbg.gz

Hi,
Thanks for the quick response! I get the following error:

(base) jlvillanueva@EEP10709:~/Downloads/somalier_aina$ ./somalier_dbg ancestry --labels ancestry-labels-1kg.tsv 1kg.somalier/*.somalier ++ cohort/*.somalier
somalier version: 0.2.19
/home/brentp/src/somalier/src/somalier.nim(276) somalier
/home/brentp/src/somalier/src/somalier.nim(263) main
/home/brentp/src/somalier/src/somalierpkg/ancestry.nim(137) ancestry_main
/nim-1.6.6/lib/system/fatal.nim(53) sysFatal
Error: unhandled exception: index out of bounds, the container is empty [IndexDefect]

It seems that the training matrix (1kg) is empty so either the sites don't match or you don't have samples in that directory. What does:

ls -lh 1kg.somalier/*.somalier | head

show?

I feel a bit dumb... There is another folder inside 1kg.somalier. I have fixed the command. However, it still gives an error:

./somalier_dbg ancestry --labels ancestry-labels-1kg.tsv 1kg.somalier/1kg-somalier/*.somalier ++ cohort/*.somalier
somalier version: 0.2.19
Segmentation fault (core dumped)
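
The empty training matrix earlier in the thread came from a glob that matched no files. A minimal Python sketch for checking a directory before running somalier is shown here; the helper name `locate_somalier_files` and the sample file name are made up for illustration and are not part of somalier.

```python
import tempfile
from pathlib import Path


def locate_somalier_files(root):
    """Split *.somalier files under root into (direct, nested) lists."""
    root = Path(root)
    found = sorted(p for p in root.rglob("*.somalier") if p.is_file())
    direct = [p for p in found if p.parent == root]
    nested = [p for p in found if p.parent != root]
    return direct, nested


# Demo on a throwaway directory that mimics the layout from this thread:
# the sample files sit one level deeper than the glob expected.
with tempfile.TemporaryDirectory() as tmp:
    inner = Path(tmp) / "1kg.somalier" / "1kg-somalier"
    inner.mkdir(parents=True)
    (inner / "HG00096.somalier").touch()  # placeholder file name

    direct, nested = locate_somalier_files(Path(tmp) / "1kg.somalier")
    print(f"directly inside: {len(direct)}, nested deeper: {len(nested)}")
    if not direct and nested:
        print("try this directory instead:", nested[0].parent.name)
```

If the first count is zero and the second is not, the glob passed to somalier probably needs one more path component.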

Hmm. It's a problem that we're not getting any information beyond the segfault now.

We have tested it on two different computers with the same error :(

Yes, I expect that it will be the same on any machine. How many samples are you looking at?
I attach here another binary with hopefully more debug info turned on. Maybe it will give us more clues.
somalier_dbg2.gz

The ancestry stuff is, as you're finding, less used and more prone to problems than the rest of somalier. You might also try python scripts/ancestry-predict.py which uses PCA -> SVM instead of a neural network. You can run that with -h to see the arguments.

I am looking at 24 samples:

ls cohort/*.somalier | wc -l
24

I have tried the second debug binary, but I get no more information than with the previous one:

./somalier_dbg2 ancestry --labels ancestry-labels-1kg.tsv 1kg.somalier/1kg-somalier/*.somalier ++ cohort/*.somalier
somalier version: 0.2.19
Segmentation fault (core dumped)

As for the Python script, I get a strange error:

python code/somalier/scripts/ancestry-predict.py --labels ancestry-labels-1kg.tsv --backgrounds 1kg.somalier/1kg-somalier/*.somalier --samples cohort/*.somalier --plot test_plot
Traceback (most recent call last):
  File "/home/jlvillanueva/Downloads/somalier_aina/code/somalier/scripts/ancestry-predict.py", line 171, in <module>
    df_pca = df_pca.append(
  File "/home/jlvillanueva/miniconda3/lib/python3.9/site-packages/pandas/core/generic.py", line 5989, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'append'

Thanks again for your assistance Brent!

The plot is generated though:
test_plot

Looks like DataFrame.append is gone from pandas (it was removed in pandas 2.0). You can change lines 171-172 from:

            df_pca = df_pca.append(
                other=(pd.DataFrame(test_reduced, test_samples, labels_pc)))

to:

            df_pca = pd.concat([df_pca, pd.DataFrame(test_reduced, test_samples, labels_pc)])

I think that should work, but haven't tested it.
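
For anyone who wants to check the replacement outside the script, here is a small standalone sketch. The values below are made-up stand-ins for the script's `df_pca`, `test_reduced`, `test_samples` and `labels_pc`; only the `pd.concat` line mirrors the suggested change.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the variables used in ancestry-predict.py.
labels_pc = ["PC1", "PC2"]
df_pca = pd.DataFrame(
    [[0.1, 0.2], [0.3, 0.4]], index=["bg1", "bg2"], columns=labels_pc
)
test_reduced = np.array([[0.5, 0.6]])  # projected query sample(s)
test_samples = ["query1"]              # their sample IDs

# DataFrame.append no longer exists in pandas >= 2.0; concat does the same job.
df_pca = pd.concat(
    [df_pca, pd.DataFrame(test_reduced, test_samples, labels_pc)]
)
print(df_pca)
```

The query rows end up below the background rows, with the sample IDs kept as the index.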

You can also change other things in the script. For example, on line 92 you can change n_components to 3.
You can also see the other parameters to change for the SVM: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
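
To give a feel for which knobs are involved, here is a rough sketch of a PCA -> SVM classifier in scikit-learn on synthetic data. It is not the code from ancestry-predict.py: the data, the pipeline layout and the SVC settings are all assumptions chosen for illustration, and the real script may be organised differently.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic "background" data: two well-separated groups, 10 features each.
X = np.vstack([
    rng.normal(0.0, 0.1, size=(20, 10)),
    rng.normal(2.0, 0.1, size=(20, 10)),
])
y = np.array(["A"] * 20 + ["B"] * 20)

# n_components is the PCA setting discussed above; C, kernel and gamma
# are among the SVC parameters described in the scikit-learn docs.
clf = make_pipeline(
    PCA(n_components=3),
    SVC(C=1.0, kernel="rbf", gamma="scale"),
)
clf.fit(X, y)

# Two synthetic "query" samples, one near each group.
queries = np.vstack([np.full((1, 10), 0.0), np.full((1, 10), 2.0)])
pred = clf.predict(queries)
print(pred)
```

Changing `n_components` changes how many principal components the SVM sees, which is why it can shift how cleanly samples are assigned.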

If you make all of these changes and get something that looks good, I'd be happy to get a PR that incorporates the changes.

With these changes it looks like it works. I also tried setting n_components to 3, and for the test run it looks visually better at assigning populations.
I will run it on many more samples to see what we get.

Do you know if there is a background dataset with more population granularity? It would be quite interesting to know the population of origin for certain patients; the continental level is a hint but still very general. We usually have exomes and gene panels, so most intergenic SNPs are not captured.

The 1000 Genomes Project has finer subpopulation labels, but then each one has so few training samples that the predictions are not as reliable. There may be other resources for this, but I haven't kept up with them.

Glad to hear it's working.