parklab / LiRA

Different number of variants between single cell and bulk

KEYS248 opened this issue · comments

At line 375 of the functions.R script, I get the following error:

Error in `$<-.data.frame`(`*tmp*`, "id", value = c(".", ".", "rs371110082",  : 
  replacement has 218653 rows, data has 324935
Calls: compare -> $<- -> $<-.data.frame
Execution halted

It seems that the site.frame variable gets populated by the single.cell.vcf.info variable and then attempts to fill more columns of site.frame with the bulk.vcf.info variable. The error seems to occur because bulk.vcf.info has fewer rows than single.cell.vcf.info. Is this situation not supposed to occur? Did I make a mistake with some previous step?

If I kept the rows of bulk.vcf.info that match site.frame, there are still rows of site.frame that would be empty and I'm not sure how the rest of the program will handle that.
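One way to check whether the two inputs really describe different sets of sites is to compare their site keys directly. The following is a minimal sketch, not part of LiRA: the function names `site_keys` and `compare_sites` are hypothetical, and it assumes plain or gzipped VCF text with the standard tab-separated CHROM, POS, ID, REF, ALT columns.

```python
import gzip


def site_keys(vcf_path):
    """Return the set of (CHROM, POS, REF, ALT) keys found in a VCF file."""
    # Assumption: files ending in .gz are gzip/bgzip compressed text.
    opener = gzip.open if vcf_path.endswith(".gz") else open
    keys = set()
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            # Skip header lines and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            keys.add((fields[0], fields[1], fields[3], fields[4]))
    return keys


def compare_sites(single_cell_vcf, bulk_vcf):
    """Count sites shared by both VCFs and sites unique to each one."""
    sc = site_keys(single_cell_vcf)
    bulk = site_keys(bulk_vcf)
    return {
        "shared": len(sc & bulk),
        "single_cell_only": len(sc - bulk),
        "bulk_only": len(bulk - sc),
    }
```

If `single_cell_only` or `bulk_only` is non-zero, the two files do not list the same sites, which would be consistent with the row-count mismatch reported in the error above.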

Hi, I met the same bug at line 375 of the functions.R script. Have you solved the problem yet?

@red-t Unfortunately, I did not solve it. My research group decided to move forward without using LiRA.

Can you confirm that the input single-cell and bulk samples were run on the same multisample VCF file?

@KEYS248 All right, thank you.

> Can you confirm that the input single-cell and bulk samples were run on the same multisample VCF file?

Sorry, I'm not sure what "the same multisample VCF file" means. As a test, I ran GATK separately on one single-cell sample and one bulk sample, then ran LiRA step by step for each sample. These are the config files for the single-cell and bulk samples:

single-cell:
name    SRR475137
analysis_path   SRR475137
reference_file   human_g1k_v37.fasta
bam   SRR475137.bam
vcf     SRR475137.vcf.gz
gender  female
sample  SC
bulk    F
reference_identifier    hg19
phasing_software        eagle
only_chromosomes        21,22

bulk:
name    SRR475185
analysis_path   SRR475185
reference_file   human_g1k_v37.fasta
bam   SRR475185.bam
vcf     SRR475185.vcf.gz
gender  female
sample  BULK
bulk    T
reference_identifier    hg19
phasing_software        eagle
only_chromosomes        21,22
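A quick way to spot this kind of mismatch is to parse both config files and compare their `vcf` entries. This is a rough sketch under the assumption that each config line is a key followed by whitespace and a value, as in the listings above; `parse_config` is a hypothetical helper, not a LiRA function, and only a few of the keys are repeated here.

```python
def parse_config(text):
    """Parse whitespace-separated 'key value' lines into a dict."""
    config = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            config[parts[0]] = parts[1].strip()
    return config


single_cell = parse_config("""name    SRR475137
vcf     SRR475137.vcf.gz
sample  SC
bulk    F
""")

bulk = parse_config("""name    SRR475185
vcf     SRR475185.vcf.gz
sample  BULK
bulk    T
""")

# The two configs point at different VCF files, so this prints False.
same_vcf = single_cell["vcf"] == bulk["vcf"]
print(same_vcf)
```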

I see that the vcf field differs between the two configs, so what I suggested is likely the issue (it will be, unless the VCFs in both configs describe the same set of sites).

"The same multisample VCF file" refers to a VCF file with multiple sample columns describing evidence for the listed variants across multiple BAM files. It can be created using GATK. See, e.g.: https://gatk.broadinstitute.org/hc/en-us/articles/360035889971
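To see whether a given VCF is multisample, one can read the sample columns from its `#CHROM` header line. The sketch below assumes the standard VCF layout, where sample names start at the tenth column; `vcf_samples` and `missing_samples` are hypothetical helper names, and which sample names LiRA expects is not something this sketch knows.

```python
import gzip


def vcf_samples(vcf_path):
    """Return the sample column names from the #CHROM header line of a VCF."""
    # Assumption: files ending in .gz are gzip/bgzip compressed text.
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # Columns 1-9 are CHROM..FORMAT; samples follow.
                return line.rstrip("\n").split("\t")[9:]
    return []


def missing_samples(vcf_path, expected):
    """Return the expected sample names that are absent from the VCF header."""
    present = set(vcf_samples(vcf_path))
    return [name for name in expected if name not in present]
```

A VCF produced by calling a single BAM would typically show one sample column, so `missing_samples` would report the other sample as absent.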

The instructions there for calling on multiple input GVCFs (e.g. "If you have GVCFs from multiple samples...") produce the input LiRA expects, provided the samples include at least the bulk and the single cell.
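As a rough illustration of that workflow, the sketch below only builds and prints the command lines; it does not run them. It assumes GATK4-style syntax (`HaplotypeCaller -ERC GVCF`, then `CombineGVCFs`, then `GenotypeGVCFs`), which may differ from the GATK version in use. The BAM and reference names are taken from the configs in this thread, while the output and intermediate file names and the function `joint_calling_commands` are hypothetical.

```python
import shlex


def joint_calling_commands(reference, bams, output_vcf):
    """Build GATK4-style commands that produce one multisample VCF.

    bams maps a label (used only to name the intermediate GVCF) to a BAM path.
    """
    commands = []
    gvcfs = []
    for label, bam in bams.items():
        gvcf = f"{label}.g.vcf.gz"  # hypothetical intermediate file name
        gvcfs.append(gvcf)
        # Per-sample calling in GVCF mode.
        commands.append(["gatk", "HaplotypeCaller", "-R", reference,
                         "-I", bam, "-O", gvcf, "-ERC", "GVCF"])
    # Merge the per-sample GVCFs into one cohort GVCF.
    combine = ["gatk", "CombineGVCFs", "-R", reference]
    for gvcf in gvcfs:
        combine += ["--variant", gvcf]
    combine += ["-O", "cohort.g.vcf.gz"]
    commands.append(combine)
    # Joint genotyping yields a VCF with one column per sample.
    commands.append(["gatk", "GenotypeGVCFs", "-R", reference,
                     "-V", "cohort.g.vcf.gz", "-O", output_vcf])
    return commands


for command in joint_calling_commands(
    "human_g1k_v37.fasta",
    {"SRR475137": "SRR475137.bam", "SRR475185": "SRR475185.bam"},
    "SRR475137_SRR475185.joint.vcf.gz",  # hypothetical output name
):
    print(shlex.join(command))
```

The resulting file would then be the single `vcf` entry shared by the single-cell and bulk configs.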

Given this has caused some confusion, we will update the README to clarify.

I think what @cbohrson is saying (correct me otherwise) is that the program may be expecting all samples, both single cell and bulk, to come from a single multisample VCF, instead of two VCFs (one for single cell, one for bulk). If that is the case, I believe I was not doing this so I wonder if that was the cause of my error.

OK, I previously ran GATK in single-sample mode only. I'll run GATK again to create a multisample VCF file, then try LiRA again. Thanks a lot!