raphael-group / decifer

DeCiFer is an algorithm that simultaneously selects mutation multiplicities and clusters SNVs by their corresponding descendant cell fractions (DCF).


create_input.ipynb fails to generate input file

gbnci opened this issue

Hello, I have some issues using create_input.ipynb to generate the input file for the decifer analysis. I believe I followed the instructions exactly (the four steps on GitHub: https://github.com/raphael-group/decifer/tree/main/scripts/input_from_varscan).
Step1: I generated mpileup files for all 3 samples (one normal and two tumors), then ran varscan on the two tumor samples and generated two snp files. The snp file format looks like this:
chrom position ref var normal_reads1 normal_reads2 normal_var_freq normal_gt tumor_reads1 tumor_reads2 tumor_var_freq tumor_gt somatic_status variant_p_value somatic_p_value tumor_reads1_plus tumor_reads1_minus tumor_reads2_plus tumor_reads2_minus normal_reads1_plus normal_reads1_minus normal_reads2_plus normal_reads2_minus
chr22 10514994 G A 4 10 71.43% R 10 10 50% R Germline 1.6951461011986938E-8 0.9470358906568122 9 1 10 0 3 1 8 2
chr22 10521826 G A 3 15 83.33% A 1 24 96% A Germline 2.687210146800099E-20 0.19009804715987358 1 0 14 10 0 3 11 4
chr22 10522591 C A 8 4 33.33% M 7 5 41.67% M Germline 7.796188798107704E-4 0.500000000000003 2 5 1 4 5 3 3 1

Step2: I generated an mpileup.tsv file for each tumor sample, with a format like this:
chr22 10514994 G,A,<*> 16,14,0
chr22 10521826 G,A,<*> 21,31,0
chr22 10522591 C,A,<*> 7,5,0
chr22 10522819 C,T,<*> 16,26,0
chr22 10557811 T,C,<*> 49,38,0
chr22 10558043 C,T,<*> 82,12,0
chr22 10559720 T,C,<*> 26,50,0
Step3: I used the best.seg.ucn file generated by HATCHet as the CNA file, with a format like this:
#CHR START END SAMPLE cn_normal u_normal cn_clone1 u_clone1 cn_clone2 u_clone2
chr1 1 1215461 Case10_Metastasis 1|1 0.364014 5|1 0.605986 4|2 0.03
chr1 1 1215461 Case10_Tumor 1|1 0.37192 5|1 0.0581929 4|2 0.569887
chr1 1215461 15786182 Case10_Metastasis 1|1 0.364014 6|2 0.605986 4|2 0.03
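One thing that can be checked when the output has no rows is whether the SNV positions fall inside a copy-number segment with matching chromosome and sample labels. The sketch below is a diagnostic only and does not reproduce the notebook's own logic. It assumes the whitespace-separated layout shown above and treats segment ends as inclusive; `load_segments` and `covering_segment` are hypothetical helpers, not part of decifer or HATCHet.

```python
import io


def load_segments(handle):
    """Read segment rows as (chrom, start, end, sample), skipping '#' header lines."""
    segments = []
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, start, end, sample = line.split()[:4]
        segments.append((chrom, int(start), int(end), sample))
    return segments


def covering_segment(segments, chrom, pos, sample):
    """Return the first segment of `sample` that contains chrom:pos, or None."""
    for seg_chrom, start, end, seg_sample in segments:
        # Inclusive bounds are an assumption made for this check.
        if seg_chrom == chrom and seg_sample == sample and start <= pos <= end:
            return (seg_chrom, start, end, seg_sample)
    return None


# Two rows copied from the excerpt above.
example = io.StringIO(
    "#CHR START END SAMPLE cn_normal u_normal cn_clone1 u_clone1 cn_clone2 u_clone2\n"
    "chr1 1 1215461 Case10_Metastasis 1|1 0.364014 5|1 0.605986 4|2 0.03\n"
    "chr1 1 1215461 Case10_Tumor 1|1 0.37192 5|1 0.0581929 4|2 0.569887\n"
)
segments = load_segments(example)
print(covering_segment(segments, "chr1", 500000, "Case10_Tumor"))
print(covering_segment(segments, "chr22", 10514994, "Case10_Tumor"))
```

If most SNVs come back as `None`, mismatched chromosome or sample names between the files would be a plausible thing to look at first.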

Step4: For create_input.ipynb, I first changed the file paths and names accordingly in the notebook and then ran these commands on our server:
module load jupyter
module load python
jupyter execute create_input.ipynb

After minutes to hours, the job finished and produced two files. The purity file seems right, but "decifer.input.tsv" contains only the header:
############################################
$more decifer.input.tsv
245440 #characters
2 #samples
#sample_index sample_label character_index character_label ref var
###############################################
There is no other data in the file. I also tried running on the whole genome instead of only chr22 as shown above, with the same output. I reset the p value to 1 and also reduced Minreads in create_input.ipynb, but nothing changed. Could you give me some suggestions about this issue?
Thank you very much for your help.
YW

Hi YW,
Apologies for my delayed reply! Could you instead try the script scripts/vcf_2_decifer.py? Under the section "required input data", this is the script we recommend using. The notebook you've used for varscan input is a bit dated and not supported, as it's tailored specifically to varscan output files instead of VCF files, which are the standard file format for many, if not all, variant callers.
Let me know if you have any further questions.
Brian

commented

That sounds good. Both of those variant callers should output VCF files. Just as a heads up, the vcf_2_decifer.py script works best if you've done multi-sample calling for each patient. That is, if there are multiple tumor samples for a particular patient, do variant calling on all of them simultaneously so that you get a single VCF for the patient with multiple sample columns, one sample column per tumor sample.
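A quick way to confirm that a VCF is multi-sample is to count the sample columns in its `#CHROM` header line. In the VCF format the first nine tab-separated columns are fixed (CHROM through FORMAT), and every column after them is a sample. A minimal sketch, where `vcf_sample_names` is a hypothetical helper and the sample names are made up for illustration:

```python
def vcf_sample_names(lines):
    """Return the sample column names from the #CHROM header line of a VCF."""
    for line in lines:
        if line.startswith("#CHROM"):
            # Columns 1-9 are CHROM..FORMAT; the rest are sample names.
            return line.rstrip("\r\n").split("\t")[9:]
    raise ValueError("no #CHROM header line found")


# Made-up header for illustration; pass an open file handle in practice.
header = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ttumor1\ttumor2\n",
]
names = vcf_sample_names(header)
print(len(names), names)
```

For a patient with two tumor samples called jointly, two sample names would be expected here (plus the normal, if it was included in the call).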