This pipeline implements the method described in "Estimating tumor mutational burden from RNA-sequencing without a matched-normal sample" and should be used after running RNA-MuTect. The pipeline runs on Linux machines only.
- python3 packages:
  - pandas (1.1.5+)
  - NumPy (1.19.4+)
  - scikit-learn (0.23.2+)
  - matplotlib (3.3.3+)
  - CAPY python package (0.1+)
- Funcotator as part of the gatk package (4.2.6.1+)
- Samtools
- ~300 GB of disk space: the 'resource' folder takes about 230 GB, and additional space is needed depending on the number of samples.
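The Python package minimums listed above can be checked before a run. The following is a minimal sketch, not part of the pipeline: the helper names `parse_version` and `check_requirements` are hypothetical, and the CAPY package is left out because its installed distribution name is not given here.

```python
from importlib import metadata

# Minimum versions copied from the requirements list.
# Keys are the distribution names as installed by pip.
MINIMUM_VERSIONS = {
    "pandas": "1.1.5",
    "numpy": "1.19.4",
    "scikit-learn": "0.23.2",
    "matplotlib": "3.3.3",
}


def parse_version(text):
    """Turn '1.19.4' into (1, 19, 4); suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:  # pad so '2.0' compares like '2.0.0'
        parts.append(0)
    return tuple(parts)


def check_requirements(minimums=MINIMUM_VERSIONS):
    """Return a list of readable problems; an empty list means all is fine."""
    problems = []
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed (need {minimum}+)")
            continue
        if parse_version(installed) < parse_version(minimum):
            problems.append(f"{name} {installed} is older than {minimum}")
    return problems
```

Calling `check_requirements()` and printing each returned message gives a quick report; Funcotator and Samtools are command-line tools and are not covered by this check.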
Directory names can be changed in the configuration file.
Data/
    'cancer_dir'/ #project-specific
        input/
            call_stats/
            maf/
resource/
    BCF_tools_dbs/
        merged.vcf.gz #ESP db
    pon/
        'RNA_binary'
        'DNA_binary'
    reference/
        'reference.fasta'
        'reference.fasta.fai'
        'reference.dict'
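The empty folder skeleton can be created in one step. This is a sketch under two assumptions: that 'resource' sits next to 'Data' at the repository root, and that the folder names match the ones configured in config.py. The function name `create_layout` is hypothetical and not part of the pipeline.

```python
from pathlib import Path


def create_layout(root, cancer_dir):
    """Create the expected folders; existing folders are left untouched."""
    root = Path(root)
    folders = [
        root / "Data" / cancer_dir / "input" / "call_stats",
        root / "Data" / cancer_dir / "input" / "maf",
        # Assumption: 'resource' is a sibling of 'Data'.
        root / "resource" / "BCF_tools_dbs",
        root / "resource" / "pon",
        root / "resource" / "reference",
    ]
    for folder in folders:
        folder.mkdir(parents=True, exist_ok=True)
    return folders
```

For example, `create_layout(".", "my_project")` run from the repository root prepares the folders for a project directory named 'my_project'; the data files themselves still have to be downloaded into them.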
The config.py file should be configured by the user.
- Directory configuration:
  - 'cancer_type' is the name of the project-specific directory.
  - other directory and file names can be changed in this file if desired.
- Learning configuration: this section sets the learning parameters and features, which can be adjusted.
- Environment configuration sets the locations of external tools: 'tools' is the location of the samtools and GATK binaries.
- As mentioned before, the input of this tool is the output of RNA-MuTect.
A cloud implementation can be found in Terra.
- RNA-MuTect can be run with any normal sample; a matched-normal sample is not required.
- Details for location of PoN files are in the manuscript under 'Data Availability'.
- The hg19 human reference genome files should be used.
- After downloading the repo, configure the directories using the config.py file:
  - Under the 'Data' folder:
    - create a 'cancer_dir' folder and configure its name in config.py.
  - Under the 'cancer_dir' folder:
    - create an 'input' folder, and under it 'maf' and 'call_stats' folders.
    - download the 'call_stats_capture_paper_v1_3' files (RNA-MuTect output) into the 'call_stats' folder.
    - download the 'maf_file_rna_final_paper_v1_3' files (RNA-MuTect output) into the 'maf' folder.
  - Under the 'resource' folder:
    - download the PoN binary files (DNA and RNA) into the 'pon' folder.
    - download the reference files (including the .fasta.fai and .dict files) into the 'reference' folder.
    - configure the downloaded file names in config.py.
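Before starting a run, it can help to confirm that the downloads from the setup steps are in place. The sketch below is an illustration only: `missing_inputs` is a hypothetical helper, and the paths passed to it are assumed to be the ones configured in config.py.

```python
from pathlib import Path


def missing_inputs(cancer_path, reference_fasta, pon_files):
    """Return a list of missing inputs; an empty list means nothing is missing."""
    cancer_path = Path(cancer_path)
    reference_fasta = Path(reference_fasta)
    missing = []

    # The RNA-MuTect outputs must be present under input/.
    for sub in ("call_stats", "maf"):
        folder = cancer_path / "input" / sub
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(f"{folder} is missing or empty")

    # reference.fasta needs reference.fasta.fai and reference.dict beside it.
    expected = [
        reference_fasta,
        Path(str(reference_fasta) + ".fai"),
        reference_fasta.with_suffix(".dict"),
    ]
    expected.extend(Path(p) for p in pon_files)
    for path in expected:
        if not path.is_file():
            missing.append(f"{path} not found")
    return missing
```

The function only checks that files exist; it does not verify that the reference is hg19 or that the PoN binaries are valid.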
Run RNA-Mutect-WMN.py
When the tool finishes successfully, a 'results' directory is created under the specified 'cancer_dir'. The 'results' directory contains:
- Train results
- mean recall and precision scores
- mean recall and precision scores per sample + boxplot
- 'somatics.maf': MAF file of all the variants classified as somatic by the tool. This file should be further filtered using the RNA-MuTect filtering steps described in the paper.
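As a quick look at the output, the number of variants per sample in 'somatics.maf' can be tallied. This is a sketch that assumes the file is tab-delimited with a header row and uses the standard MAF column name `Tumor_Sample_Barcode`; the actual column names in the tool's output may differ, and `variants_per_sample` is a hypothetical helper. The counts are taken before the additional RNA-MuTect filtering, so they are not final mutational burden estimates.

```python
import csv
from collections import Counter


def variants_per_sample(maf_path, sample_column="Tumor_Sample_Barcode"):
    """Count rows per sample in a tab-delimited MAF file."""
    counts = Counter()
    with open(maf_path, newline="") as handle:
        # MAF files may begin with '#' comment lines; skip them.
        rows = (line for line in handle if not line.startswith("#"))
        for record in csv.DictReader(rows, delimiter="\t"):
            counts[record[sample_column]] += 1
    return dict(counts)
```

If the sample identifier is stored under a different column, pass its name through the `sample_column` argument.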