RNA-Mutect-WMN

This pipeline implements the method described in "Estimating tumor mutational burden from RNA-sequencing without a matched-normal sample" and should be run after RNA-MuTect. It runs on Linux machines only.

Requirements

python3 packages:
- pandas (1.1.5+)
- NumPy (1.19.4+)
- scikit-learn (0.23.2+)
- matplotlib (3.3.3+)
CAPY Python package (0.1+)
Funcotator, as part of the GATK package (4.2.6.1+)
Samtools:
- bgzip (1.11+)
- bcftools (1.8+)
- tabix (1.11+)
~300 GB of disk space: the 'resource' folder takes around 230 GB, and additional space is required depending on the number of samples.
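
A quick pre-flight check of the requirements above might look like the following minimal sketch. It assumes the pip distribution names listed in MIN_VERSIONS and that the command-line tools are discoverable on PATH (the pipeline itself locates them through config.py, so this is only an approximation). The CAPY package is left out because its distribution name is not stated here, and the versions of the command-line tools are not checked.

```python
import re
import shutil
from importlib import metadata

# Minimum versions copied from the requirements list; the pip
# distribution names are an assumption.
MIN_VERSIONS = {
    "pandas": "1.1.5",
    "numpy": "1.19.4",
    "scikit-learn": "0.23.2",
    "matplotlib": "3.3.3",
}
# Executable names are assumed to match the tool names.
TOOLS = ["gatk", "bgzip", "bcftools", "tabix"]


def version_tuple(version):
    """Turn '1.19.4' into (1, 19, 4); stop at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_packages(minimums=MIN_VERSIONS):
    """Return a list of human-readable problems (empty if all is well)."""
    problems = []
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed (need {minimum}+)")
            continue
        if version_tuple(installed) < version_tuple(minimum):
            problems.append(f"{name}: found {installed}, need {minimum}+")
    return problems


def check_tools(tools=TOOLS):
    """Report tools that cannot be found on PATH."""
    return [f"{tool}: not found on PATH" for tool in tools
            if shutil.which(tool) is None]


def free_gb(path="."):
    """Free disk space, in GB, on the filesystem holding 'path'."""
    return shutil.disk_usage(path).free / 1e9


if __name__ == "__main__":
    for problem in check_packages() + check_tools():
        print(problem)
    print(f"free space: {free_gb():.0f} GB (about 300 GB recommended)")
```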

Input files and directory tree

Directory names can be changed in the configuration file.

Data/
    'cancer_dir'/                               # project-specific
        input/
            call_stats/
            maf/
    resource/
        BCF_tools_dbs/
            merged.vcf.gz                       # ESP db
        pon/
            'RNA_binary'
            'DNA_binary'
        reference/
            'reference.fasta'
            'reference.fasta.fai'
            'reference.dict'
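
The layout above can be checked before a run with a small helper. This is a minimal sketch, not part of the pipeline: the function names and arguments are hypothetical, and the quoted placeholder names from the tree ('cancer_dir', the PoN binaries, the reference FASTA) are passed in by the caller to match whatever is set in config.py.

```python
from pathlib import Path


def expected_paths(data_root, cancer_dir, rna_pon, dna_pon, reference_fasta):
    """List every directory and file shown in the tree above."""
    data_root = Path(data_root)
    project_input = data_root / cancer_dir / "input"
    resource = data_root / "resource"
    fasta = resource / "reference" / reference_fasta
    return [
        project_input / "call_stats",
        project_input / "maf",
        resource / "BCF_tools_dbs" / "merged.vcf.gz",
        resource / "pon" / rna_pon,
        resource / "pon" / dna_pon,
        fasta,
        Path(str(fasta) + ".fai"),   # reference.fasta.fai
        fasta.with_suffix(".dict"),  # reference.dict
    ]


def missing_paths(data_root, cancer_dir, rna_pon, dna_pon, reference_fasta):
    """Return the expected paths that do not exist yet."""
    return [
        path
        for path in expected_paths(
            data_root, cancer_dir, rna_pon, dna_pon, reference_fasta
        )
        if not path.exists()
    ]
```

Calling missing_paths("Data", "my_cancer", "rna_pon.bin", "dna_pon.bin", "reference.fasta") returns an empty list once everything is in place; the argument values here are placeholders only.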

Configuration

The config.py file should be configured by the user.

Directory configuration:
1. 'cancer_type' is the name of the project-specific directory.
2. Other directory and file names can also be changed in this file if desired.
Learning configuration: this section sets the learning parameters and features, which can be adjusted.
Environment configuration: this section sets the locations of external tools.
1. 'tools' is the location of the samtools and GATK binaries.

Running instructions

Inputs preparation

As mentioned above, the input to this tool is the output of RNA-MuTect. A cloud implementation can be found in Terra.
- RNA-MuTect can be run with any normal sample; the matched-normal sample is not required.
The locations of the PoN files are given in the manuscript under 'Data Availability'.
The hg19 human reference genome files should be used.
After downloading the repo, configure the directories using the config.py file:
- Under the 'Data' folder:
  - Create a 'cancer_dir' folder and configure its name in config.py.
- Under the 'cancer_dir' folder:
  - Create an 'input' folder, and under it 'maf' and 'call_stats' folders.
  - Download the 'call_stats_capture_paper_v1_3' files (RNA-MuTect output) into the 'call_stats' folder.
  - Download the 'maf_file_rna_final_paper_v1_3' files (RNA-MuTect output) into the 'maf' folder.
- Under the 'resource' folder:
  - Download the PoN binary files (DNA and RNA) into the 'pon' folder.
  - Download the reference files (including the .fasta.fai and .dict files) into the 'reference' folder.
  - Configure the downloaded file names in config.py.
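
The folder-creation steps above can be scripted. The following is a minimal sketch under the assumption that the default folder names from the directory tree are kept; create_skeleton is a hypothetical helper, not something shipped with the repo, and it does not download any files or edit config.py.

```python
from pathlib import Path


def create_skeleton(data_root, cancer_dir):
    """Create the folders named in the preparation steps, if missing."""
    data_root = Path(data_root)
    folders = [
        data_root / cancer_dir / "input" / "call_stats",
        data_root / cancer_dir / "input" / "maf",
        data_root / "resource" / "pon",
        data_root / "resource" / "reference",
    ]
    for folder in folders:
        # parents=True builds intermediate folders; exist_ok=True makes
        # the call safe to repeat on an existing tree.
        folder.mkdir(parents=True, exist_ok=True)
    return folders


if __name__ == "__main__":
    # "my_cancer" is a placeholder; use the name configured in config.py.
    for folder in create_skeleton("Data", "my_cancer"):
        print("ready:", folder)
```

The RNA-MuTect outputs, PoN binaries, and reference files still need to be downloaded into these folders by hand.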

Run pipeline

Run the RNA-Mutect-WMN.py script with python3.

Results

When the tool finishes successfully, a 'results' directory is created under the specified 'cancer_dir'. Inside the 'results' directory:

Train results
1. mean recall and precision scores
2. mean recall and precision scores per sample + boxplot
'somatics.maf': a MAF file of all the variants classified as somatic by the tool. This file should be further filtered using the RNA-MuTect filtering steps described in the paper.
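
For a first look at 'somatics.maf', the number of called variants per sample can be tallied with the standard library. This is a minimal sketch that assumes the file is tab-separated and carries the standard MAF column Tumor_Sample_Barcode; the function name is hypothetical. The counts are raw, taken before the additional RNA-MuTect filtering, so they are not a final mutational burden estimate.

```python
import csv
from collections import Counter


def count_variants_per_sample(maf_path, sample_column="Tumor_Sample_Barcode"):
    """Count rows per sample in a tab-separated MAF file."""
    counts = Counter()
    with open(maf_path, newline="") as handle:
        # MAF files may begin with '#' comment lines; drop them so the
        # first remaining line is the column header.
        rows = (line for line in handle if not line.startswith("#"))
        for record in csv.DictReader(rows, delimiter="\t"):
            counts[record[sample_column]] += 1
    return counts


if __name__ == "__main__":
    # The path is a placeholder built from the default folder names.
    counts = count_variants_per_sample("Data/my_cancer/results/somatics.maf")
    for sample, n in counts.most_common():
        print(sample, n)
```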
