A Snakemake pipeline for calling somatic SNVs, fusions and CNAs in PacBio long-read single-cell RNA-seq cancer samples using the Trinity Cancer Transcriptome Analysis Toolkit (CTAT), and for inferring clones based on them.
LongSom takes a bam file and a barcodes file as input, then uses ctat-mutations to call SNVs and ctat-LR-fusion to call fusions. It then uses BnpC, a Bayesian non-parametric clustering method, to cluster cells into subclones based on the called SNVs and fusions. In parallel, LongSom uses inferCNV to call CNAs and cluster cells into subclones based on them.
- Python 3.X
- Mamba/Conda
- Singularity (https://sylabs.io/docs/)
First, download LongSom from GitHub and change to the directory:
git clone https://github.com/cbg-ethz/LongSom
cd LongSom
Install Snakemake:
mamba create -c conda-forge -c bioconda -n LongSom snakemake
Using Mamba is highly recommended. For more information, visit Snakemake's installation guide.
Then, activate the environment:
conda activate LongSom
This environment should be activated each time you want to use LongSom.
You can download Subread and install it this way:
wget https://sourceforge.net/projects/subread/files/subread-2.0.6/subread-2.0.6-source.tar.gz
tar zxvf subread-2.0.6-source.tar.gz
cd subread-2.0.6-source/src/
make -f Makefile.Linux
Download the Singularity images (simg) of these three tools:
- ctat-LR-fusion (https://data.broadinstitute.org/Trinity/CTAT_SINGULARITY/ctat-LR-fusion/) (tested on V0.13.0)
- ctat InferCNV (https://data.broadinstitute.org/Trinity/CTAT_SINGULARITY/InferCNV/) (tested on V1.16.0)
- ctat-mutations (https://data.broadinstitute.org/Trinity/CTAT_SINGULARITY/CTAT_MUTATIONS/) (tested on V4.0.0)
Place all simg files in the bin folder.
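Before launching the pipeline, it can help to confirm that the images actually landed in the bin folder. The following is a minimal sketch, not part of LongSom: the helper name `find_simg_images` is hypothetical, and it only counts files ending in `.simg`, since the exact image file names depend on the versions you downloaded.

```python
from pathlib import Path


def find_simg_images(bin_dir="bin"):
    """Return the sorted names of all .simg files directly inside bin_dir.

    Hypothetical helper: it does not check which tool each image belongs to.
    """
    return sorted(p.name for p in Path(bin_dir).glob("*.simg"))


if __name__ == "__main__":
    images = find_simg_images()
    print(f"Found {len(images)} Singularity image(s) in bin/")
    for name in images:
        print(f"  {name}")
    if len(images) < 3:
        print("Expected three images: ctat-LR-fusion, InferCNV and ctat-mutations.")
```

Run it from the LongSom directory so that the relative `bin` path resolves correctly.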
Follow BnpC installation instructions (create a conda environment called BnpC).
File requirements:
- .bam file (with BC as barcode tag)
- barcodes .txt file
- genome .fa file (hg38)
- transcriptome .gtf file (https://www.gencodegenes.org/human/)
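A quick pre-flight check of the four input files can catch path typos before a long run. The sketch below is an illustration under stated assumptions: the function name `check_inputs` and the dictionary keys are hypothetical, and the check is limited to the file extensions listed above and to file existence. It does not verify the BC barcode tag or the genome build.

```python
from pathlib import Path

# Extensions taken from the file requirements list; the keys are hypothetical labels.
REQUIRED_SUFFIXES = {
    "bam": ".bam",
    "barcodes": ".txt",
    "genome": ".fa",
    "annotation": ".gtf",
}


def check_inputs(paths):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    for key, suffix in REQUIRED_SUFFIXES.items():
        value = paths.get(key)
        if value is None:
            problems.append(f"{key}: no path given")
            continue
        path = Path(value)
        if path.suffix != suffix:
            problems.append(f"{key}: expected a {suffix} file, got {path.name}")
        elif not path.is_file():
            problems.append(f"{key}: file not found: {path}")
    return problems
```

Calling `check_inputs({"bam": "sample.bam", ...})` with your own paths returns one message per missing or misnamed file.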
Before each use, activate the LongSom environment:
conda activate LongSom
The LongSom wrapper script run_LongSom.py can be run with the following shell command:
./run_LongSom
It should run in less than a day on an HPC cluster. Output files can be found in the results folder.
- config file
  - input directory
    Before running the pipeline, the config/config.yaml file needs to be adapted to contain the path to the input bam files. It is provided in the first section (specific) of the config file.
  - resource information
    In addition to the input path, further resource information must be provided in the section specific. This information primarily specifies the genomic reference used for read mapping and the transcriptomic reference required for isoform classification. An example config.yaml file ready for adaptation, as well as a brief description of the relevant config blocks, is provided in the directory config/.
- reference files
  - A genome fasta file (http://genome.ucsc.edu/cgi-bin/hgGateway?db=hg38)
  - A GENCODE gene annotation gtf file (https://www.gencodegenes.org/human/)
- sample map
  - Provide a sample map file, i.e. a tab-delimited text file listing all samples that should be analysed and how many bam files are associated with each of them (see example below). The sample ID will be used to name files and identify the sample throughout the pipeline.
  - Sample map example:

    sample     files
    SampleA    2
    SampleB    4
    SampleC    2
- input data
  - This pipeline takes as input PacBio CCS bam files containing either concatenated or unconcatenated reads. If you use concatenated reads as input, files should be named SampleA_1.bam, SampleA_2.bam, SampleB_1.bam, etc. (the sample name should correspond to the sample map). If you use unconcatenated reads as input, files should be named SampleA_1.subreads.bam, etc.
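The relationship between the sample map and the bam file names can be sketched in a few lines of Python. This is an illustration only, not code from LongSom: the function names are hypothetical, the column headers `sample` and `files` are taken from the example sample map, and the numbering from 1 up to the file count is inferred from the example names (SampleA_1.bam, SampleA_2.bam).

```python
import csv
import io
from pathlib import Path


def read_sample_map(handle):
    """Parse a tab-delimited sample map into {sample_id: number_of_bam_files}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["sample"]: int(row["files"]) for row in reader}


def expected_bam_names(sample_map, concatenated=True):
    """List the bam file names implied by the sample map.

    Assumes files are numbered 1..N per sample, as in the examples.
    """
    suffix = ".bam" if concatenated else ".subreads.bam"
    return [
        f"{sample}_{index}{suffix}"
        for sample, count in sample_map.items()
        for index in range(1, count + 1)
    ]


def missing_bams(sample_map, input_dir, concatenated=True):
    """Return the expected bam names that are not present in input_dir."""
    return [
        name
        for name in expected_bam_names(sample_map, concatenated)
        if not (Path(input_dir) / name).is_file()
    ]


# The example sample map, written with explicit tab characters.
example = "sample\tfiles\nSampleA\t2\nSampleB\t4\nSampleC\t2\n"
sample_map = read_sample_map(io.StringIO(example))
print(expected_bam_names(sample_map)[:3])
# ['SampleA_1.bam', 'SampleA_2.bam', 'SampleB_1.bam']
```

To check a real input directory, open your own sample map file, pass the handle to `read_sample_map`, and call `missing_bams` with the resulting dictionary and the input path from your config file.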
Arthur Dondi, Nico Borgsmüller, Pedro Ferreira, Brian Haas, Francis Jacob, Viola Heinzelmann-Schwarz, Tumor Profiler Consortium, Niko Beerenwinkel. De novo detection of somatic variants in long-read single-cell RNA sequencing data. Available on bioRxiv soon.