cbg-ethz / LongSom

LongSom tool for long-reads

LongSom

A Snakemake pipeline for calling somatic SNVs, fusions and CNAs in PacBio long-read single-cell RNA-seq cancer samples using the Trinity Cancer Transcriptome Analysis Toolkit (CTAT), and for inferring clones from them.

LongSom takes a bam file and a barcodes file as input. It uses ctat-mutations to call SNVs and ctat-LR-fusion to call fusions, then uses the Bayesian non-parametric clustering tool BnpC to cluster cells into subclones based on the called SNVs and fusions. In parallel, LongSom uses inferCNV to call CNAs and cluster cells into subclones based on them.

Contents

Requirements

Installation

Clone repository

First, download LongSom from GitHub and change to the directory:

git clone https://github.com/cbg-ethz/LongSom
cd LongSom

Create a conda environment with Snakemake

Install Snakemake:

mamba create -c conda-forge -c bioconda -n LongSom snakemake

Using Mamba is highly recommended. For more information, visit Snakemake's installation guide.

Then, activate the environment:

conda activate LongSom

This environment should be activated each time you want to use LongSom.

Install Subread (featureCounts)

You can download Subread and install it this way:

wget https://sourceforge.net/projects/subread/files/subread-2.0.6/subread-2.0.6-source.tar.gz
tar zxvf subread-2.0.6-source.tar.gz
cd subread-2.0.6-source/src/
make -f Makefile.Linux

Install CTAT software

Download the simg of those three tools:

Place all simg files in the bin folder.
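
As a quick sanity check after this step, the short Python sketch below lists the image files it finds in the bin folder. It is not part of LongSom: the helper name find_simg is hypothetical, and it assumes the images carry a .simg file extension and that the script is run from the repository root.

```python
from pathlib import Path


def find_simg(bin_dir="bin"):
    """Return the sorted names of *.simg files found directly in bin_dir.

    Assumption: the downloaded images end in ".simg".
    """
    folder = Path(bin_dir)
    if not folder.is_dir():
        # A missing folder simply means no images have been placed yet.
        return []
    return sorted(p.name for p in folder.glob("*.simg"))


if __name__ == "__main__":
    images = find_simg()
    print(f"Found {len(images)} simg file(s) in bin/: {images}")
    if len(images) < 3:
        # The installation step above mentions three tools.
        print("Fewer than three images found; check the download step.")
```

If fewer than three files are reported, one of the downloads is probably missing or was saved outside the bin folder.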

Install BnpC

Follow BnpC installation instructions (create a conda environment called BnpC).

Usage

File requirements:

Before each use, activate the LongSom environment:

conda activate LongSom

The LongSom wrapper script run_LongSom.py can be run with the following shell command:

./run_LongSom

It should run in less than a day on an HPC cluster. Output files are written to the results folder.

Before running the pipeline

  • config file

    • input directory: Before running the pipeline, adapt the config/config.yaml file so that it contains the path to the input bam files. This path is set in the first section (specific) of the config file.
    • resource information: In addition to the input path, further resource information must be provided in the specific section. This mainly specifies the genomic reference used for read mapping and the transcriptomic reference required for isoform classification. An example config.yaml file ready for adaptation, together with a brief description of the relevant config blocks, is provided in the config/ directory.
  • reference files

  • sample map

    • Provide a sample map file, i.e. a tab-delimited text file listing all samples to be analysed and how many bam files are associated with each (see example below). The sample ID is used to name files and identify the sample throughout the pipeline.
    • Sample map example:
    sample     files
    SampleA     2
    SampleB     4
    SampleC     2
    
  • input data

    • This pipeline takes as input PacBio CCS bam files of either concatenated or unconcatenated reads. If you use concatenated reads as input, files should be named SampleA_1.bam, SampleA_2.bam, SampleB_1.bam, etc. (the sample name should correspond to the sample map). If you use unconcatenated reads as input, files should be named SampleA_1.subreads.bam, etc.
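
The link between the sample map and the expected input file names can be sketched in a few lines of Python. This is an illustration only, not code from LongSom: the function name expected_bam_files is hypothetical, and it assumes that files are numbered consecutively from 1 up to the count given in the files column, as the SampleA_1.bam, SampleA_2.bam pattern suggests.

```python
import csv
import io


def expected_bam_files(sample_map_text, concatenated=True):
    """List the bam file names implied by a tab-delimited sample map.

    Assumption: files are numbered 1..N, where N is the "files" column.
    """
    suffix = ".bam" if concatenated else ".subreads.bam"
    reader = csv.DictReader(io.StringIO(sample_map_text), delimiter="\t")
    names = []
    for row in reader:
        sample = row["sample"].strip()
        n_files = int(row["files"])
        names.extend(f"{sample}_{i}{suffix}" for i in range(1, n_files + 1))
    return names


# The sample map example from above, with tab-separated columns.
SAMPLE_MAP = "sample\tfiles\nSampleA\t2\nSampleB\t4\nSampleC\t2\n"

if __name__ == "__main__":
    for name in expected_bam_files(SAMPLE_MAP):
        print(name)
```

Comparing this list against the contents of the input directory before launching the pipeline is a cheap way to catch a misnamed or missing file.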

Cite

Arthur Dondi, Nico Borgsmüller, Pedro Ferreira, Brian Haas, Francis Jacob, Viola Heinzelmann-Schwarz, Tumor Profiler Consortium, Niko Beerenwinkel. De novo detection of somatic variants in long-read single-cell RNA sequencing data. Available on bioRxiv soon.
