Pipeline for extracting variants from single cell genomics data
mutCaller is a command line tool written in Rust for identifying SNVs in single cell data that contains a cell barcode (CB) and UMI, such as 10X Genomics data files. With mutCaller, you can supply variants (as a TSV file or, since version 0.40, a VCF file) and obtain a counts file of the number of variants that map to your query and their associated cell barcodes and UMIs. mutCaller takes as input either unaligned fastqs (UNALIGNED function) or an aligned BAM file with the CB and UMI as tags (ALIGNED function).
mutCaller is written in Rust. To install the Rust compiler, go to https://www.rust-lang.org/tools/install. mutCaller also requires that two additional tools be available on the command line: minimap2 (https://github.com/lh3/minimap2) and samtools (https://samtools.github.io).
To install mutCaller:
- clone the repository by typing
git clone https://github.com/furlan-lab/mutCaller.git
from the location you want to build from
- enter the cloned repo by typing
cd mutCaller
- build by typing
cargo build --release
- the build process will create a self-contained binary executable file called mutCaller in the
target/release
directory
- move this binary elsewhere if desired (ideally somewhere referenced by your PATH environment variable, e.g. ~/.local/bin)
version 0.40 - 6/14/23 - major code cleanup, added STAR aligner option, added vcf file support including a VARIANT subfunction that allows you to troubleshoot variants files
version 0.30 - 5/9/23 - code cleanup and implemented bam option
version 0.22 - 5/9/23 - first alpha release
Users can generate a simple 'variants_file' (a tab-separated .tsv file) that lists the variants to be counted. Each line of the variants file should look like this:
seqname\tstart\tref_nt\tquery_nt\tname
More detailed explanation:
- seqname - e.g. 'chr1', 'chr2', etc
- start - 1-indexed position (e.g., '112450407')
- ref_nt - nucleotide of the reference at this location
- query_nt - nucleotide of your query
- name - a string used to name the variant in the output file
Three full lines should look something like this:
seq start ref_nt query_nt name
chr12 112450407 A G PTPN11_227A>G
chr2 208248389 G A IDH1_132G>A
chr17 7674220 C T TP53_248C>T
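Because a malformed variants file is an easy mistake to make, it can help to check each line before running the pipeline. The following is a minimal sketch of such a check, not mutCaller's own parser: the `Variant` struct and the `parse_variant_line` and `parse_nt` functions are hypothetical names introduced here, and the sketch assumes exactly five tab-separated fields with single-letter A/C/G/T nucleotides.

```rust
/// One row of a variants TSV (hypothetical struct, for illustration only).
#[derive(Debug, PartialEq)]
struct Variant {
    seqname: String,
    start: u64, // 1-indexed position, as described above
    ref_nt: char,
    query_nt: char,
    name: String,
}

/// Parse a single nucleotide field; assumes only A, C, G or T are valid.
fn parse_nt(s: &str) -> Result<char, String> {
    let mut chars = s.chars();
    match (chars.next(), chars.next()) {
        // Exactly one character, and it is a recognized nucleotide.
        (Some(c), None) if "ACGT".contains(c.to_ascii_uppercase()) => {
            Ok(c.to_ascii_uppercase())
        }
        _ => Err(format!("expected a single nucleotide (A, C, G or T), found '{}'", s)),
    }
}

/// Parse one tab-separated line of the form
/// seqname\tstart\tref_nt\tquery_nt\tname into a `Variant`.
fn parse_variant_line(line: &str) -> Result<Variant, String> {
    let fields: Vec<&str> = line.trim_end().split('\t').collect();
    if fields.len() != 5 {
        return Err(format!(
            "expected 5 tab-separated fields, found {}",
            fields.len()
        ));
    }
    let start: u64 = fields[1]
        .parse()
        .map_err(|_| format!("start is not a valid integer: '{}'", fields[1]))?;
    if start == 0 {
        return Err("start must be 1-indexed (>= 1)".to_string());
    }
    Ok(Variant {
        seqname: fields[0].to_string(),
        start,
        ref_nt: parse_nt(fields[2])?,
        query_nt: parse_nt(fields[3])?,
        name: fields[4].to_string(),
    })
}
```

Calling `parse_variant_line` on each line of the file and reporting the first `Err` is usually enough to catch comma-separated input or a 0-indexed position before a long alignment run.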
The UNALIGNED function also requires a barcode whitelist. v3 3prime and v2 5prime 10X whitelists are provided in the data folder.
To run mutCaller simply type:
mutCaller ALIGNED --bam <file.bam> -s <variants.tsv> -o <folder>
mutCaller UNALIGNED --barcodes_file <barcodes_file> --fastq1 <fastq1> --fastq2 <fastq2> --genome <genome> --variants <variants.tsv>
More examples on how to use mutcaller can be found here
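When mutCaller is driven from another Rust program rather than from the shell, the ALIGNED invocation shown above can be assembled with the standard library. This is only a sketch: the `aligned_command` helper is a hypothetical name, it assumes the binary is on your PATH under the name `mutCaller`, and it uses the long option names listed in the ALIGNED help text.

```rust
use std::process::Command;

/// Build (but do not run) a `mutCaller ALIGNED` invocation.
/// Hypothetical helper; assumes `mutCaller` is reachable via PATH.
fn aligned_command(bam: &str, variants: &str, output: &str, threads: u32) -> Command {
    let threads = threads.to_string();
    let mut cmd = Command::new("mutCaller");
    cmd.arg("ALIGNED")
        .arg("--bam")
        .arg(bam)
        .arg("--variants")
        .arg(variants)
        .arg("--output")
        .arg(output)
        .arg("--threads")
        .arg(&threads);
    cmd
}
```

Calling `.status()` on the returned `Command` runs the tool and yields its exit status, which the caller should check before reading the counts file from the output folder.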
mutcaller 0.4.0
Scott Furlan
Single nucleotide variant counting pipeline for single cell genomics data
USAGE:
mutcaller [SUBCOMMAND]
FLAGS:
-h, --help Prints help information
-V, --version Prints version information
SUBCOMMANDS:
ALIGNED Count variants in previously aligned data
UNALIGNED Count variants after aligning data
VCF variants file debugging
help Prints this message or the help of the given subcommand(s)
Note: this is a work in progress. Use with caution.
mutcaller-ALIGNED
Count variants in previously aligned data
USAGE:
mutcaller ALIGNED [FLAGS] [OPTIONS] --bam <bam> --variants <variants>
FLAGS:
-h, --help Prints help information
-V, --version Prints version information
-v, --verbose use this flag to run in verbose mode
OPTIONS:
-b, --bam <bam> aligned bam file with cell barcode and umi in tags
-c, --cb_tag <cb> bam tag containing cell barcode; default = 'CB'
-o, --output <output> output path (defaults to 'out'); inside folder a counts file will be called
'counts.txt.gz', log will be called mutcaller.log
-q, --qual <qual> filter for variant quality (float); default = 95.0; only used if VCF file type is
supplied
-t, --threads <threads> threads
-u, --umi_tag <umi> bam tag containing umi; default = 'XM'
-s, --variants <variants> path to variants.tsv or vcf file (SNVs only supported currently); For tsv, example
formating = seqname\tstart\tref_nt\tquery_nt\tname; e.g.
chr12,112450407,A,G,PTPN11_227A>G
mutcaller-UNALIGNED
Count variants after aligning data
USAGE:
mutcaller UNALIGNED [FLAGS] [OPTIONS] --barcodes_file <barcodes_file> --fastq1 <fastq1> --fastq2 <fastq2> --genome <genome> --variants <variants>
FLAGS:
-h, --help Prints help information
-k, --keep_files use this flag to keep files (default is remove intermediate files)
-V, --version Prints version information
-v, --verbose use this flag to run in verbose mode
OPTIONS:
-a, --aligner <aligner> aligner software - currently minimap (default) and STAR are supported; if not
available on command line supply in loc_aligner argument
-l, --aligner_loc <aligner_loc> path to aligner e.g. /app/software/CellRanger/6.0.1/lib/bin/STAR
-b, --barcodes_file <barcodes_file> barcodes_file
-c, --cb_length <cb_len> length of umi sequence
-i, --fastq1 <fastq1> input fastq with barcodes
-j, --fastq2 <fastq2> input fastq with read
-g, --genome <genome> fasta for minimap2 or transcriptome index for kallisto
-o, --output <output> output path (defaults to 'out'); inside folder a counts file will be called
'counts.txt.gz', log will be called mutcaller.log
-q, --qual <qual> filter for variant quality (float); default = 95.0; only used if VCF file
type is supplied
-r, --read_len <read_len> read 2 length (default 90)
-t, --threads <threads> threads
-u, --umi_length <umi_len> length of umi sequence
-s, --variants <variants> path to variants.tsv or vcf file (SNVs only supported currently); For tsv,
example formating = seqname\tstart\tref_nt\tquery_nt\tname; e.g.
chr12,112450407,A,G,PTPN11_227A>G