seq-collection

A collection of useful sequencing utilities.

FASTQ

fq-dedup
fq-meta

BAM

insert-size

VCF

json
fasta

Usage

Install using nim c sc.nim

Development

As I have done in the past, I intend to develop seq-kit over a period of months or years. In general, this means I'll add a new tool or utility when it becomes apparent that I need it.

However, I am open to PRs, requests, and feedback. Please let me know what you think.

I intend to port some commands over from VCF-kit, but will broaden the applicability to FASTQs, BAMs, and perhaps even other items.

Tools

fq-dedup

The fq-dedup command de-duplicates a FASTQ by read ID (e.g. @@D00446:1:140101_HWI-D00446_0001_C8HN4ANXX:8:2210:1238:2018). Ideally, this should never happen, and I honestly do not know how it happens.

The command uses a Bloom filter to identify duplicates, and has to read through the file twice. If no duplicates are found during the first pass, the command will return 0 and print "No Duplicates Found" to stderr. If you are checking a set of FASTQs to remove duplicates, you can use the following bash to handle cases where duplicates are not found.

(sc fq-dedup myfastq.fq.gz 2> dup.err) | gzip > out.fq.gz
if [[ `head -n 1 dup.err` == "No Duplicates Found" ]]; then
    # Handle case where duplicates are not found (probably by moving file)
    mv myout.fq.gz out.dedup.fq.gz
fi

fq-dedup can read both .gz and raw text. It sends the deduplicated FASTQ to stdout.

Be sure to use -d:release compiled binaries with this command otherwise its really slow.

Output

Once complete, the following is sent to stderr:

total_reads: 2500000
duplicates 1086043
false-positive: 0
false-positive-rate: 0.0

The false-positive values are for diagnostics only based on reads initially labeled as duplicates by the bloom filter that were later found not to be.

Benchmark

2.5M Reads; 1M+ duplicates; 2015 MacBook Pro
0m58.738s

fq-count

Count the number of reads in a FASTQ

fq-meta

Scenario: You are given an old dusty hard drive packed with sequence data. Your collaborator says "We have some great sequencing data here, if only someone could analyze it." You peek at the filesystem and discover that FASTQs have been renamed, removing crucial information about how they were generated. Your collaborator, however, recalls certain details about which data was sequenced on which sequencer and he has a list of sequencing barcodes and associated samples that you can match on." If only there was a way to determine the barcodes, sequencer, or other essential metadata for each FASTQ...

If this scenerio sounds familiar, fq-meta™ is for you. It samples the first few sequences of a FASTQ and outputs a summary for every FASTQ including:

A best guess as to the type of sequencer based on instrument and flowcell information
A best guess for the quality score format
The barcode/index used to multiplex the sample which can be useful for identifying samples
The machine, run, and other metadata about the FASTQ
Information derived from the filename
The file location

Data is output in a tab-delimited format so that this tool can be used to assemble a database of FASTQs and their associated information. You can do so like this:

sc fq-meta --header > my_fq_database.txt # Use this to output just the variable names
sc fq-meta *.fq.gz >> my_fq_database.txt

The resulting dataset can be combined with other metadata and filtered to select samples for processing in a pipeline.

You can also parallelize the operation with GNU-parallel.

sc fq-meta --header > my_fq_database.txt # Use this to output just the variable names
parallel -j 8 sc fq-meta ::: find . -name '*.fq.gz' >> my_fq_database.txt

Example output``

machine	sequencer	prob_sequencer	flowcell	flowcell_description	run	lane	index1	qual_format	qual_phred	qual_multiple	min_qual	max_qual	n_lines	basename	absolute_path
D00446	HiSeq2000/2500	high:machine+flowcell	C8HN4ANXX	High Output (8-lane) v4 flow cell	1	8	GCTCGGTA	Sanger;Illumina 1.8+	Phred+33	TRUE	14	14	1	illumina_2000_2500.fq	…
K00100	HiSeq3000/4000	high:machine+flowcell	H300JBBXX	(8-lane) v1 flow cell	33	6	GCCAAT	Sanger;Illumina 1.8+	Phred+33	TRUE	14	14	1	illumina_3000_4000.fq	…
D00209	HiSeq2000/2500	high:machine+flowcell	CACDKANXX	High Output (8-lane) v4 flow cell	258	6	CGCAGTT	Sanger;Illumina 1.8+	Phred+33	TRUE	0	37	1	illumina_6.fq	…
D00209	HiSeq2000/2500	high:machine+flowcell	CACDKANXX	High Output (8-lane) v4 flow cell	258	6	GAGCAAG	Sanger;Illumina 1.8+	Phred+33	TRUE	0	37	1	illumina_7.fq	…

Input

fq-meta accepts both gzipped FASTQs (.fq.gz, .fastq.gz ~ inferred from .gz extension) and raw text FASTQs.

Example Output

Columns

Assembling an FQ-Database

sc fq-meta --header > fastq_db.tsv # Prints just the header
sc fq-meta sample_1_R1.fq.gz sample_1_R2.fq.gz sample_2_R1.fq.gz sample_2_R2.fq.gz >> fastq_db.tsv

insert-size

Calculate the insert-size of a bam or a set of bams. Bams are estimated by evaluating up to the 99.5th percentile of read insert-sizes. This gives numbers that are very close to Picard a lot faster.

sc insert-size --header input.bam # One bam
sc insert-size --header *.bam # Multiple bams

Options

--verbose Output information about progress.
--dist Output the frequency distribution of insert sizes.

Output

median	mean	std_dev	min	percentile_99.5	max_all	n_reads	n_accept	n_use	sample
179	176.5	63.954	38	358	359	237	101	100	AB1

Calculate insert-size metrics on a set of bams.

json (VCF to JSON conversion)

Convert a VCF to JSON. This is most useful in the context of a web-service, where you can serve variant data for the purposes of visualization, browsing, etc. For an example of what this might look like, see the elegans variation variant browser.

The json command can be used to parse ANN columns (effect annotations) by specifying --annotation. Additionally, FORMAT fields (GT, DP, etc) can be zipped with sample names.

Usage:
  sc json [options] vcf region

Arguments:
  vcf              VCF to convert to JSON
  region           Region

Options:
  -i, --info=INFO            comma-delimited INFO fields; Use 'ALL' for everything
  -f, --format=FORMAT        comma-delimited FORMAT fields; Use 'ALL' for everything
  -s, --samples=SAMPLES      Set Samples (default: ALL)
  -p, --pretty               Prettify result
  -a, --array                Output as a JSON array instead of ind. JSON lines
  -z, --zip                  Zip sample names with FORMAT fields (e.g. {'sample1': 25, 'sample2': 34})
  -n, --annotation           Parse ANN Fields
  --pass                     Only output variants where FILTER=PASS
  --debug                    Debug
  -h, --help                 Show this help

Example

sc json --format=GT --zip --pretty tests/data/test.vcf.gz

Outputs:

{
  "CHROM": "I",
  "POS": 41947,
  "ID": ".",
  "REF": "A",
  "ALT": [
    "T"
  ],
  "QUAL": 999.0,
  "FILTER": [
    "PASS"
  ],
  "FORMAT": {
    "GT": {
      "AB1": [
        2,
        2
      ],
      "...": [
        0,
        0
      ]
    }
  }
}

You can also specify custom SGT and TGT output formats which transform GT fields from their typical output.

GT - Outputs genotypes as [[0, 0], [0, 1], [1, 1], ...
SGT - Outputs genotypes as 0/0, 0/1, 1/1, ...
TGT - Outputs genotypes as T/T, A/T, A/A, ...

Cross-compilation

nim c --cpu:i386 --os:linux --threads:on --compileOnly sc.nim

cassimons / seq-collection