kfuku52/amalgkit

Overview

amalgkit contains tools to amalgamate RNA-seq data from diverse research projects to enable a large-scale evolutionary gene expression analysis with unbiased datasets.

Dependency

General

amalgkit metadata

Nothing

amalgkit getfastq

- parallel-fastq-dump for --pfd yes (default)
- fastp for --fastp yes (default)

amalgkit quant

kallisto

amalgkit curate

R, with various libraries:
- Biobase
- pcaMethods
- colorspace
- RColorBrewer
- sva
- MASS
- NMF
- dendextend
- amap
- pvclust
- Rtsne
- vioplot

Installation

# Installation with pip
pip install git+https://github.com/kfuku52/amalgkit

# This should show complete options
amalgkit -h

`amalgkit metadata` – SRA metadata curation

amalgkit metadata is a subcommand that fetches and curates metadata from the NCBI SRA database. It relies on a set of config files to tailor the curation; see /amalgkit/config/test/ for examples. Currently, config files are available only for RNA-seq data from vertebrate organs. To obtain good-quality metadata for other taxa or tissues, you would have to edit the config files extensively.

Test run

mkdir -p amalgkit_out; cd $_

svn export https://github.com/kfuku52/amalgkit/trunk/config

config_dir="./config/test"

amalgkit metadata \
--config_dir ${config_dir} \
--out_dir . \
--entrez_email 'aaa@bbb.com' # Use your own email address.

If you get a network connection error, simply rerun the same analysis. The program will resume the analysis using intermediate files in --out_dir.

Output

metadata_01_raw_YYYY_MM_DD-YYYY_MM_DD.tsv: This table is a reformatted version of the SRA metadata, which is originally in XML format.
metadata_02_grouped_YYYY_MM_DD-YYYY_MM_DD.tsv: Similar attributes (columns) are grouped into a few categories according to the .config settings.
metadata_03_curated_YYYY_MM_DD-YYYY_MM_DD.tsv: A variety of curation steps are applied according to the .config settings. Data unsuitable for evolutionary gene expression analysis, such as those from miRNA-seq, are marked No in the is_qualified column. Some sample types have been sequenced intensively (e.g., livers of Bos taurus). These can be subsampled with the --max_sample option; excluded data are marked No in the is_sampled column.
pivot_*.tsv: "species x tissue" pivot tables.
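As an illustration of how the curated table might be consumed downstream, the following Python sketch keeps only rows that are not marked No in is_qualified or is_sampled and then counts the retained samples per species and tissue. It is not part of amalgkit. Only the is_qualified and is_sampled column names come from the description above; the run, scientific_name and tissue column names and all rows are made up for the example.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration. Only "is_qualified" and "is_sampled" are
# column names taken from the README; "run", "scientific_name" and "tissue"
# are hypothetical column names.
SAMPLE_TSV = (
    "run\tscientific_name\ttissue\tis_qualified\tis_sampled\n"
    "RUN1\tBos taurus\tliver\tYes\tYes\n"
    "RUN2\tBos taurus\tliver\tYes\tNo\n"
    "RUN3\tBos taurus\tbrain\tNo\tYes\n"
    "RUN4\tGallus gallus\tbrain\tYes\tYes\n"
)


def select_usable_rows(handle):
    """Return rows that are not marked 'No' in is_qualified or is_sampled."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row
        for row in reader
        if row["is_qualified"].strip().lower() != "no"
        and row["is_sampled"].strip().lower() != "no"
    ]


def species_by_tissue_counts(rows):
    """Count retained samples for each (species, tissue) pair."""
    return Counter((row["scientific_name"], row["tissue"]) for row in rows)


usable = select_usable_rows(io.StringIO(SAMPLE_TSV))
counts = species_by_tissue_counts(usable)
for (species, tissue), n in sorted(counts.items()):
    print(f"{species}\t{tissue}\t{n}")
```

To apply this to a real file, replace the in-memory sample with `open(path, newline="")` on a metadata_03_curated table and adjust the column names to those actually present in the file.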

`amalgkit getfastq` – Generate assembly-ready fastq

amalgkit getfastq takes a BioProject/BioSample/SRA ID as input and generates RNA-seq fastq files for transcriptome assembly. In the assembly process, the more RNA-seq libraries you include, the more transcripts you get. However, assembling an overwhelming amount of data is often computationally challenging. amalgkit getfastq can automatically subsample RNA-seq reads from different libraries. The amount of data you need (specified by --max_bp) depends on many factors, including the assembly program you use. See this paper for example.

Test run

mkdir fastq_files

amalgkit getfastq \
--entrez_email 'aaa@bbb.com' \
--id 'PRJDB4514' \
--threads 2 \
--out_dir ./fastq_files \
--max_bp '75,000'
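To make the idea of a base-pair budget concrete, here is a minimal Python sketch of one way a total such as --max_bp could be divided among several libraries. How amalgkit actually distributes the budget is not described in this README, so the even-share-with-redistribution scheme, the function names and the library sizes below are all assumptions made for illustration.

```python
def parse_bp(text):
    """Convert a base-pair string such as '75,000' to an integer."""
    return int(text.replace(",", ""))


def allocate_bp(max_bp, library_sizes):
    """Split a base-pair budget across libraries (hypothetical scheme).

    Each library gets an equal share of what is left. A library smaller than
    its share is used whole, and its unused share rolls over to the
    remaining, larger libraries.
    """
    allocation = {}
    remaining_budget = max_bp
    # Visit libraries from smallest to largest so leftovers roll forward.
    pending = sorted(library_sizes.items(), key=lambda item: item[1])
    while pending:
        share = remaining_budget // len(pending)
        name, size = pending.pop(0)
        take = min(size, share)
        allocation[name] = take
        remaining_budget -= take
    return allocation


# Made-up library sizes in base pairs.
libraries = {"libA": 10_000, "libB": 100_000, "libC": 100_000}
budget = parse_bp("75,000")
allocation = allocate_bp(budget, libraries)
print(allocation)
```

With these made-up sizes, the small library is taken whole and the two large ones split the rest of the budget evenly, so the total never exceeds the requested number of base pairs.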

`amalgkit quant` - quantification of RNAseq data

amalgkit quant quantifies transcript abundances from RNA-seq data using kallisto. All required input and intermediate files are assumed to be in the working directory (default ./).

Input files

Fastq files (single-end or paired-end) for quantification, ideally processed by amalgkit getfastq; custom data should also work.
A reference file (usually a FASTA file of cDNA sequences) for index building if --build_index yes (default), or an existing index file if --build_index no.
--index is either the name to give the index file being built (optional in this case; default: id_name.idx) or the existing index file if --build_index no.
Results are stored in results_quant.

Contents of working directory:

SRR8819967_1.amalgkit.fastq.gz
SRR8819967_2.amalgkit.fastq.gz
arabidopsis_thaliana.fasta (this is the reference file, usually cDNA sequences)
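The directory listing above can be checked programmatically before running the quantification. The Python sketch below decides whether the fastq files for a given ID look paired-end or single-end. The paired-end names follow the listing above; the single-end name (`<id>.amalgkit.fastq.gz`) and the function name are assumptions for illustration, not documented amalgkit behavior.

```python
def find_fastq_files(run_id, filenames):
    """Return (layout, files) for a run ID given a list of file names."""
    # Paired-end naming as shown in the directory listing above.
    paired = [
        f"{run_id}_1.amalgkit.fastq.gz",
        f"{run_id}_2.amalgkit.fastq.gz",
    ]
    # Assumed single-end naming; not confirmed by the README.
    single = f"{run_id}.amalgkit.fastq.gz"

    present = set(filenames)
    if all(name in present for name in paired):
        return "paired", paired
    if single in present:
        return "single", [single]
    raise FileNotFoundError(f"No fastq files found for {run_id}")


working_dir = [
    "SRR8819967_1.amalgkit.fastq.gz",
    "SRR8819967_2.amalgkit.fastq.gz",
    "arabidopsis_thaliana.fasta",
]
layout, files = find_fastq_files("SRR8819967", working_dir)
print(layout, files)
```

For a real directory, `os.listdir(".")` can supply the list of file names.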

Usage example

amalgkit quant \
--id SRR8819967 \
--index arabidopsis_thaliana.idx \
--ref arabidopsis_thaliana.fasta \
--out_dir ./fastq_files

Output

SRR8819967_abundance.h5: bootstrap results in HDF5 format (readable with h5dump)
SRR8819967_run_info.json: contains run info
SRR8819967_abundance.tsv: contains target_id, length, eff_length, est_counts and tpm in human-readable .tsv format
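The abundance table is plain tab-separated text, so it can be read with the Python standard library. The following sketch loads the tpm column and lists the most highly expressed targets. The column names are those listed above; the transcript IDs and values are made up, and the function names are only illustrative.

```python
import csv
import io

# Made-up abundance table using the column names listed above.
SAMPLE_ABUNDANCE = (
    "target_id\tlength\teff_length\test_counts\ttpm\n"
    "tx1\t1500\t1321.0\t120.0\t300000.0\n"
    "tx2\t900\t721.0\t0.0\t0.0\n"
    "tx3\t2100\t1921.0\t400.0\t700000.0\n"
)


def read_tpm(handle):
    """Map each target_id to its TPM value."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["target_id"]: float(row["tpm"]) for row in reader}


def top_expressed(tpm_by_target, n=2):
    """Return the n targets with the highest TPM, highest first."""
    ranked = sorted(tpm_by_target.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]


tpm = read_tpm(io.StringIO(SAMPLE_ABUNDANCE))
top = top_expressed(tpm)
for target_id, value in top:
    print(f"{target_id}\t{value}")
```

To read an actual result, pass `open("SRR8819967_abundance.tsv", newline="")` instead of the in-memory sample.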

`amalgkit curate` - transcriptome curation

Input files

output files of merge or cstmm
metadata table from metadata

Usage example

amalgkit curate \
--infile transcriptome.tsv \
--metadata metadata.tsv \
--dist_method 'pearson' \
--tissues brain liver heart embryo \
--out_dir './'
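The --dist_method 'pearson' option in the example above refers to a Pearson-correlation-based distance between samples. What amalgkit curate computes internally is not spelled out in this README, so the following Python sketch only illustrates one common definition, distance = 1 - r, on made-up expression vectors; the function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def pearson_distance(x, y):
    """One common correlation distance: 0 for r = 1, 2 for r = -1."""
    return 1.0 - pearson_r(x, y)


# Made-up expression levels of four genes in three samples.
brain_1 = [1.0, 2.0, 3.0, 4.0]
brain_2 = [2.0, 4.0, 6.0, 8.0]
liver_1 = [4.0, 3.0, 2.0, 1.0]

print(pearson_distance(brain_1, brain_2))
print(pearson_distance(brain_1, liver_1))
```

Under this definition, samples with proportional expression profiles are at distance close to 0, while perfectly anti-correlated profiles are at distance close to 2.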

Output

Reference

Although amalgkit includes novel, unpublished functions, some of its functionality, including metadata curation, expression level quantification, and further curation steps, is described in the following paper on the transcriptome amalgamation of 21 vertebrate species.

Fukushima K*, Pollock DD*. 2020. Amalgamated cross-species transcriptomes reveal organ-specific propensity in gene expression evolution. Nature Communications 11: 4459 (DOI: 10.1038/s41467-020-18090-8) open access

Licensing

amalgkit is BSD-licensed (3 clause). See LICENSE for details.
