Python implementation of common RNA-seq normalization methods:
- CPM (Counts per million)
- FPKM (Fragments per kilobase of transcript per million mapped fragments)
- TPM (Transcripts per million)
- UQ (Upper quartile)
- CUF (Counts adjusted with UQ factors)
- TMM (Trimmed mean of M-values)
- CTF (Counts adjusted with TMM factors)
For an in-depth description of the methods, see the documentation.
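As a quick illustration of the simplest method in the list, CPM rescales each sample so that its counts sum to one million. The snippet below is a minimal pure-Python sketch of that formula, not RNAnorm's implementation; the `cpm` helper and the toy counts are made up for illustration.

```python
# Toy raw counts per sample (illustrative values only).
counts = {
    "Sample_1": {"Gene_1": 200, "Gene_2": 300, "Gene_3": 500, "Gene_4": 2000, "Gene_5": 7000},
    "Sample_2": {"Gene_1": 400, "Gene_2": 600, "Gene_3": 1000, "Gene_4": 4000, "Gene_5": 14000},
}


def cpm(sample_counts):
    """Hypothetical helper: scale one sample's raw counts to counts per million."""
    library_size = sum(sample_counts.values())
    return {gene: count * 1e6 / library_size for gene, count in sample_counts.items()}


cpm_values = {sample: cpm(genes) for sample, genes in counts.items()}

# Sample_2 was sequenced twice as deeply as Sample_1, so the CPM values agree.
print(cpm_values["Sample_1"]["Gene_1"], cpm_values["Sample_2"]["Gene_1"])
# prints: 20000.0 20000.0
```

Because CPM only corrects for library size, the two samples, which differ only in sequencing depth, end up with identical values.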
- Pure Python implementation (no need for R, etc.)
- Compatible with Scikit-learn
- Command line interface
- Verbose documentation
- Validated method implementation
We recommend installing RNAnorm with pip:
pip install rnanorm
The implemented methods can be executed from Python or from the command line.
The most common use case is to run normalization from Python:
>>> from rnanorm.datasets import load_toy_data
>>> from rnanorm import FPKM
>>> dataset = load_toy_data()
>>> # Expressions need to have genes in columns and samples in rows
>>> dataset.exp
          Gene_1  Gene_2  Gene_3  Gene_4  Gene_5
Sample_1     200     300     500    2000    7000
Sample_2     400     600    1000    4000   14000
Sample_3     200     300     500    2000   17000
Sample_4     200     300     500    2000    2000
>>> fpkm = FPKM(dataset.gtf_path).set_output(transform="pandas")
>>> fpkm.fit_transform(dataset.exp)
            Gene_1    Gene_2    Gene_3    Gene_4    Gene_5
Sample_1  100000.0  100000.0  100000.0  200000.0  700000.0
Sample_2  100000.0  100000.0  100000.0  200000.0  700000.0
Sample_3   50000.0   50000.0   50000.0  100000.0  850000.0
Sample_4  200000.0  200000.0  200000.0  400000.0  400000.0
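The FPKM values in the session above can be checked by hand. The sketch below applies the standard FPKM formula (count divided by gene length in kilobases and by library size in millions) in plain Python. The gene lengths are an assumption: they are not shown in the session and were inferred from its output, and the `fpkm` helper is hypothetical rather than part of RNAnorm.

```python
counts = {
    "Sample_1": [200, 300, 500, 2000, 7000],
    "Sample_2": [400, 600, 1000, 4000, 14000],
    "Sample_3": [200, 300, 500, 2000, 17000],
    "Sample_4": [200, 300, 500, 2000, 2000],
}

# ASSUMPTION: gene lengths in base pairs for Gene_1..Gene_5, inferred from the
# FPKM output above; the actual toy GTF file is not shown in this document.
gene_lengths_bp = [200, 300, 500, 1000, 1000]


def fpkm(sample_counts, lengths_bp):
    """Hypothetical helper: FPKM for one sample.

    count * 1e9 / (length_bp * library_size) is the same as
    count / (length_bp / 1e3) / (library_size / 1e6).
    """
    library_size = sum(sample_counts)
    return [
        count * 1e9 / (length * library_size)
        for count, length in zip(sample_counts, lengths_bp)
    ]


fpkm_values = {sample: fpkm(row, gene_lengths_bp) for sample, row in counts.items()}
for sample, row in fpkm_values.items():
    print(sample, row)
```

Under these assumed lengths, Sample_1 and Sample_2 get identical rows, while the much larger Gene_5 count in Sample_3 halves the values of its other genes.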
Normalization from the command line is also supported. To list the available methods and show general help:
rnanorm --help
Get info about a particular method, e.g., CPM:
rnanorm cpm --help
To normalize with CPM:
rnanorm cpm exp.csv --out exp_cpm.csv
The file exp.csv needs to be a comma-separated file with genes in columns and samples in rows. Values should be raw counts. The output is saved to exp_cpm.csv. Example of an input file:
cat exp.csv
,Gene_1,Gene_2,Gene_3,Gene_4,Gene_5
Sample_1,200,300,500,2000,7000
Sample_2,400,600,1000,4000,14000
Sample_3,200,300,500,2000,17000
Sample_4,200,300,500,2000,2000
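To make the expected layout concrete, here is a small sketch that parses such a file with Python's standard `csv` module and checks that it holds one row per sample and one column per gene. The `read_counts` helper is hypothetical and is not how the `rnanorm` command reads its input. Rejecting non-integer values is a strictness choice of this sketch, not a documented rule of the tool.

```python
import csv
import io

EXAMPLE_CSV = """\
,Gene_1,Gene_2,Gene_3,Gene_4,Gene_5
Sample_1,200,300,500,2000,7000
Sample_2,400,600,1000,4000,14000
Sample_3,200,300,500,2000,17000
Sample_4,200,300,500,2000,2000
"""


def read_counts(handle):
    """Hypothetical helper: parse a samples-by-genes CSV into (genes, table)."""
    reader = csv.reader(handle)
    header = next(reader)
    genes = header[1:]  # the first header cell is the empty corner cell
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        sample, values = row[0], row[1:]
        if len(values) != len(genes):
            raise ValueError(f"{sample}: expected {len(genes)} values, got {len(values)}")
        counts = [int(value) for value in values]  # int() rejects e.g. "1.5"
        if any(count < 0 for count in counts):
            raise ValueError(f"{sample}: raw counts cannot be negative")
        table[sample] = counts
    return genes, table


genes, table = read_counts(io.StringIO(EXAMPLE_CSV))
print(genes)
print(table["Sample_3"])
```

To read a real file, pass a handle opened with `open("exp.csv", newline="")` instead of the `io.StringIO` wrapper.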
One can also provide input through standard input:
cat exp.csv | rnanorm cpm --out exp_cpm.csv
If the file specified with --out already exists, the command will fail. If you are sure that you wish to overwrite it, use the --force flag:
cat exp.csv | rnanorm cpm --force --out exp_cpm.csv
If no file is specified with the --out parameter, the output is printed to standard output:
cat exp.csv | rnanorm cpm > exp_cpm.csv
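The three output behaviors described above (refuse to overwrite, overwrite when forced, fall back to standard output) can be mirrored in a few lines of Python. This is only a sketch of the described behavior under stated assumptions: `write_output` is a hypothetical helper, not code taken from RNAnorm.

```python
import os
import sys
import tempfile


def write_output(text, out_path=None, force=False):
    """Hypothetical helper mirroring the --out / --force behavior described above."""
    if out_path is None:
        # No output file given: print to standard output.
        sys.stdout.write(text)
        return
    # Mode "x" raises FileExistsError if the file already exists; "w" overwrites it.
    mode = "w" if force else "x"
    with open(out_path, mode, newline="") as handle:
        handle.write(text)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "exp_cpm.csv")
    write_output("first\n", target)
    try:
        write_output("second\n", target)  # file exists and force is off
    except FileExistsError:
        refused = True
    else:
        refused = False
    write_output("second\n", target, force=True)  # explicit overwrite
    with open(target) as handle:
        final_content = handle.read()

print(refused, repr(final_content))
# prints: True 'second\n'
```

Opening with mode `"x"` leaves the existence check to the operating system, which avoids a race between checking for the file and creating it.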
The TPM and FPKM methods require gene lengths. These can be provided either with a GTF file or with a "gene lengths" file. The latter is a two-column file: the first column should contain the genes in the header of exp.csv, and the second column should contain gene lengths computed by the union exon model:
# Use GTF file
rnanorm tpm exp.csv --gtf annotations.gtf > exp_out.csv
# Use gene lengths file
rnanorm tpm exp.csv --gene-lengths lengths.csv > exp_out.csv
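For readers unfamiliar with the union exon model, the sketch below shows the idea: a gene's length is the number of bases covered by at least one of its exons, so overlapping exons from different transcripts are counted once. It then applies the standard TPM formula using such lengths. Both helpers, the exon coordinates, and the gene lengths are hypothetical values chosen for illustration. This is not RNAnorm's implementation.

```python
def union_exon_length(exons):
    """Hypothetical helper: bases covered by at least one exon.

    exons: iterable of (start, end) pairs, 1-based and inclusive as in GTF.
    """
    total = 0
    current_start = current_end = None
    for start, end in sorted(exons):
        if current_end is None or start > current_end + 1:
            # Gap before this exon: close the previous merged block.
            if current_end is not None:
                total += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent exon: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start + 1
    return total


def tpm(sample_counts, lengths_bp):
    """Hypothetical helper: TPM for one sample (length-normalize, then scale to 1e6)."""
    rates = [count / length for count, length in zip(sample_counts, lengths_bp)]
    total_rate = sum(rates)
    return [rate / total_rate * 1e6 for rate in rates]


# Two overlapping exons plus one separate exon (made-up coordinates).
gene_exons = [(1, 100), (51, 200), (301, 400)]
print(union_exon_length(gene_exons))
# prints: 300

# ASSUMPTION: illustrative gene lengths in base pairs for five genes.
lengths_bp = [200, 300, 500, 1000, 1000]
sample_tpm = tpm([200, 300, 500, 2000, 7000], lengths_bp)
print([round(value, 2) for value in sample_tpm])
```

Summing the three exon lengths directly would give 350 bases; the union exon model gives 300 because the 50 overlapping bases are counted only once. Unlike FPKM, the TPM values of every sample sum to one million.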
To learn about contributing to the code base, read the Contributing section.