lweasel / piquant

A pipeline to assess the quantification of transcripts.

Home Page: http://piquant.readthedocs.org/en/latest/

Accuracy calculations should be done with TPM, not FPKM

lweasel opened this issue

Use "Transcripts Per Million", rather than "Fragments Per Kilobase per Million mapped reads".

The motivation is that Sailfish doesn't map reads, so we have no way of calculating the "real" FPKM value for comparison with estimates, whereas TPM can be calculated directly from the FluxSimulator expression file. Since accuracy calculations will no longer take into account the proportion of reads actually mapped, we will need to ensure that the mapping behaviour of, e.g., Bowtie for transcripts and TopHat for the genome, is comparable.

For current quantifiers:
RSEM - reports TPM
eXpress - reports TPM
Cufflinks - reports FPKM. Will need to convert to TPM via TPM_i = 10^6 FPKM_i / (sum_j FPKM_j) (see the sketch below this list).
(Sailfish - reports TPM).
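
As an illustration, a minimal Python sketch of that Cufflinks conversion might look like the following; the `isoforms.fpkm_tracking` file name and the `tracking_id`/`FPKM` column names are assumptions about the Cufflinks output and may need adjusting:

```python
# Minimal sketch of the Cufflinks FPKM -> TPM conversion described above.
# The file name and the "tracking_id"/"FPKM" column names are assumptions
# about the Cufflinks output format.
import pandas as pd

def fpkm_to_tpm(fpkm):
    """TPM_i = 10^6 * FPKM_i / sum_j FPKM_j (valid within a single sample)."""
    return 1e6 * fpkm / fpkm.sum()

abundances = pd.read_csv("isoforms.fpkm_tracking", sep="\t")
abundances["TPM"] = fpkm_to_tpm(abundances["FPKM"])
print(abundances[["tracking_id", "TPM"]].head())
```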

Thanks for the nice tools compiled here.
I have difficulty understanding your conversion here: "Cufflinks - reports FPKM. Will need to convert to TPM via TPM_i = 10^6 FPKM_i / (sum_j FPKM_j)"
According to your formula, it's a simple linear transformation from FPKM to TPM.

“FPKM” – fragments per kilobase of exon per million reads.
Isn't the scaling factor (isoform length / effective gene length) different for each isoform/gene?

Thanks.

Hi @kyzhao,

The above formula is correct. The FPKM is a length-normalized abundance for transcripts (as is the TPM). Within a particular sample, the only difference between the two is a global scaling factor of (1 / \sum_{j} FPKM_j) * 10^6. To see this, think about the FPKM equation:

FPKM_i = f_i / ((l_i / 1000) * (F / 10^6)) = 10^9 * (f_i / (l_i * F))

where f_i is the number of fragments mapping to transcript i, l_i is the length of transcript i, and F is the total number of mapped fragments. Now, it's clear that 10^9 is simply a scaling factor, as is 1/F. The only quantity here that changes per-transcript is (f_i / l_i) --- the number of fragments mapping to a transcript divided by its length. This quantity is simply a length-normalized measure of abundance. Now, the TPM is given by:

TPM_i = 10^6 * \tau_i = 10^6 * ((\eta_i / l_i) / \sum_{j}(\eta_j / l_j))

here, \tau_i is the transcript fraction for transcript i and \eta_i is the nucleotide fraction for transcript i. \eta_i is directly proportional to the number of reads drawn from transcript i (i.e. directly proportional to f_i in the FPKM equation); conceptually, it is the total fraction of all sequenced nucleotides that can be said to originate from transcript i. Therefore, \tau_i is simply a length-normalized measure of abundance: it is the total fraction of all sequenced transcripts that are equivalent to transcript i. The easiest way to think about TPM is: if we had a population of 1,000,000 transcripts, how many copies would we have of transcript i? This is TPM_i.
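
To spell out the step that links the two definitions (using only the symbols defined above, and the fact that \eta_i is proportional to f_i with the same constant for every transcript in a sample):

```latex
% Within a sample \eta_i \propto f_i, and FPKM_i = 10^9 f_i / (l_i F),
% so f_i / l_i = FPKM_i \cdot F / 10^9; the constants cancel in the ratio.
\begin{align*}
  \tau_i &= \frac{\eta_i / l_i}{\sum_j \eta_j / l_j}
          = \frac{f_i / l_i}{\sum_j f_j / l_j}
          = \frac{\mathrm{FPKM}_i}{\sum_j \mathrm{FPKM}_j}\\
  \mathrm{TPM}_i &= 10^6 \, \tau_i
          = \frac{10^6 \cdot \mathrm{FPKM}_i}{\sum_j \mathrm{FPKM}_j}
\end{align*}
```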

Long story short, you can see that TPM_i and FPKM_i capture the same essential information: they are length-normalized measures of abundance. There are, of course, ways in which TPM is superior to FPKM (the total TPMs will always add up to 1,000,000, whereas this is not true of FPKM). However, within a sample, FPKM and TPM are directly proportional, and so the equation @lweasel is using to transform one into the other (simply converting FPKM_i to \tau_i by making it a proper fraction and then multiplying by 1,000,000) is valid.
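
A quick numeric sanity check of that proportionality, with made-up fragment counts and lengths:

```python
# Illustrative check that rescaling FPKM recovers the TPM computed
# directly from fragment counts and transcript lengths.
import numpy as np

f = np.array([500.0, 1200.0, 300.0])    # fragments mapping to each transcript (made up)
l = np.array([1500.0, 3000.0, 800.0])   # transcript lengths in bases (made up)
F = f.sum()                             # total mapped fragments

fpkm = 1e9 * f / (l * F)                # FPKM_i = 10^9 * f_i / (l_i * F)
tpm_from_fpkm = 1e6 * fpkm / fpkm.sum() # rescale FPKM to a proper fraction, then * 10^6

rate = f / l                            # length-normalized abundance f_i / l_i
tpm_direct = 1e6 * rate / rate.sum()    # TPM_i = 10^6 * (f_i/l_i) / sum_j (f_j/l_j)

assert np.allclose(tpm_from_fpkm, tpm_direct)
print(tpm_direct)                       # both routes give the same TPM values
```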

Many thanks for the question @kyzhao, and to @rob-p for the great answer!