tfbs-studies

Bioinformatics study on transcription factors binding sites (TFBS) and their association with specific tissues.

This repository needs another directory, at the same parent directory, called "input" (or "../input", relatively"). It is available at http://work.bioinformatics-brazil.org/~pitagoras/TF-findings/input/.

The results are stored in the "../results/" directory. The scripts must be executed in order(1-..., 2-..., 3-... etc).

Tasks:

Task 1. Finding TFBSs for each gene

First we need to know that transcription is the process where a gene's DNA sequence is copied (transcribed) into an RNA molecule. So when enzyme RNA polymerase, which makes a new RNA molecule from a DNA template, his must attach to the DNA of the gene. It attaches at a spot called the promoter and his can attach to the promoter region only with the help of proteins called transcription factors inside their binding sites. So transcription factors are proteins that regulate the transcription of genes (activating or repressing), that is their copying into RNA, on the way to making a protein. So the binding sites are regions that are linked to transcriptional regulation of a gene.

For this task, our input data is:

Transcription factors file: TFBS-ENCODE.bed
Genes reference file: coord_upstream5kb_refGene.bed

This command has shown to be efficient at intersecting the two inputs (0-createIntersectionBed.sh):

bedtools intersect -wa -wb -a TFBS-ENCODE.bed -b coord_upstream5kb_refGene.bed > bedIntersectWaWbTFBSinGenes.bed

Ouput size: +- 196MB

1.969.951 lines, almost 2 million transcription factor binding sites related to some gene on the reference input.

Available at: http://work.bioinformatics-brazil.org/~pitagoras/TF-findings/results/bedIntersectWaWbTFBSinGenes.bed

Task 2: Finding Transcription Factors related with tissues using FPKM values

Input data:

Human TFBSs intersected with genes in .bed (link)
Expression (FPKM) of human genes in each tissue, tabular table in .txt (link)

Desired Output:

TFs most or less represented in a single tissue, compared with the others

The idea here is finding out Transcription Factors which, compared with their representation in all tissues, are very or less related to specific tissues. Since this is a more complex task, we will separate it in several subtasks:

Task 2.1: Filtering intersection .bed archive from previous task

The intersection .bed archive has a lot of information that are not useful for this task. We just want to know if the transcription factors had intersections with genes (and what genes). So we will delete most columns, leaving just:

tfName	geneName	count
TAF1	MIR1302-11	12
TAF1	WASH7P	5
HEY1	MIR1302-2	1
HEY1	MIR1302-10	2
...	...

All this says is that MIR1302-11 and WASH7P have binding sites to TAF1, MIR1302-2 and MIR1302-10 have to HEY1 and so on... The names at "gene" are an abbreviation of the gene name column at the original .bed file. Gene names not found at the reference FPKM file are discarted.

For each individual gene, there can be more than one binding site to the same transcription factor. The 'count' column is this number of binding sites.

The script used to do this was 1-filterIntersectionBed.py.

Output:

bedIntersectWaWbTFBSinGenesFiltered.tsv

Task 2.2: Transcription Factor representativity in tissues

Here’s how you do it for RPKM: First, count up the total reads in a sample and divide that number by 1,000,000 – this is our “per million” scaling factor. Then, divide the read counts by the “per million” scaling factor. This normalizes for sequencing depth, giving you reads per million (RPM). Finally, divide the RPM values by the length of the gene, in kilobases. This gives you RPKM. (RNA-Seq blog)

FPKM is very similar to RPKM. RPKM was made for single-end RNA-seq, where every read corresponded to a single fragment that was sequenced. FPKM was made for paired-end RNA-seq. With paired-end RNA-seq, two reads can correspond to a single fragment, or, if one read in the pair did not map, one read can correspond to a single fragment. The only difference between RPKM and FPKM is that FPKM takes into account that two reads can map to one fragment (and so it doesn’t count this fragment twice). (also RNA-Seq blog)

Now we want to calculate, for each transcription factor, the sum of the FPKMs (of each tissue) of the genes which have binding sites to that transcription factor. The resulting table would have the following columns:

tfName	adipose_tissue	adrenal_gland	...	genesWithBS	bindingSites	mean	stDeviation
Transcription Factor name	Sum of the FPKMs in adipose tissue	the same for adrenal_gland	more sums for more tissues	Count of genes with binding sites to TF	Total of binding sites to the TF in all tissues	Mean FPKM sum	Standard deviation of FPKM sum

This is similar to the reference expression file with the FPKM values, but this table is about the transcription factors, not the genes.

The function to create the columns is:

def getTFRow(tFactorName):
    print("\tStarting for " + tFactorName)
    tfDF = filteredBedIntersectDf[filteredBedIntersectDf.tfName == tFactorName]
    newRow = dict()
    newRow["tfName"] = tFactorName
    nGenes = 0
    bindingSites = 0
    for tissue in tissueNames:
        newRow[tissue] = 0.0
    for index, row in tfDF.iterrows():
        gName = row['geneName']
        nBindingSites = row['count']
        #this for loo is suposed to have only 1 iteration, unless there are 2
        #or more genes with he same name
        for geneIndex, geneRow in geneFpkmDf[geneFpkmDf.geneName == gName].iterrows():
            nGenes += 1
            bindingSites += nBindingSites
            for tissue in tissueNames:
                newRow[tissue] += geneRow[tissue]*nBindingSites
    newRow["genesWithBS"] = nGenes
    newRow["bindingSites"] = bindingSites
    print("\tDone for " + tFactorName)
    return newRow

In 2-makeTFactorDf.py

These columns are computed using Python's "multiprocessing.Pool", to maximize performance in multi-core systems.

Output:

tfFPKMinTissues.tsv

Task 2.3: Filtering pairs of Tissue with TF using Z-Score and Quantile

When a frequency distribution is normally distributed, we can find out the probability of a score occurring by standardising the scores, known as standard scores (or z scores). The standard normal distribution simply converts the group of data in our frequency distribution such that the mean is 0 and the standard deviation is 1. (Laerd Statistics)

For a transcription factor tf and a tissue ts, the z-score is:

z-score = (fpkmSum(tf, ts) - meanFpkmSum(tf)) / standardDeviationForFpkmSums(tf)

(Note that, for each tissue, the transcription factor will have a different Z-Score.)

We want to create a table with the following data:

tfName	tissue	genesWithBS	fpkmSum	mean	stDeviation	zScore
Transcription factor name	Tissue name	Number of genes with binding sites to TF in this tissue	Sum of the tissue FPKM values of all the genes with a binding site to the TF	Mean FPKM sum of TF for all tissues	Standard deviation in FPKM sums	The Z-Score

The rows for such a table are created using the following function:

def getZScoreRows(row, newDFRows):
    mean = row["mean"]
    std = row["stDeviation"]
    for tissueName in tissueNames:
        newRow = dict()
        newRow["tfName"] = row["tfName"]
        newRow["tissue"] = tissueName
        newRow["genesWithBS"] = row["genesWithBS"]
        newRow["fpkmSum"] = row[tissueName]
        newRow["mean"] = mean
        newRow["stDeviation"] = std
        if(std != 0.0):
            newRow["zScore"] = (row[tissueName] - mean) / std
        else:
            newRow["zScore"] = 0.0
        newDFRows.append(newRow)

In 3-zScoreOf-TFxTissue.py.

In these rows, the Z-Score quantifies how much the representativity of a TF in a tissue deviates from what would be normal for that TF. So the TFs we are looking for must have very big or very low Z-Scores. Now, all we need to do is filter the rows according to the Z-Scores.

For a given quantitative column in a Pandas DataFrame, the function "df['columnName'].quantile(0.XX)" returns a value in the column, so that XX% of the values are smaller or equal to it. Using this value, we can filter the top 5% and bottom 5% of the Z-Score rows:

#top and bottom percentage to be taken:
percent = 0.05

zScoreTopQuantile = zScoreDf['zScore'].quantile(1.0-percent)
zScoreBottomQuantile = zScoreDf['zScore'].quantile(percent)
zScoreTop = zScoreDf[zScoreDf.zScore >= zScoreTopQuantile]
zScoreTop.to_csv(zScoreOfTFxTissueTopPath, sep="\t", index=False)
zScoreBottom = zScoreDf[zScoreDf.zScore <= zScoreBottomQuantile]
zScoreBottom.to_csv(zScoreOfTFxTissueBottomPath, sep="\t", index=False)

In 4-filterTFsByZScore.py.

Outputs:

Task 3A: Find TFs related to tissues using χ2 test

Task 3A.1 Make TFs files per Tissue

Using the file named bedIntersectWaWbTFBSinGenesFiltered.tsv, we can know the TFs related with every gene and and using the files .txt found here, we can know the genes founded in every tissue. We just want to know what are the transcription factors pertaining in every tissue based on their genes .txt file. So for every gene pertained in a tissue, we will be going to add all the transcription factors related with that gene, leaving just:

tfName	frequency
...	...
NF-YA	2
Pol2-4H8	15
CTCF	24
CTCF	3
...	...

The script used to do this was 5-makeTFsPerTissue.py.

Output:

Task 3A.2 Filtering quantity of genes for every TF in every tissue

Using the file named bedIntersectWaWbTFBSinGenesFiltered.tsv, we can know the TFs related with every gene and using the files .txt found here, we can know all the TFs founded and their frequence in every tissue. We just want to know quantity of genes for every TF in every tissue. So for every TF found in a tissue, we will be going to count his frequency, leaving just:

tfName	adipose_tissue	adrenal_gland	...	testis	thyriod	allTissues	allGenes
Pol2	12	26	...	739	13	1820	17326
Pol2-4H8	4	30	...	628	11	1254	15210
TAF1	3	26	...	563	9	1112	14875
...	...	...	...	...	...	...	...

The script used to do this was 6-filterGenesByTFs.py.

Output:

tfByGenesAndTissues.tsv

Task 3A.3 Make χ2 test for every TF in every tissue

Using the file named tfByGenesAndTissues.tsv, we can know the frequence of every TF related with a gene in a tissue. So we just want to know if the expression of an TF is depleted or incresed in a tissue. For this we use the chi-square test, this test is used to quantitatively evaluate the relationship between the outcome of an experiment and the expected distribution of the phenomenon. In this case, the observed value is the frequency value of tf in each tissue and the expected value is the division of total tissue genes by the total of existing genes multiplied by the frequency value of tf in all genes. Using the equation below, where the summation occurs when multiple features are observed, however, in this case we are only observing one feature, the frequency. So we can find the value of the chi square:

From the value of chi square, we can find the value of p-value using stats.chi2.cdf function from SciPy module with degree of freedom equal to 1, where the degree of freedom represents the number of features, which represents the probability of observing a statistic test at least as extreme in a chi-squared distribution. Then we will filter the tfs with p-values less than 0.05 and say if they are depleted or increased. So for every TF found in a tissue, we will be going to calculate your qui square and p value and show if the TF is depleted or increased, leaving just:

tfName	Chi-Square Test	P-Value	State
STAT3	8.106659609271945	0.00441029237877788	Depleted
ELF1_(SC-631)	5.50034717859778	0.01901269858244259	Increased
NFKB	5.461921858500528	0.01943525906064092	Increased
...	...	...	...

The script used to do this was 8-chiSquaredOf-TFsPerTissue.py.

Output:

Task 3B: Finding TFs related to tissues using Monte Carlo approach (compare with random samples)

Input:

Lists of genes related to each tissue;
Count of BS between to a TF in a gene: bedIntersectWaWbTFBSinGenesFiltered.tsv;

The input data is used to create several sets and dictionaries:

#key: tissueName, value: set of genes representative to tissueName
genesPerTissue = dict()
#set of all geneNames
geneNames = set()
#set of tfNames
tfs = set()
#key: tfName, value: set of genes with at least one binding site to tfName
genesWithBSTo = dict()
#key: tissueName, value: list of random subsets of 'geneNames'
geneSamples = dict()
#key: (tfName, geneName), value: number of binding sites
tfbsCount = dict()

Our objective here is to, for each pair of tissue (TS) and transcription factor (TF), compare the number of binding sites to TF on the set of genes related TS, with the number of binding sites to TF on random sets of genes. To do that, we first create 2000 random sets of genes with the same size as the set of genes related to TS. Then we calculate the number of binding sites (to TF) on each of these random sets and save the results in an array. Our final number is the percentile of random binding site counts below or equal to the observed number of binding sites on the set of genes related to TS.

All that is calculated using the following method, given a pair of TSxTF:

def getBsCountInSamples(tfName, tissueName):
    genesWithBS = genesWithBSTo[tfName]
    bsCountInSamples = []
    for sample in geneSamples[tissueName]:
        geneSet = sample.intersection(genesWithBS)
        bsCount = countBSToTFinGeneSet(geneSet, tfName)
        bsCountInSamples.append(bsCount)
    return bsCountInSamples
            
def qualifyTFAtTissue(tfName, tissueName):
    genesWithBS = genesWithBSTo[tfName]
    genesInTissueWithBs = genesPerTissue[tissueName].intersection(genesWithBS)
    bsCountInTissue = countBSToTFinGeneSet(genesInTissueWithBs, tfName)
    bsCountInSamples = getBsCountInSamples(tfName, tissueName)
    bsCountInSamples.append(bsCountInTissue)
    
    npArray = np.asarray(bsCountInSamples)
    std = npArray.std()
    median = np.median(npArray)
    perc = stats.percentileofscore(bsCountInSamples, bsCountInTissue, 'weak')
    return (perc, median, std, bsCountInTissue, len(genesInTissueWithBs))

Next, all we need to do is filter which pairs of TSxTF are relevant. The criterion is:

Percentile being outside of the standard deviation (in the random samples);
To be among the most representative to the tissue: percentile >= 95.0;
To be among the least representative to the tissue: percentile <= 5.0;

Output:

All the code is in 7-selectTFsByRandomSample.py.

References

Khan Academy. Gene regulation.

RNA-Seq Blog. RPKM, FPKM and TPM clearly explained.

Lard Statistics. Standard Score.

pentalpha / tfbs-studies

tfbs-studies

Tasks:

Task 1. Finding TFBSs for each gene

Task 2: Finding Transcription Factors related with tissues using FPKM values

Task 2.1: Filtering intersection .bed archive from previous task

Task 2.2: Transcription Factor representativity in tissues

Task 2.3: Filtering pairs of Tissue with TF using Z-Score and Quantile

Task 3A: Find TFs related to tissues using χ2 test

Task 3A.1 Make TFs files per Tissue

Task 3A.2 Filtering quantity of genes for every TF in every tissue

Task 3A.3 Make χ2 test for every TF in every tissue

Task 3B: Finding TFs related to tissues using Monte Carlo approach (compare with random samples)

References

About

Languages