milospjanic / GTExExtractor.2

GTExExtractor.2 is a script that downloads and parses the individual-level GTEx data set for all tissues, together with the GTEx sample IDs. It shows the distribution of expression of a single gene across multiple user-selected GTEx tissues, and automates this process for a list of input genes.

GTExExtractor.2

Previously, I wrote a script, GTExExtractor, to show the distribution of expression of multiple genes in a single GTEx tissue in the form of violin plots. Here, I wrote a script, GTExExtractor.2, that shows the distribution of expression of a single gene across multiple user-selected GTEx tissues. The script automates this process for a list of input genes, outputting one PDF file per gene.

GTExExtractor.2 is a combined bash/R script that extracts individual-level data from the GTEx database and plots RPKM distributions for the genes of interest as violin plots.

GTExExtractor.2 downloads the individual-level data for all GTEx tissues, stored in the file GTEx_Analysis_v6p_RNA-seq_RNA-SeQCv1.1.8_gene_rpkm.gct, if that file is not present in the working directory. If the file is present, the script skips the download. The script reads gene names from the provided genes.txt file and extracts the RPKM values for those genes from all GTEx samples. Values for each gene are stored in separate files.

Tissues of interest have to be listed in the file tissues.txt. To determine the sample IDs for those tissues, GTExExtractor.2 checks for the file GTEx_Data_V6_Annotations_SampleAttributesDS.txt, which contains the sample IDs for each GTEx tissue, and downloads it if it is not already present. The script extracts from this file the sample IDs for each tissue listed in tissues.txt. These sample IDs are then matched against the IDs in the per-gene RPKM files generated in the previous step, producing tissue-specific expression data for each gene. The resulting files are combined into a table, which is then imported into R with Rscript and plotted as a violin plot.
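The matching step described above can be illustrated on toy data. The following is only a sketch, not the code from GTExExtractor.2.sh: the file names toy.gct and toy_attributes.txt, the directory gtex_demo, and the sample IDs S1 to S4 are made up for illustration, and it assumes the usual GCT layout (version line, dimensions line, header row, then one row per gene with the gene symbol in column 2) and an annotation file with the sample ID in column 1 and the tissue name (SMTSD) in column 7.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical scratch directory for the toy files.
demo_dir="gtex_demo"
mkdir -p "$demo_dir"

# Toy stand-in for the GCT expression file.
{
    printf '#1.2\n2\t4\n'
    printf 'Name\tDescription\tS1\tS2\tS3\tS4\n'
    printf 'ENSG01\tSMAD3\t1.5\t2.5\t3.5\t4.5\n'
    printf 'ENSG02\tAHR\t10\t20\t30\t40\n'
} > "$demo_dir/toy.gct"

# Toy stand-in for the sample annotation file (assumed: ID in column 1,
# tissue name in column 7).
{
    printf 'SAMPID\tc2\tc3\tc4\tc5\tc6\tSMTSD\n'
    printf 'S1\t.\t.\t.\t.\t.\tSpleen\n'
    printf 'S2\t.\t.\t.\t.\t.\tArtery - Aorta\n'
    printf 'S3\t.\t.\t.\t.\t.\tSpleen\n'
    printf 'S4\t.\t.\t.\t.\t.\tThyroid\n'
} > "$demo_dir/toy_attributes.txt"

gene="SMAD3"
tissue="Spleen"

# Step 1: collect the sample IDs annotated with the tissue of interest.
awk -F'\t' -v t="$tissue" 'NR > 1 && $7 == t { print $1 }' \
    "$demo_dir/toy_attributes.txt" > "$demo_dir/ids.txt"

# Step 2: for the gene of interest, write "sampleID<TAB>value" per sample.
awk -F'\t' -v g="$gene" '
    NR == 3           { for (i = 3; i <= NF; i++) id[i] = $i }
    NR > 3 && $2 == g { for (i = 3; i <= NF; i++) print id[i] "\t" $i }
' "$demo_dir/toy.gct" > "$demo_dir/gene_values.txt"

# Step 3: keep only the values whose sample ID belongs to the tissue.
awk -F'\t' 'NR == FNR { keep[$1] = 1; next } $1 in keep { print $2 }' \
    "$demo_dir/ids.txt" "$demo_dir/gene_values.txt" \
    > "$demo_dir/${gene}.${tissue}.txt"

cat "$demo_dir/${gene}.${tissue}.txt"
```

With the toy data, only samples S1 and S3 are annotated as Spleen, so the result file holds the two SMAD3 values for those samples. Repeating steps 1 and 3 per tissue and step 2 per gene gives the tissue-specific values that the description above says are combined into a table for plotting.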

Example of usage

Content of the genes.txt

SMAD3
AHR
TCF21

Content of the tissues.txt

Artery - Aorta
Artery - Coronary
Artery - Tibial
Muscle - Skeletal
Spleen
Stomach
Testis
Thyroid
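One possible way to create the two input files from the command line is sketched below. It assumes the script expects one name per line, and that tissue names must match the names in the annotation file character for character, including the plain ASCII hyphen. The final check is an optional extra, not part of GTExExtractor.2.

```shell
# genes.txt: one gene symbol per line.
printf '%s\n' SMAD3 AHR TCF21 > genes.txt

# tissues.txt: one tissue name per line, written with a plain ASCII hyphen.
printf '%s\n' \
    'Artery - Aorta' \
    'Artery - Coronary' \
    'Artery - Tibial' \
    'Muscle - Skeletal' \
    'Spleen' \
    'Stomach' \
    'Testis' \
    'Thyroid' > tissues.txt

# Optional sanity check: report lines containing anything other than
# printable ASCII (for example a typographic minus sign pasted from a web page).
if LC_ALL=C grep -n '[^ -~]' tissues.txt; then
    echo "warning: non-ASCII characters found in tissues.txt" >&2
fi
```

A typographic dash copied from a rendered page looks almost identical to a hyphen but would not match the tissue names in the annotation file, which is why the check may be worth running.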

To run the script

wget https://raw.githubusercontent.com/milospjanic/GTExExtractor.2/master/GTExExtractor.2.sh
chmod 755 GTExExtractor.2.sh
./GTExExtractor.2.sh

Check the collection of output PDF files in the working folder. The first one is AHR.output_gtexex.pdf:

[Figure: AHR.output_gtexex.pdf]

Then for the second gene, SMAD3.output_gtexex.pdf:

[Figure: SMAD3.output_gtexex.pdf]

And finally the third gene, TCF21.output_gtexex.pdf:

[Figure: TCF21.output_gtexex.pdf]

An example output for the complete GTEx set of tissues:

[Figure: example output for all GTEx tissues]

To get the list of all GTEx tissues, run the following command in the working folder (exclude SMTSD, which is the column header, and Cells - Leukemia cell line (CML)):

cut -f7 GTEx_Data_V6_Annotations_SampleAttributesDS.txt |sort |uniq

Adipose - Subcutaneous
Adipose - Visceral (Omentum)
Adrenal Gland
Artery - Aorta
Artery - Coronary
Artery - Tibial
Bladder
Brain - Amygdala
Brain - Anterior cingulate cortex (BA24)
Brain - Caudate (basal ganglia)
Brain - Cerebellar Hemisphere
Brain - Cerebellum
Brain - Cortex
Brain - Frontal Cortex (BA9)
Brain - Hippocampus
Brain - Hypothalamus
Brain - Nucleus accumbens (basal ganglia)
Brain - Putamen (basal ganglia)
Brain - Spinal cord (cervical c-1)
Brain - Substantia nigra
Breast - Mammary Tissue
Cells - EBV-transformed lymphocytes
Cells - Leukemia cell line (CML)
Cells - Transformed fibroblasts
Cervix - Ectocervix
Cervix - Endocervix
Colon - Sigmoid
Colon - Transverse
Esophagus - Gastroesophageal Junction
Esophagus - Mucosa
Esophagus - Muscularis
Fallopian Tube
Heart - Atrial Appendage
Heart - Left Ventricle
Kidney - Cortex
Liver
Lung
Minor Salivary Gland
Muscle - Skeletal
Nerve - Tibial
Ovary
Pancreas
Pituitary
Prostate
Skin - Not Sun Exposed (Suprapubic)
Skin - Sun Exposed (Lower leg)
Small Intestine - Terminal Ileum
Spleen
Stomach
Testis
Thyroid
Uterus
Vagina
Whole Blood
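The two exclusions can also be applied in the pipeline itself. The sketch below runs on a small made-up annotation file (toy_attributes.txt in a hypothetical directory gtex_tissues_demo) so that it can be tried without the real download; it assumes the same layout as the real file, with a header line and the tissue name in column 7.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical scratch directory and toy annotation file.
demo_dir="gtex_tissues_demo"
mkdir -p "$demo_dir"
{
    printf 'SAMPID\tc2\tc3\tc4\tc5\tc6\tSMTSD\n'
    printf 'S1\t.\t.\t.\t.\t.\tSpleen\n'
    printf 'S2\t.\t.\t.\t.\t.\tArtery - Aorta\n'
    printf 'S3\t.\t.\t.\t.\t.\tCells - Leukemia cell line (CML)\n'
    printf 'S4\t.\t.\t.\t.\t.\tSpleen\n'
} > "$demo_dir/toy_attributes.txt"

# Skip the header line, take column 7, drop the CML entry, and
# print each remaining tissue name once in sorted order.
tail -n +2 "$demo_dir/toy_attributes.txt" \
    | cut -f7 \
    | grep -vxF 'Cells - Leukemia cell line (CML)' \
    | sort -u > "$demo_dir/all_tissues.txt"

cat "$demo_dir/all_tissues.txt"
```

Pointing the same pipeline at GTEx_Data_V6_Annotations_SampleAttributesDS.txt should give a list that could be redirected into tissues.txt to plot all tissues, assuming the script accepts every name in that list.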
