Previously, I wrote a script, GTExExtractor, to show the distribution of expression of multiple genes in a single GTEx tissue as violin plots. Here, I present GTExExtractor.2, which shows the distribution of expression of a single gene across multiple GTEx tissues selected by the user. The script automates this process for a list of input genes, outputting one PDF per gene.
GTExExtractor.2 is a combined bash/R script that extracts individual-level data from the GTEx database and plots RPKM distributions for the genes of interest as violin plots. Individual-level data for all GTEx tissues are stored in the file GTEx_Analysis_v6p_RNA-seq_RNA-SeQCv1.1.8_gene_rpkm.gct. If this file is not present in the working directory, the script downloads it; if it is present, the download is skipped. The script reads gene names from the provided genes.txt file and extracts the RPKM values for those genes from all GTEx samples. Values for each gene are stored in separate files.

Tissues of interest have to be listed in a file tissues.txt. To determine the sample IDs for these tissues, GTExExtractor.2 checks for the file GTEx_Data_V6_Annotations_SampleAttributesDS.txt, which contains the sample IDs for each GTEx tissue, and downloads it if it is not already present. The script extracts from this file the sample IDs for each tissue listed in tissues.txt. These sample IDs are then matched against the IDs in the per-gene RPKM files generated in the previous step, producing tissue-specific expression data for each gene. The resulting files are combined into a table, which is then imported into R with Rscript and plotted as a violin plot.
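The steps described above can be illustrated on toy data. The following is a minimal sketch, not the actual code of GTExExtractor.2: the helper `fetch_if_missing`, the toy file names, the placeholder URL, and the output name `SMAD3.table.txt` are all hypothetical. It also assumes that the expression file follows the GCT layout (two preamble lines, then a header row with sample IDs starting in column 3) and that the annotation file holds the sample ID in column 1 and the tissue name in column 7.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a scratch directory so the real working folder is untouched.
workdir=$(mktemp -d)
cd "$workdir"

# Toy expression table in the assumed GCT layout.
printf '#1.2\n2\t3\n' > toy.gct
printf 'Name\tDescription\tS1\tS2\tS3\n' >> toy.gct
printf 'ENSG01\tSMAD3\t1.5\t2.5\t3.5\n' >> toy.gct
printf 'ENSG02\tAHR\t4.0\t5.0\t6.0\n' >> toy.gct

# Toy annotation table: sample ID in column 1, tissue name in column 7.
printf 'SAMPID\tc2\tc3\tc4\tc5\tc6\tSMTSD\n' > toy_attributes.txt
printf 'S1\t.\t.\t.\t.\t.\tSpleen\n' >> toy_attributes.txt
printf 'S2\t.\t.\t.\t.\t.\tStomach\n' >> toy_attributes.txt
printf 'S3\t.\t.\t.\t.\t.\tTestis\n' >> toy_attributes.txt

printf 'SMAD3\n' > genes.txt
printf 'Spleen\nStomach\n' > tissues.txt

# Hypothetical helper: download a file only when it is not already present.
fetch_if_missing() {
    local file=$1 url=$2
    if [ -f "$file" ]; then
        echo "$file already present, skipping download"
    else
        wget -O "$file" "$url"
    fi
}

# The toy file exists, so the placeholder URL is never contacted.
fetch_if_missing toy.gct "https://example.org/placeholder"

# For every gene and tissue, keep only the columns of samples from that tissue.
while IFS= read -r gene; do
    [ -n "$gene" ] || continue
    out="$gene.table.txt"
    : > "$out"
    while IFS= read -r tissue; do
        [ -n "$tissue" ] || continue
        awk -F'\t' -v gene="$gene" -v tissue="$tissue" '
            # First file: remember sample IDs annotated with this tissue.
            NR == FNR { if ($7 == tissue) wanted[$1] = 1; next }
            # Second file, line 3: header row; mark columns of wanted samples.
            FNR == 3 { for (i = 3; i <= NF; i++) if ($i in wanted) col[i] = 1; next }
            # Gene row: print the tissue and the value of every marked column.
            FNR > 3 && $2 == gene {
                for (i = 3; i <= NF; i++) if (i in col) print tissue "\t" $i
            }
        ' toy_attributes.txt toy.gct >> "$out"
    done < tissues.txt
done < genes.txt

cat SMAD3.table.txt
```

The result is a long-format table with one tissue/value pair per line, a shape that is convenient to read into R for violin plots. The real script may organize its intermediate files differently.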
Content of the genes.txt
SMAD3
AHR
TCF21
Content of the tissues.txt
Artery - Aorta
Artery - Coronary
Artery - Tibial
Muscle - Skeletal
Spleen
Stomach
Testis
Thyroid
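Both input files can be written directly from the shell. The sketch below assumes that the script expects one name per line, which the description does not state explicitly. It also adds a hypothetical sanity check for non-ASCII characters: tissue names copied from a web page can carry a typographic dash instead of a plain hyphen, and such a name would not match the annotation file.

```shell
# Write the example gene list, one gene per line (assumed format).
cat > genes.txt <<'EOF'
SMAD3
AHR
TCF21
EOF

# Write the example tissue list, one tissue per line, using ASCII hyphens.
cat > tissues.txt <<'EOF'
Artery - Aorta
Artery - Coronary
Artery - Tibial
Muscle - Skeletal
Spleen
Stomach
Testis
Thyroid
EOF

# Report any line that contains a character outside printable ASCII.
if LC_ALL=C grep -n '[^ -~]' tissues.txt; then
    echo "tissues.txt contains non-ASCII characters" >&2
fi
```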
To run the script
wget https://raw.githubusercontent.com/milospjanic/GTExExtractor.2/master/GTExExtractor.2.sh
chmod 755 GTExExtractor.2.sh
./GTExExtractor.2.sh
Check the collection of output PDF files in the working folder. The first is AHR.output_gtexex.pdf:
Then, for the second gene, SMAD3.output_gtexex.pdf:
And finally, for the third gene, TCF21.output_gtexex.pdf:
An example output for the complete GTEx set of tissues:
To get the list of all GTEx tissues, run the following command in the working folder (exclude the header entry SMTSD and Cells - Leukemia cell line (CML) from the result):
cut -f7 GTEx_Data_V6_Annotations_SampleAttributesDS.txt |sort |uniq
Adipose - Subcutaneous
Adipose - Visceral (Omentum)
Adrenal Gland
Artery - Aorta
Artery - Coronary
Artery - Tibial
Bladder
Brain - Amygdala
Brain - Anterior cingulate cortex (BA24)
Brain - Caudate (basal ganglia)
Brain - Cerebellar Hemisphere
Brain - Cerebellum
Brain - Cortex
Brain - Frontal Cortex (BA9)
Brain - Hippocampus
Brain - Hypothalamus
Brain - Nucleus accumbens (basal ganglia)
Brain - Putamen (basal ganglia)
Brain - Spinal cord (cervical c-1)
Brain - Substantia nigra
Breast - Mammary Tissue
Cells - EBV-transformed lymphocytes
Cells - Leukemia cell line (CML)
Cells - Transformed fibroblasts
Cervix - Ectocervix
Cervix - Endocervix
Colon - Sigmoid
Colon - Transverse
Esophagus - Gastroesophageal Junction
Esophagus - Mucosa
Esophagus - Muscularis
Fallopian Tube
Heart - Atrial Appendage
Heart - Left Ventricle
Kidney - Cortex
Liver
Lung
Minor Salivary Gland
Muscle - Skeletal
Nerve - Tibial
Ovary
Pancreas
Pituitary
Prostate
Skin - Not Sun Exposed (Suprapubic)
Skin - Sun Exposed (Lower leg)
Small Intestine - Terminal Ileum
Spleen
Stomach
Testis
Thyroid
Uterus
Vagina
Whole Blood
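The two exclusions can also be applied in the pipeline itself. The function below is a sketch under two assumptions: that SMTSD is the header on the first line of the annotation file, and that the tissue name is in column 7, as in the command above. The function name `list_tissues` is hypothetical.

```shell
# Print the unique tissue names from a sample attributes file,
# without the header line and without the CML cell line entry.
list_tissues() {
    # $1: path to the tab-delimited sample attributes file
    cut -f7 "$1" \
        | tail -n +2 \
        | sort -u \
        | grep -v -x -F 'Cells - Leukemia cell line (CML)'
}

# Usage (writes every remaining tissue to tissues.txt):
# list_tissues GTEx_Data_V6_Annotations_SampleAttributesDS.txt > tissues.txt
```

Here `tail -n +2` drops the header line before sorting, and `grep -v -x -F` removes only lines that match the CML entry exactly, treating the parentheses as literal characters.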