This repository contains files that were used to execute protein thermostability prediction using sequence representations from a protein language model research.
Using curl
18534 HTML UniProt search results in Proteomes
database were downloaded. Downloaded HTML files were parsed
to extract UniParc IDs.
./scripts/data_download/get_UniProt_results_HTML.sh \
data/002/TSV/temperature_data.tsv
The results were filtered using ggrep
command to find
UniParc identifiers for each organism:
ggrep -oP '/proteomes/UP.........' data/003/HTML/*.html > \
data/003/proteome_UniParc_IDs_non_redundant_no_excluded.txt
14537 UniParc IDs were saved.
cat data/003/proteome_UniParc_IDs_non_redundant_no_excluded.txt | \
tr ':' '\t' | sed 's/\/proteomes\///g' | sed 's/data\/003\/HTML\///g' | \
sed 's/\.html//g' > \
data/003/proteome_TaxIDs_UniParc_IDs_non_redundant_no_excluded.tsv
There were 6411 unique [TaxID]_[Domain]_[Temperature_label]
names in the
list. The list contained several TaxIDs that referred to
different species. For example, the taxonomy identifier 996 was assigned
to Flavobacterium columnare (temperature 23 degrees Celsius),
Flexibacter columnaris (temperature 23 degrees Celsius), and
Cytophaga columnaris (temperature 21 degrees Celsius).
It was decided to keep a single proteome for one taxonomy identifier. There were 5787 proteomes left in the final list:
awk -F"\t" '!_[$1]++' \
data/003/proteome_UniParc_IDs_non_redundant_no_excluded.tsv | \
awk -F"\t" '!_[$2]++' > data/003/003.tsv
Sorting based on temperature:
cat data/003/003.tsv | tr '_' '\t' | sort -n -k3 | \
awk '{print $3"_"$1"_"$2"\t"$4}' > data/003/003_sorted.tsv
cp data/003/003_sorted.tsv data/003/003.tsv
diff data/003/003_sorted.tsv data/003/003.tsv
rm data/003/003_sorted.tsv
An example query to download a non-redundant proteome:
curl "https://www.uniprot.org/uniprot/?query=proteome:UP000053199&format=fasta"
The proteome download script is scripts/data_download/get_UniProt_proteomes.sh
.
Example usage:
./scripts/data_download/get_UniProt_proteomes.sh data/003/003.tsv
The data set was divided into training, validation, and testing subsets, which were approximatelly balanced and had non-overlapping taxonomy identifiers between the sets.
Picking 111 proteomes from each class and shuffling the list of proteomes:
ls data/003/FASTA/ | tr '_' '\t' | head -n -6 | sort -n -k1 | \
awk '$1>=65{ print $1"_"$2"_"$3}' | shuf | \
tail -n 111 > data/003/class_1_111_proteomes.txt
ls data/003/FASTA/ | tr '_' '\t' | head -n -6 | sort -n -k1 | \
awk '$1<65{ print $1"_"$2"_"$3}' | shuf | \
tail -n 111 > data/003/class_0_111_proteomes.txt
The script scripts/003_preembeddings_v2.sh
takes the following arguments:
- a list of proteomes of a certain class
- the initial index
- the number of files to take from the list
- the prefix of FASTA files (the directory of the proteomes)
./scripts/003_preembeddings_v2.sh data/003/class_1_111_proteomes.lst 0 77 \
data/003/FASTA/ > data/003/FASTA/training_v2.fasta
./scripts/003_preembeddings_v2.sh data/003/class_1_111_proteomes.lst 77 17 \
data/003/FASTA/ > data/003/FASTA/validation_v2.fasta
./scripts/003_preembeddings_v2.sh data/003/class_1_111_proteomes.lst 94 17 \
data/003/FASTA/ > data/003/FASTA/testing_v2.fasta
./scripts/003_preembeddings_v2.sh data/003/class_0_111_proteomes.lst 0 32 \
data/003/FASTA/ >> data/003/FASTA/training_v2.fasta
./scripts/003_preembeddings_v2.sh data/003/class_0_111_proteomes.lst 32 8 \
data/003/FASTA/ >> data/003/FASTA/validation_v2.fasta
./scripts/003_preembeddings_v2.sh data/003/class_0_111_proteomes.lst 40 11 \
data/003/FASTA/ >> data/003/FASTA/testing_v2.fasta
Set | # of proteomes (overall, class_0, class_1) | # of proteins (overall, class_0, class_1) |
---|---|---|
training | 109, 32, 77 | 288996, 145128, 143868 |
validation | 25, 8, 17 | 65820, 33204, 32616 |
testing | 28, 11, 17 | 74508, 38263, 36245 |
FASTA-splitter program was used to divide each of the sets into portions:
../programs/fasta-splitter.pl --n-parts 30 --out-dir data/003/FASTA/training_v2/ \
--nopad data/003/FASTA/training_v2/training_v2.fasta
../programs/fasta-splitter.pl --n-parts 7 --out-dir data/003/FASTA/validation_v2/ \
--nopad data/003/FASTA/validation_v2/validation_v2.fasta
../programs/fasta-splitter.pl --n-parts 8 --out-dir data/003/FASTA/testing_v2/ \
--nopad data/003/FASTA/testing_v2/testing_v2.fasta
sbatch --array=1-30 --output=training_v2.part-%a.slurm-%A_%a.out \
scripts/003/003_embeddings_training_v2.sh
sbatch --array=1-7 --output=validation_v2.part-%a.slurm-%A_%a.out \
scripts/003/003_embeddings_validation_v2.sh
sbatch --array=1-8 --output=testing_v2.part-%a.slurm-%A_%a.out \
scripts/003/003_embeddings_testing_v2.sh
Set | # of all proteins embeddings | # of class_0 embeddings | # of class_1 embeddings |
---|---|---|---|
training | 284309 (288996) | 141602 (145128) | 142707 (143868) |
validation | 65156 (65820) | 32793 (33204) | 32363 (32616) |
testing | 73663 (74508) | 37749 (38263) | 35913 (36245) |
./scripts/003/003_visualisation.py
In order to make embeddings data more universal for processing with various software tools, it was decided to save embeddings in CSV files. The files are headerless. The values are listed in order of:
- #0 Taxonomy ID of the organism, to which the sequence belongs
- #1 Accession ID of the sequence
- #2 Length of the sequence
- #3 Temperature label
- #4 - #1283 Components of embeddings
./scripts/003/003_convert_NPZ_to_CSV_training.py > data/003/CSV/training_v2.csv
./scripts/003/003_convert_NPZ_to_CSV_validation.py > data/003/CSV/validation_v2.csv
./scripts/003/003_convert_NPZ_to_CSV_testing.py > data/003/CSV/testing_v2.csv
File (in data/003/CSV ) |
Size (GB) |
---|---|
training_v2.csv | 3.034 |
validation_v2.csv | 0.696 |
testing_v2.csv | 0.786 |
./scripts/003/003_classificator_inferences_testing.py
A function calculate_MCC()
in scripts/file_actions.py
was
implemented to calculate MCC. The function
takes in a TSV file in which there are two columns: a column of
true temperature labels (normalised by division of 100)
and a column of predictions (values from 0 to 1). The function
was called in script scripts/003/003_calculate_MCC.py
.
./scripts/003/003_calculate_MCC.py -p \
results/SLP/003/validation_predictions_epoch_5.tsv --true-label-idx 0 \
--prediction-idx 1 --threshold 0.65 --header
./scripts/003/003_calculate_MCC.py -p \
results/SLP/003/testing_v2_predictions.tsv --true-label-idx 2 \
--prediction-idx 3 --threshold 65
MCC for 003 v2 validation (final epoch) was: 0.84278. MCC for 003 v2 testing was: 0.8478.
Commands to plot protein sequence distribution histogram for the subsets:
./scripts/003/003_plot_histogram.py data/003/TSV/training_v2_tensors.tsv 3 \
data/003/visualisation_v2/training_v2_histo.png Training
./scripts/003/003_plot_histogram.py data/003/TSV/validation_v2_tensors.tsv 3 \
data/003/visualisation_v2/validation_v2_histo.png Validation
./scripts/003/003_plot_histogram.py data/003/TSV/testing_v2_tensors.tsv 3 \
data/003/visualisation_v2/testing_v2_histo.png Testing