This repository contains development datasets for P2Rank ion binding site prediction.
P2Rank repo: https://github.com/rdk/p2rank
Directories:
_incoming
original received files before tast/train splits_processing
... contains notes related to prosessing of datasetsZN
,MG
, ... datasets for individual ions
Dataset files are based on 2 main datasets: ZN_ppu
and ZN_benchmark
:
Description
- ppu: It contains X-ray structures from PDB, with a resolution of 3.0 Å or better, that include at least one (ion) ligand of interest. The dataset has been treated for protein redundancy by applying a sequence identity cutt-off of 30%, and for chain redundancy with UniProt, so as to discard identical chains within a single protein. It has been filtered to exclude structures from the benchmark dataset.
- benchmark: This dataset has been used in the literature for evaluating previous predictors. It contains a sub-set of ion-binding proteins from the BioLip database, with a pairwise sequence identity below 30% and a length over 50 residues. The dataset collectively includes proteins that bind nine metal ligands.
- biolip: TODO
Each of them contain _exposed
and _buried
version.
Description
- buried: It contains (ion) ligands that are located below the computed solvent accessible surface area of the protein structure that they bind to, and are thus considered inaccessible to the solution.
- exposed: It contains (ion) ligands that have a non-zero computed solvent accessible surface area, and are thus at least partially exposed to the solution.
There is also _all
version for ZN_ppu
(and for ZN_benchmark
).
Note: ZN_ppu_all.ds
is not just union of lines from ZN_ppu_buried.ds
and ZN_ppu_exposed.ds
because a single protein can contain both buried and exposed ions.
Consequently, dataset file lines for such protein differ in buried and exposed .ds
files.
Main datasets (exposed/buried split)
ZN_benchmark_exposed.ds
ZN_benchmark_buried.ds
ZN_benchmark_all.ds
ZN_ppu_exposed.ds
ZN_ppu_buried.ds
ZN_ppu_all.ds
Additionaly there are datasets that represent train/test split of ZN_ppu
.
First, single random split of ZN_ppu_all.ds
was performed: split "A" (train:66%, test:34%):.
Then ZN_ppu_exposed.ds
and ZN_ppu_buried.ds
were split according to subsets of proteins from split A.
ZN_ppu_all_Atrain.ds
ZN_ppu_all_Atest.ds
ZN_ppu_exposed_Atrain.ds
: subset ofZN_ppu_buried.ds
containing only proteins fromZN_ppu_all_Atrain.ds
ZN_ppu_exposed_Atest.ds
: subset ofZN_ppu_exposed.ds
containing only proteins fromZN_ppu_all_Atest.ds
ZN_ppu_buried_Atrain.ds
: subset ofZN_ppu_buried.ds
containing only proteins fromZN_ppu_all_Atrain.ds
ZN_ppu_buried_Atest.ds
: subset ofZN_ppu_buried.ds
containing only proteins fromZN_ppu_all_Atest.ds
Consequently, it is possible to train and test on different versions (all
/exposed
/buried
) in a single run:
e.g. train on ZN_ppu_all_Atrain.ds
and test on ZN_ppu_exposed_Atest.ds
or ZN_ppu_buried_Atest.ds
.
Similarly as in ZN datasets:
- There are 2 main datasets:
MG_ppu
andMG_benchmark
. - Each of them contain
_exposed
and_buried
version.
There is also _all
version for MG_ppu
(and for MG_benchmark
).
Note: MG_ppu_all.ds
is not just union of lines from MG_ppu_buried.ds
and MG_ppu_exposed.ds
because a single protein can contain both buried and exposed ions.
Consequently, dataset file lines for such protein differ in buried and exposed .ds
files.
Main datasets (exposed/buried split)
MG_benchmark_exposed.ds
MG_benchmark_buried.ds
MG_benchmark_all.ds
MG_ppu_exposed.ds
MG_ppu_buried.ds
MG_ppu_all.ds
Similarly as in ZN datasets: Additionaly there are datasets that represent train/test split of MG_ppu
.
First, single random split of MG_ppu_all.ds
was performed: split "A" (train:66%, test:34%):.
Then MG_ppu_exposed.ds
and MG_ppu_buried.ds
were split according to subsets of proteins from split A.
MG_ppu_all_Atrain.ds
MG_ppu_all_Atest.ds
MG_ppu_exposed_Atrain.ds
: subset ofMG_ppu_buried.ds
containing only proteins fromMG_ppu_all_Atrain.ds
MG_ppu_exposed_Atest.ds
: subset ofMG_ppu_exposed.ds
containing only proteins fromMG_ppu_all_Atest.ds
MG_ppu_buried_Atrain.ds
: subset ofMG_ppu_buried.ds
containing only proteins fromMG_ppu_all_Atrain.ds
MG_ppu_buried_Atest.ds
: subset ofMG_ppu_buried.ds
containing only proteins fromMG_ppu_all_Atest.ds
Consequently, it is possible to train and test on different versions (all
/exposed
/buried
) in a single run:
e.g. train on MG_ppu_all_Atrain.ds
and test on MG_ppu_exposed_Atest.ds
or MG_ppu_buried_Atest.ds
.
Split B is renamed split A from Christos.
MG_ppu_all_Btrain.ds
MG_ppu_all_Btest.ds
MG_ppu_exposed_Btrain.ds
MG_ppu_exposed_Btest.ds
MG_ppu_buried_Btrain.ds
MG_ppu_buried_Btest.ds