Critical Assessment of Function Annotation (CAFA), is a community-wide challenge designed to provide a large-scale assessment of computational methods dedicated to predicting protein function.
More information can be found at http://biofunctionprediction.org/cafa/ as well as the CAFA2 paper (Jiang et al, 2016)
This toolset provides an assessment for CAFA submissions.
For bug reports, comments or questions, please email nzhou[AT]iastate.edu.
- Python 3
- Python packages can be downloaded from their sites or installed from repositories:
$ sudo apt install python-biopython python-yaml python-matplotlib python-seaborn
We provide two main functions to assist in the evaluation of GO-term prediction within the scope of CAFA, the main assessment function and the plot function.
assessment.py
- Only input needed is the configuration file
config.yaml
, where the following parameters are specified in the first sectionassessment
. - First parameter
prediction_path
: prediction file formatted according to CAFA3 formats - Second parameter
obo_path
: path of the gene ontology obo file. The latest version can be downloaded here. Note that the obo file used here should not be older than the one used in the prediction. - Third parameter
ic_path
: path of the calculated infromation content map. This must be made prior to runningassessment.py
. This is achieved by runningIC.py
in the ICTool directory, instructions below in auxillary functions. - Fourth parameter
benchmark_path
: directory of the benchmark folder. Specific formats are required for the benchmark folder, including two sub-directories: groundtruth and lists. Please refer to auxiliary functionbenchmark_folder.py
for the creation of this folder, as well as the genral creation of benchmarks. An example benchmark folder is given in this repository./assessment/benchmark
- Fifth parameter
results_path
: Folder where results are saved. Arawdata
folder will be created within the results folder. - Seventh parameter
gaf_path
: Location of the IC File - Eigth parameter
gaf_path
: Location of the GAF File (with respect to the ICTool directory if relative) - Note that only the first section
assessment
of the configuration file is used here, the rest of the configuration file can be ignored for this function
- Only input needed is the configuration file
CAFA3 released its protein targets in September 2016. Each protein target has a unique CAFA3 ID. To run the above assessment function, each protein should be represented by its CAFA3 ID. However, the benchmark proteins generated by the benchmark creation tool are identified by UniProt Accession IDs. Therefore, we here provide functions to convert between UniProt IDs and CAFA3 IDs. We also provide a function that converts benchmark files generated by the benchmark creation tool to a benchmark folder that can feed into this program.
-
benchmark_folder.py
- Refer to
python benchmark_folder.py -h
for syntax of using this function by itself. - If using our benchmark creation tool, then the
benchmark_pipeline.sh
file is a good example of how to generate a benchmark folder forassessment.py
from the raw benchmarks. - Input your own folder names and different gaf file names in the blanks left in
benchmark_pipeline.sh
.
- Refer to
-
./ID_conversion/ID_conversion.py
- Two functions are written in this python script, one converts UniProt Accessions to CAFA3 IDs, the other function converts the other way around.
- First function
uniprotac_to_cafaid(taxon, uniprotacs)
. - Second function
cafaid_to_uniprot(taxon, cafaids)
. - Refer to comments in the script
./ID_conversion/ID_conversion.py
and third example below for usage.
-
./ICTool/IC-GAF.py
- Calculates the Information content of a given OBO
- How to use: From the ICTool directory
IC.py ../config.yaml
. - It uses the assessement config file, so once you have that setup, this will run on the same file set.
-
./ICTool/IC-LIST.py
- Calculates the Information content of a given a list of Proteins, Terms and Ontologies.
- How to use: From the ICTool directory
IC.py ../config.yaml
. - It uses the assessement config file, so once you have that setup, this will run on the same file set.
-
./ICTool/IC-MATRIX.py
- Calculates the Information content of a given a list of Proteins, Terms and Ontologies.
- How to use: From the ICTool directory
IC.py ../config.yaml
. - It uses the assessement config file, so once you have that setup, this will run on the same file set.
./assessment.py config.yaml
./ID_conversion/ID_conversion.py ./ID_conversion/example_uniprot_accession_8355.txt 8355 ./ID_conversion/example_output.txt
./benchmark_pipeline.sh
Jiang, Yuxiang, et al. "An expanded evaluation of protein function prediction methods shows an improvement in accuracy." Genome biology 17.1 (2016): 184.