Orthoinference: Generating Reactome's Computationally Inferred Reactions and Pathways

Data-release-pipeline: Orthology projection of human reactions and pathways to non-human species

This module has been rewritten from Perl to Java.

Additionally, Orthoinference now generates orthologous Stable Identifiers and PathwayDiagrams. StableIdentifier generation is the updated version of add_ortho_stable_ids.pl (from the GenerateOrthoStableIds release step); it now happens during orthoinference whenever a Pathway, ReactionlikeEvent or PhysicalEntity instance is inferred. Orthologous PathwayDiagram generation is completed after all ReactionlikeEvents and Pathways for a species have been inferred, and is based on the OrthoDiagrams release step, which itself used the PredictedPathwayDiagramGeneratorFromDB method from the CuratorTool repository.
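
For illustration, a minimal sketch of the stable identifier convention involved (the method name is hypothetical, not the actual orthoinference API; Reactome stable identifiers follow the R-<speciesCode>-<number> pattern, and an inferred instance keeps the numeric part of its human source):

// Hypothetical sketch: an inferred instance's stable identifier swaps the
// human species code for the target species code while keeping the numeric
// part, e.g. R-HSA-69620 -> R-MMU-69620 for mouse.
static String inferStableId(String humanStableId, String speciesCode) {
    return humanStableId.replaceFirst("^R-HSA-", "R-" + speciesCode.toUpperCase() + "-");
}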

The Inference Process

In a nutshell, the inference process follows this workflow:

Orthoinference Overview Image

For each species, we take all human ReactionlikeEvent (RlE) instances (Reaction, BlackBoxEvent, Polymerisation, Depolymerisation, FailedReaction) in the release_current database, which is a copy of the slice_current database taken after UpdateStableIds has been run. For an inference to be attempted on an RlE instance, a few basic rules must be followed: it must pass a series of filters and have at least one protein instance, as determined using the ProteinCountUtility.

If the RlE passes all these tests, it is considered eligible for inference. Inference is first attempted on the RlE's input and output attributes, followed by catalyst and regulation inference attempts. If the input or output inference fails, the process is terminated for that RlE, since inputs and outputs are required components of any ReactionlikeEvent.

During inference, the attribute (input/output/catalyst/regulation) is broken down into its individual PhysicalEntity (PE) instances. These PE instances are each run through the createOrthoEntity method, which infers according to PE type: GenomeEncodedEntity/EntityWithAccessionedSequence, Complex/Polymer, EntitySet or SimpleEntity. In cases where the PE itself contains multiple GKInstance attributes (e.g. a Complex's hasComponent attribute or an EntitySet's hasMember attribute), these too are run through the createOrthoEntity method and inferred. Through this system, PE instances are recursively broken down until they reach the EntityWithAccessionedSequence (EWAS) level and are inferred in the inferEWAS method.
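
As a rough, self-contained sketch of that recursion (the types and helpers below are simplified stand-ins, not the real orthoinference classes):

import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a PhysicalEntity instance.
class PE {
    String type;                           // "EWAS", "Complex", "Polymer", "EntitySet", "SimpleEntity", ...
    List<PE> children = new ArrayList<>(); // hasComponent / hasMember / repeatedUnit values
    PE(String type) { this.type = type; }
}

class OrthoEntitySketch {
    // Sketch of createOrthoEntity: dispatch on PE type and recurse until the
    // EWAS base case is reached.
    PE createOrthoEntity(PE pe, String species) {
        switch (pe.type) {
            case "EWAS":
                return inferEWAS(pe, species);  // base case
            case "Complex":
            case "Polymer":
            case "EntitySet": {
                PE ortho = new PE(pe.type);
                for (PE child : pe.children) {  // recursively infer contained PEs
                    PE inferredChild = createOrthoEntity(child, species);
                    if (inferredChild != null) {
                        ortho.children.add(inferredChild);
                    }
                }
                return ortho;
            }
            case "SimpleEntity":
                return pe;   // treated here as species-independent and reused as-is
            default:
                return null; // other GenomeEncodedEntities are handled separately in the real code
        }
    }

    PE inferEWAS(PE ewas, String species) {
        // Placeholder: the real inferEWAS maps the protein to its ortholog(s)
        // using the Orthopairs files and builds a new EWAS for the species.
        return new PE("EWAS");
    }
}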

After all valid ReactionlikeEvent instances have been inferred for a species, the final step is to populate the Pathway instances these RlEs are found in. Pathway inference takes place in the PathwaysInferrer class and involves recreating the hierarchy structure of the human Pathway equivalent.
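
A self-contained sketch of that mirroring, under the assumption that each inferred RlE is connected upward through species-specific copies of its human parent Pathways (class and method names here are illustrative, not the real PathwaysInferrer API):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for an Event (Pathway or ReactionlikeEvent) instance.
class Event {
    String name;
    String species;
    List<Event> hasEvent = new ArrayList<>();
    Event(String name, String species) { this.name = name; this.species = species; }
}

class PathwaysInferrerSketch {
    // Each human Pathway is mirrored at most once per species.
    Map<Event, Event> inferredPathways = new HashMap<>();

    // Connect an inferred RlE to mirrored copies of its human ancestor
    // Pathways, ordered from immediate parent up to the TopLevelPathway.
    void connect(Event inferredRlE, List<Event> humanAncestors, String species) {
        Event child = inferredRlE;
        for (Event humanPathway : humanAncestors) {
            Event ortho = inferredPathways.computeIfAbsent(
                    humanPathway, hp -> new Event(hp.name, species));
            if (!ortho.hasEvent.contains(child)) {
                ortho.hasEvent.add(child); // link via the hasEvent attribute
            }
            child = ortho;
        }
    }
}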

Once the database has been updated with inferred Pathways, orthologous PathwayDiagram generation takes place. This generates the diagrams that are visible in Reactome's Pathway Browser for each species, based on the existing human diagrams in the database.

Preparing Orthoinference

Orthoinference can be run once the UpdateStableIds step has completed. Historically, it was run following the now-deprecated MyISAM step. Before running the new Java Orthoinference code, there are a few requirements:

  • Orthopair file generation must have been completed.
  • The slice_current database will need to be dumped and restored to a new release_current database.
  • Set all values in the config.properties file (see below).
  • normal_event_skip_list.txt needs to be placed in src/main/resources/ folder.

Setting config.properties

Create or update the config.properties file in the orthoinference/src/main/resources/ folder, setting the following properties to match the current release:

### Sample config.properties file for Orthoinference
username=mysqlUsername
password=mysqlPassword
database=test_reactome_##
host=localhost
port=3306
pathToSpeciesConfig=src/main/resources/Species.json
pathToOrthopairs=path/to/orthopairs/
releaseNumber=releaseNumber
dateOfRelease=yyyy-mm-dd
personId=reactomePersonInstanceId
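
For reference, these values can be read with the standard java.util.Properties API; a minimal sketch (the actual orthoinference entry point may load them differently):

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class ConfigSketch {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        // Load the release configuration described above.
        try (FileInputStream in = new FileInputStream("src/main/resources/config.properties")) {
            props.load(in);
        }
        String database = props.getProperty("database");
        int port = Integer.parseInt(props.getProperty("port", "3306"));
        // Unset properties come back as null, so missing values fail fast here.
        System.out.println("Will connect to " + database + " on port " + port);
    }
}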

Orthoinference skiplists

Historically, the list of ReactionlikeEvents skipped during Orthoinference was manually created by a Curator. As of August 2020, this process has been automated in two ways: 1) a static skiplist that is hard-coded into the orthoinference code, and 2) automated enforcement based on the ReactionlikeEvent's membership in the Disease TopLevelPathway.

The static skiplist currently consists of the HIV Infection (DbId: 162906), Influenza Infection (DbId: 168255) and Amyloid Fiber Formation (DbId: 977225; the only non-Disease instance in the skiplist) Pathways. Orthoinference creates a skiplist of all ReactionlikeEvents contained within these Pathways.

The automated skiplist is focused solely on skipping inference for instances that are children of the Disease TopLevelPathway. If a ReactionlikeEvent exists only as a child of the Disease pathway, then its inference will be skipped. In cases where a ReactionlikeEvent is a member of the Disease AND another TopLevelPathway, Reaction inference proceeds as normal. During Pathway inference, inference of the non-Disease pathways is allowed, while inference of the Disease Pathway (and its children) is suppressed for the instance.
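
In effect, the automated rule reduces to the following check (a sketch only; the real implementation inspects the event hierarchy to collect the TopLevelPathways an RlE belongs to):

import java.util.Set;

class DiseaseSkipSketch {
    // Skip inference only when the RlE exists exclusively under the Disease
    // TopLevelPathway; Disease plus any other TopLevelPathway proceeds as normal.
    static boolean skipInference(Set<String> topLevelPathwayNames) {
        return topLevelPathwayNames.size() == 1 && topLevelPathwayNames.contains("Disease");
    }
}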

Running Orthoinference

Note: For the orthoinference steps and QA, it is recommended that Java 8 and MySQL 5.5 or 5.7 be used.

Once all prerequisites have been completed, running the runOrthoinference.sh script will begin the process. This bash script performs a git pull to update the repo with any changes that may have happened between releases. It then builds the orthoinference jar file with all dependencies and executes it for each species that will be projected to.

Note: To run orthoinference on particular species, modify the 'allSpecies' array in runOrthoinference.sh so that it only contains the species you wish to project to. Alternatively, if the jar file has been built and only one species needs to be inferred, run the following command:
java -jar target/orthoinference-[version]-jar-with-dependencies.jar [speciesCode]

  • Replace '[speciesCode]' with the 4-letter species code corresponding to the species you wish to infer to.
  • Replace '[version]' with the build version of the jar (e.g. 0.0.1-SNAPSHOT).
  • Orthoinference benefits from an increased memory heap, which can be set with the -Xmx####m flag placed before -jar (e.g. java -Xmx8192m -jar ... for an 8 GB heap).

During orthoinference, many files are produced:

  • Log files in the logs/ folder provide information pertaining to each inference attempt and are useful for tracing errors.
    • They are organized by time stamp.
  • eligible_(speciesCode)_75.txt lists all ReactionlikeEvents that can be inferred. This should be the same for all species.
    • The 75 refers to the percent of distinct proteins that must exist in Complex/Polymer instances for an inference attempt to continue. It is a holdover name from Perl Orthoinference.
  • inferred_(speciesCode)_75.txt lists all ReactionlikeEvents that were successfully inferred for the species.
  • report_ortho_inference_test_reactome_##.txt shows the percentage of inferences that were successful for each species.

Once the Java code has finished, verify that all eligible_(speciesCode)_75.txt files have the same number of lines (e.g. with wc -l eligible_*_75.txt). If the line counts differ, something likely went wrong during inference and will need to be investigated.

Finally, once you have verified that orthoinference seems to have run correctly, run updateDisplayName.pl.

Verifying Orthoinference

Once all scripts in the previous step have been run, there is a QA process that should be followed. Orthoinference is a foundational step of the Reactome release pipeline, and ensuring that it worked as expected will save considerable time later in the release process if anything went wrong.

Recommended QA

  1. Compare the line counts of the eligible_(speciesCode)_75.txt and inferred_(speciesCode)_75.txt files to those from the previous release. Are they relatively close to each other? If the eligible counts are significantly smaller for a particular species, check the Orthopair files corresponding to that species. Are the eligible and/or inferred line counts considerably smaller for all species? Something may have gone wrong during the inference process itself; check the log files to see if anything obvious jumps out. Otherwise, more extensive troubleshooting of the script itself will be required.

Next, we want to make sure that the new release_current database can be imported to the graphDb in Neo4j and that it reports acceptable graph QA numbers.

It is recommended that the following steps be run on your workstation.

  1. Run the graph-importer module. This should take some time (~30 minutes) and will output the results from database-checker as well as the imported graphDb in target/graph.db/.

    • The database-checker results should resemble the following:
    The consistency check finished reporting 13 as follows:
             1 entries for (Taxon, crossReference)
    
    Valid schema class check finish reporting 1 entry
    

    The database-checker module looks for required instance attributes (as determined by the data model) that are not filled. Small reported numbers are OK, but any newly reported entry types should be investigated.

  2. Finally, running the graph-qa step will check a series of graphDB QA items and rank them by urgency. To run graph-qa, you will need to have an instance of neo4j running with the imported graph DB. To quickly get neo4j installed and running, a docker container is recommended. The following command can be used to get it running quickly and painlessly:

    docker run -p 7687:7687 -p 7474:7474 -e NEO4J_dbms_allow__upgrade=true -e NEO4J_dbms_allowFormatMigration=true -e NEO4J_dbms_memory_heap_max__size=XXG -v $(pwd)/target/graph.db:/var/lib/neo4j/data/databases/graph.db neo4j:3.4.9

    • Make sure that the location of the graph.db/ folder is properly specified in the last argument
    • Adjust the NEO4J_dbms_memory_heap_max__size argument so that it is appropriate for your computer/server.

To verify that the graphDb has been properly populated, open localhost:7474 in your browser once the docker container is running (the default username and password are both neo4j), and click on the database icon at the top left. A panel titled Database Information should open and display all nodes in the Data Model. If none of this appears, chances are the neo4j instance is not using the imported graphDb; verify that the graph.db/ folder is in fact populated and that its location is properly specified in the docker command's last argument.

Once you have verified that the graphDb is populated, graph-qa can be run. Build the jar file using mvn clean compile package and then follow the instructions in the repo. After graph-qa has run, it will output the number of offending instances for each QA category, ranked by urgency/priority. These results can be compared with the QA results from the previous release. QA reports can be found here.

  3. Once you are satisfied with the results from each of these steps, send the graph-qa and database-checker results to the Curator overseeing the release. Database-checker can be re-run fairly painlessly by following the instructions on its GitHub page. If the Curator is satisfied with the QA results, you can move on to one of the next steps of release; at this point in the release pipeline that could be UpdateDOIs or AddLinks.

Additional Orthoinference troubleshooting

Below are suggestions for further checking that the Orthoinference process ran correctly. They provide less information and have no specific SOP, making them optional QA processes.

  • The WebELVTool found in the CuratorTool repo can be used to check the validity of the release_current database. The WebELVTool jar is built by running an ant build on the WebELVDiagram.xml file; the jar file will appear in the WebELVTool/ folder above the CuratorTool/ folder. To run the program, navigate to the directory containing the jar file and run:
    java -jar webELVTool.jar (host) (release_current) (mysql_username) (mysql_password) (mysql_port) (reactome_author_ID)
    The output should be many lines of 'Predicting pathway' or 'Working on pathway'. If the program runs successfully, it suggests that release_current is appropriately populated.
  • The CuratorTool program (distinct from the CuratorTool repository mentioned above) can be downloaded from the Reactome website and can also be used to explore the inferred data. There isn't a recommended set of tests or checks, but familiarity with the CuratorTool can be quite useful for determining whether instances are populated correctly.
