AstraZeneca / verbReduce


Verb Cardinality Reduction for BioMedical Pred-Argument Graphs Extracted from Unstructured Text

Maturity level: Prototype · PyTorch Lightning · Apache License 2.0

Introduction

Predicate-argument graphs extracted from unstructured text have a high cardinality of verbs (predicates), which limits the usefulness of the graphs. In the biomedical domain in particular, there are no existing data sources that can be used to train or map verbs. The key challenge is reducing the verb count without losing information.

verbReduce does not:

  • Require an existing resource for the biomedical domain
  • Require a 'gold' verb set
  • Require a predefined number 'K' of target verbs
  • Require an evaluation dataset

Given unlabeled data, our approach produces a lookup table mapping each source verb to a target verb.
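To illustrate, such a lookup table can be applied to extracted predicate-argument triples as sketched below. The table entries and the `reduce_verbs` helper are hypothetical examples for exposition, not output of or code from verbReduce:

```python
# Illustrative verb lookup table (hypothetical entries, not real model output):
# rarely seen source verbs map to a more common target verb.
verb_map = {
    "upregulates": "increases",
    "downregulates": "decreases",
    "attenuates": "decreases",
}

def reduce_verbs(triples, verb_map):
    """Rewrite the verb of each (subject, verb, object) triple via the
    lookup table. Verbs with no mapping are kept unchanged."""
    return [(s, verb_map.get(v, v), o) for s, v, o in triples]

triples = [
    ("TP53", "upregulates", "CDKN1A"),
    ("aspirin", "attenuates", "inflammation"),
    ("geneX", "binds", "proteinY"),  # no mapping -> verb kept as-is
]
print(reduce_verbs(triples, verb_map))
```

Applying the table this way shrinks the verb vocabulary of the graph while leaving unmapped verbs untouched.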

Architecture Diagram


Setup

Run the following to set up the code:

make install-dependencies

Tests:

pytest -s

Running the code

We use several external libraries in the code, so it is useful to be familiar with how they work. The environment variables below relate to Prefect, HuggingFace Tokenizers, and Dynaconf.

Environment Variables:

  • export PREFECT_HOME=<path where you have enough space>
    • Prefect stores task outputs on local disk, so make sure to provide a path with enough free space.
  • export TOKENIZERS_PARALLELISM=false
    • This disables the warning messages thrown by HuggingFace Tokenizers.
  • export ENV_FOR_DYNACONF=default
    • This selects which environment you run from settings.local.toml (this file is not tracked by git and can vary with each local configuration). Please refer to this link for further information.
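Putting the three variables together, a typical shell setup looks like this (the `PREFECT_HOME` path below is only an example; substitute a directory on a disk with enough free space):

```shell
# Example environment setup for verbReduce; adjust paths for your machine.
export PREFECT_HOME=/data/prefect        # example path; Prefect writes task outputs here
export TOKENIZERS_PARALLELISM=false      # silence HuggingFace Tokenizers warnings
export ENV_FOR_DYNACONF=default          # use the default section of settings.local.toml
```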

Links

Understanding the code

We address the challenge in three parts:

Features to be implemented

  • Support multi-GPU training/inference (currently the code only supports one GPU)
  • Use context in verb substitution prediction
  • Deal with multi-token verbs (currently the approach only uses verbs found in the vocabulary as a single token; if a verb is split into two tokens, we ignore it)
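The single-token restriction in the last item can be sketched as follows. The vocabulary here is a toy stand-in for a real subword tokenizer's vocabulary, and `filter_single_token_verbs` is a hypothetical helper, not a function from this repository:

```python
# Toy vocabulary standing in for a subword tokenizer's vocab (illustrative only).
vocab = {"inhibits", "binds", "activates"}

def filter_single_token_verbs(verbs, vocab):
    """Keep only verbs present in the vocabulary as a single token;
    verbs that a tokenizer would split into subword pieces are ignored."""
    return [v for v in verbs if v in vocab]

verbs = ["inhibits", "phosphorylates", "binds"]
print(filter_single_token_verbs(verbs, vocab))  # "phosphorylates" is dropped
```

Removing this restriction would mean scoring verbs that span multiple subword tokens instead of skipping them.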

About

License: Apache License 2.0


Languages

Language: Python 99.8% · Makefile 0.2%