cfframe / MetadataClassifier

Part of SERUMS project (copy)

ReadMe for MetadataClassifier

Overview

This relates to WP2.

Description

This repo contains mechanisms for further training a pre-trained BERT model using health metadata and TPOLE classifications, and for using the resultant state dictionary to make a first pass at classifying new metadata. It also includes a bespoke Naïve Bayes classifier for comparison.

Metadata in this context refers to fields with descriptors and associated source name (e.g. name of a source database table or database view). The input for training is such metadata that has been pre-processed, with the final format being rows comprising category (based on TPOLE plus 'key') and processed text, separated by a comma. The 'text' value of each row should have no punctuation.
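
For illustration, a pre-processed input file might look like the rows below (the header and values are hypothetical; the actual category names come from the labels file):

category,text
person,patient first name from source person table
time,date of admission to ward
key,unique identifier for admission record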

A full list of available pre-trained BERT models can be found at: https://huggingface.co/transformers/v3.3.1/pretrained_models.html
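
The fine-tuning approach wraps a pre-trained BERT encoder with a dropout layer and a linear classification head (see Reference). A minimal sketch of such a classifier, assuming the transformers and torch packages are installed; this is illustrative only and not the exact code in this repo:

# Illustrative sketch only - not the repo's exact classifier.
from torch import nn
from transformers import BertModel

class MetadataBertClassifier(nn.Module):  # hypothetical class name
    def __init__(self, bert_model='bert-base-uncased', num_labels=6, dropout=0.5):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_model)    # pre-trained encoder (-bm)
        self.dropout = nn.Dropout(dropout)                    # dropout rate (-do)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        # Use the pooled [CLS] representation as the sequence embedding
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled = self.dropout(outputs.pooler_output)
        return self.classifier(pooled)                        # raw logits per category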

Usage

Primary scripts:

  • bert_train.py
  • bert_train_harness.py
  • bert_predict.py
  • naive_bayes.py

Examples below assume the script is being run from the main application directory.

Help text is available by running python script_name.py -h in the terminal.

bert_train.py

Trains a model against the supplied data, using the parameters below.

Arguments:

-sdd, --src_data_dir: Path to source data directory. Default: DEFAULT_SRC_DIR, as defined in the script.

-dfn, --data_file_name: Source data file name. Default: DEFAULT_DATA_FILE_NAME, as defined in the script.

-tfn, --test_file_name: Optional source test file name. Default: empty string.

-l, --labels_file_name: Source labels file name. Default: DEFAULT_LABELS_FILE_NAME, as defined in the script.

-td, --target_dir: Working directory for saving files etc. Default: parent directory of the script.

-bm, --bert_model: Pre-trained BERT model. Default: bert-base-uncased

-do, --dropout: Dropout. Default: 0.5

-lr, --learning_rate: Learning rate. Default: 1e-6

-bs, --batch_size: Batch size. Default: 5

-ep, --num_epochs: Number of training epochs. Default: 100

-d, --device: Compute device to use. Default: -1, for cpu

--save_prefix: Path prefix to save models (optional).

--to_archive: Flag to create an archive file of results. (True if flag present, default False if not).

-am, --to_archive_model: Flag to create an archive file of the model; assumes to_archive is also set. Default: False (if flag not present).

Example:

python bert_train.py -sdd .data -dfn serums_fcrb_Tokenized.csv -l serums_fcrb_labels.txt -lr 1e-6 -bs 2 -ep 3 -d 0 --save_prefix xx --to_archive --to_archive_model

The outputs include:

  • images folder, containing graphs (Category frequencies, Accuracy and Loss)
  • labels.txt - list of categories derived from the source data
  • models.txt - model definition(s)
  • test_result.txt - final result based on test data not used in the training
  • train_val_results.txt - data captured during training and used to generate the graphs
  • *_model_*.pt - trained model
  • *_state_dict_*.pt - state dictionary of the trained model

bert_train_harness.py

Harness for running bert_train.py through all permutations of the selected list-valued arguments.
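
Conceptually, the harness expands the list-valued arguments into their Cartesian product and then either prints or runs one bert_train.py command per combination. A rough sketch of that idea (illustrative only; not the repo's exact code):

# Illustrative sketch only - permute selected arguments into bert_train.py commands.
import itertools
import subprocess

data_files = ['serums_fcrb_Tokenized.csv']   # -dfn
learning_rates = ['1e-6', '1e-5']            # -lr
batch_sizes = ['2', '5']                     # -bs
to_run_sub_scripts = False                   # -r

for dfn, lr, bs in itertools.product(data_files, learning_rates, batch_sizes):
    cmd = ['python', 'bert_train.py', '-dfn', dfn, '-lr', lr, '-bs', bs]
    print(' '.join(cmd))
    if to_run_sub_scripts:
        subprocess.run(cmd)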

Arguments:

-sdd, --src_data_dir: Path to source data directory. Default: DEFAULT_SRC_DIR, as defined in the script.

-dfn, --data_file_names: List of source data file names. Default: DEFAULT_DATA_FILE_NAMES, as defined in the script.

-tfn, --test_file_names: Optional list of source test file names. If supplied, these must correspond one-to-one, and in the same order, with the source data file names. Default: empty list.

-l, --labels_file_name: List of source labels file names, corresponding to data file names. Default: DEFAULT_LABELS_FILE_NAME, as defined in the script.

-td, --target_dir: Working directory for saving files etc. Default: parent directory of the script.

-bm, --bert_model: List of pre-trained BERT models. Default: ['bert-base-uncased']

-do, --dropout: List of dropout rates. Default: [0.5]

-lr, --learning_rate: List of learning rates. Default: ['1e-6']

-bs, --batch_size: List of batch sizes. Default: ['5']

-ep, --num_epochs: List of numbers of training epochs. Default: ['100']

-d, --device: Compute device to use. Default: -1, for cpu

-r, --to_run_sub_scripts: Flag to run each sub-script; if not set, each command is just printed. Default: False (if flag not present).

--to_archive: Flag to create an archive file of results. Default: False (if flag not present).

-am, --to_archive_model: Flag to create an archive file of the model; assumes to_archive is also set. Default: False (if flag not present).

Outputs as per bert_train.py.

bert_predict.py

Classifies metadata. Requires:

  • data to be classified
  • trained model's state dictionary
  • labels list as used for training the model
  • the name of the pre-trained BERT model

The data must have 'category' and 'text' columns. It can optionally have other columns too.

Arguments:

-sdd, --src_data_dir: Path to source data directory. Default: TEST_SRC_DIR, as defined in the script.

-dfn, --data_file_name: Source data file name. Default: TEST_DATA_FILE_NAME, as defined in the script.

-l, --labels_file_name: Source labels file name, as used when training the model. Default: TEST_LABELS_FILE_NAME, as defined in the script.

-sdp, --state_dict_path: Saved state dictionary of the trained model.

-bm, --bert_model: Pre-trained BERT model used in training. Default: bert-base-uncased

-td, --target_dir: Working directory for saving files etc. Default: parent directory of this script.

-d, --device: Compute device to use. Default: -1, for cpu

--save_prefix: Path prefix to save outputs (optional).

--to_archive: Flag to create an archive file of outputs. (True if flag present, default False if not).
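
Example (file names and prefix are hypothetical):

python bert_predict.py -sdd .data -dfn new_metadata.csv -l serums_fcrb_labels.txt -sdp xx_state_dict_example.pt -bm bert-base-uncased -d 0 --save_prefix xx --to_archive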

naive_bayes.py

Naïve Bayes classifier. Takes same data file inputs as bert_train.py.

The main data file is used for building the model, and the test data file is used for validation. There is no separation between validation and test in this instance because there is no re-training of the model - the classifier is a static model. Consequently, the classifier runs once only against a specific set of data.

Laplace smoothing is used to accommodate validation data containing terms that are not in the training vocabulary, and which would otherwise be assigned a zero probability.
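
As a generic illustration of the idea (not the repo's exact implementation), add-one (Laplace) smoothing gives every word a small non-zero likelihood per class:

# Illustrative add-one (Laplace) smoothing for a Naive Bayes word likelihood.
def word_likelihood(word, class_word_counts, class_total_words, vocab_size):
    # class_word_counts: dict mapping word -> count of that word in this class.
    # Adding 1 to the count and vocab_size to the denominator means words unseen
    # during training get a small non-zero probability rather than zeroing the product.
    return (class_word_counts.get(word, 0) + 1) / (class_total_words + vocab_size)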

The outputs are training and validation accuracy information, plus the saved model vocabularies; the latter are not expected to be reused.

Arguments:

-s, --src_path: Source path for processing. Default: DATA_DIR, as defined in the script.

-od, --output_dir: Working directory for saving files etc. Default: Parent directory of the script.

-td, --train_data: Training data.

-vd, --val_data: Validation data.

-c, --classes: All classes. Default: tpole_labels.txt.

-sl, --save_label: Optional label to add to save name for easier identification.

-sm, --to_save_model: Flag for whether to save the model vocabulary.

-v, --verbose: Flag for whether to print lots of text data.
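
Example (file names are hypothetical):

python naive_bayes.py -s .data -td serums_fcrb_Tokenized.csv -vd serums_fcrb_val.csv -c tpole_labels.txt -sl fcrb -sm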

General Python project basics

Tools and technologies used:

  • PyCharm 2021.2.3 - 2021.3.2
  • python 3.8.10 - packages as listed in requirements.txt

Set up

Assumes NVIDIA CUDA 11.3 and NVIDIA cuDNN are already installed.

Python package requirements are defined in requirements.txt. We used a virtual environment for installing these to reduce the risk of package dependency issues.
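
One way to create and activate such a virtual environment (assuming a standard Python 3.8 installation):

python -m venv venv
venv\Scripts\activate

(On Linux/macOS, activate with source venv/bin/activate instead.)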

One way of installing requirements:

python -m pip install --upgrade pip
pip install -U pip setuptools wheel
pip install transformers
pip install torch==1.10.1+cu113 torchvision==0.11.2+cu113 torchaudio==0.10.1+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
pip install -r requirements.txt

For further PyTorch details and dependencies, see https://pytorch.org/get-started/locally/

Code was developed in a local repository with root at MetadataClassifier level, and pushed to a deeper GitHub repo via GitHub Desktop.

Reference

Initial example code for using BERT is from "Text Classification with BERT in PyTorch" by Ruben (https://towardsdatascience.com/text-classification-with-bert-in-pytorch-887965e5820f)
