Dana-Farber / MLSym

Project Pipeline

Processing
- Process label-studio annotation output for model input
Training
Inference
- Use best model for predictions
Copyright

How to process annotation output label/text for model input

Processing the label-studio output

python processing/label_output.py \
  --input {location of the label-studio output json files} \
  --label_config {configuration used to set up label-studio; xml file} \
  --label all \
  --hpi \
  --stratified_split 0.3 \
  --test

Pass either --label all or --keep goals_or care, not both.
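
As a rough illustration of what this step has to do, the sketch below turns character-offset spans (label-studio stores `start`, `end`, and `labels` for each labeled text span) into token-level BIO tags. It is not the code in processing/label_output.py: the function name `spans_to_bio`, the whitespace tokenizer, the `symptom` label, and the sample sentence are all assumptions made for the example. The script itself loads a spaCy model, presumably for tokenization.

```python
import re


def spans_to_bio(text, spans):
    """Tag whitespace tokens with BIO labels from character-offset spans.

    `spans` is a list of dicts with `start`, `end` (exclusive) and `labels`,
    mirroring the fields label-studio stores for a labeled text span.
    """
    rows = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for span in spans:
            # A token belongs to a span if the two character ranges overlap.
            if match.start() < span["end"] and match.end() > span["start"]:
                # The token that covers the span start opens it (B-),
                # later tokens continue it (I-).
                prefix = "B" if match.start() <= span["start"] else "I"
                tag = f"{prefix}-{span['labels'][0]}"
                break
        rows.append((match.group(), tag))
    return rows


# Hypothetical example: "severe nausea" occupies characters 16..29.
text = "Patient reports severe nausea today"
spans = [{"start": 16, "end": 29, "labels": ["symptom"]}]
for token, tag in spans_to_bio(text, spans):
    print(token, tag)
```

A whitespace tokenizer keeps the sketch dependency-free; a real tokenizer would also split punctuation from words, which changes where span boundaries fall.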

Without the --test argument, the data is split (stratified) into train/valid at 0.7/0.3.
With the --test argument, the data is split (stratified) into train/valid/test at 0.7/0.15/0.15.
Loading the spaCy en_core_sci_lg model takes around 17 seconds; please wait.
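
The split ratios above can be sketched with the standard library alone. This is a minimal illustration, not the script's implementation: it assumes one label per example, and it assumes that with --test the held-out 0.3 is halved into valid and test. The function name `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, valid_frac=0.3, with_test=False, seed=0):
    """Return index lists per split, keeping label proportions similar."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    splits = {"train": [], "valid": []}
    if with_test:
        splits["test"] = []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_held = round(len(indices) * valid_frac)
        held, train = indices[:n_held], indices[n_held:]
        splits["train"].extend(train)
        if with_test:
            # Assumed behaviour: the held-out part is halved into valid/test.
            half = len(held) // 2
            splits["valid"].extend(held[:half])
            splits["test"].extend(held[half:])
        else:
            splits["valid"].extend(held)
    return splits


# Toy data: 60 examples of one class, 40 of another.
labels = ["symptom"] * 60 + ["none"] * 40
two_way = stratified_split(labels)
three_way = stratified_split(labels, with_test=True)
print({name: len(idx) for name, idx in two_way.items()})
print({name: len(idx) for name, idx in three_way.items()})
```

Splitting each class separately is what makes the split stratified: every subset ends up with roughly the same class mix as the full data set.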

Training

Run the models

Transformer model choices: 'bert', 'xlnet', 'roberta', 'xlm-roberta', 'camembert', 'distilbert', 'electra'

conda activate transformers
python ner.py \
  --dset {location of the data that has been converted to CoNLL format} \
  --model_class electra \
  --pretrained_model google/electra-base-discriminator \
  --lr 6e-5 \
  --decay 0.02 \
  --warmups 500
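
For orientation, a CoNLL-style file can be read as in the sketch below. The exact column layout that ner.py expects is not stated here, so the sketch assumes the common CoNLL-2003 convention: one token per line, the tag in the last column, and a blank line between sentences. The function name `read_conll`, the `symptom` label, and the sample lines are hypothetical.

```python
def read_conll(lines):
    """Group `token ... tag` lines into sentences; blank lines end a sentence."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            # Sentence (or document) boundary: flush the tokens collected so far.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        # Assumed layout: first column is the token, last column is the tag.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


sample = """Patient O
reports O
nausea B-symptom

Denies O
chest B-symptom
pain I-symptom
""".splitlines()

for sentence in read_conll(sample):
    print(sentence)
```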

Optimize the hyperparameters

Bayesian optimization with Gaussian processes
- Please open the interactive plots (contour_plot, slice_plot, cv_plot, etc.) in a browser

python optimization.py \
  --model bert \
  --lr 1e-6 1e-4 \
  --decay 0.01 0.1 \
  --warmups 0 3000 \
  --eps 1e-9 1e-7
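
Each hyperparameter flag above takes a pair of values, read here as the lower and upper bound of the search range. The sketch below shows one way such pairs could be parsed and how a candidate could be drawn on a log scale, which suits ranges like 1e-6 to 1e-4 that span orders of magnitude. It is an assumption about the interface, not the contents of optimization.py, and it does not reproduce the Gaussian-process step; `parse_bounds` and `sample_log_uniform` are hypothetical names.

```python
import argparse
import math
import random

PARAMS = ("lr", "decay", "warmups", "eps")


def parse_bounds(argv):
    """Parse `--name LOW HIGH` pairs into a dict of (low, high) tuples."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", default="bert")
    for name in PARAMS:
        parser.add_argument(
            f"--{name}", type=float, nargs=2, required=True, metavar=("LOW", "HIGH")
        )
    args = parser.parse_args(argv)
    return {name: tuple(getattr(args, name)) for name in PARAMS}


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


bounds = parse_bounds(
    "--model bert --lr 1e-6 1e-4 --decay 0.01 0.1 --warmups 0 3000 --eps 1e-9 1e-7".split()
)
rng = random.Random(0)
candidate_lr = sample_log_uniform(*bounds["lr"], rng)
print(bounds)
print(candidate_lr)
```

Log-scale sampling only works for strictly positive bounds, so a range that starts at 0, such as the warmup range above, would need a linear scale instead.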

Load model outputs back into the server hosting label-studio (for active learning)

python processing/model_output.py \
  --model_output processing/output/symptoms_hpi_all/prediction_test.txt \
  --label_output_dir symptoms/storage/label-studio/project/completions/ \
  --label_config symptoms/storage/label-studio/project/config.xml
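
Going back from token-level predictions to label-studio means merging tagged tokens into character spans again. The sketch below shows that merge under stated assumptions: tokens are whitespace-separated, tags follow the BIO scheme, and each span is emitted as an entry whose `value` holds `start`, `end`, `text`, and `labels`. The function name `bio_to_results` and the defaults `from_name="label"` and `to_name="text"` are hypothetical; the real names have to match the ones in the label config. The layout of prediction_test.txt is not described here, so the input is a plain list of tags.

```python
import re


def bio_to_results(text, tags, from_name="label", to_name="text"):
    """Merge BIO-tagged whitespace tokens into label-studio style span entries."""
    spans, current = [], None
    for match, tag in zip(re.finditer(r"\S+", text), tags):
        if tag.startswith("I-") and current and current["labels"] == [tag[2:]]:
            current["end"] = match.end()  # extend the open span
            continue
        if current:
            spans.append(current)
            current = None
        if tag != "O":
            # A B- tag (or a stray I- tag) opens a new span.
            current = {"start": match.start(), "end": match.end(), "labels": [tag[2:]]}
    if current:
        spans.append(current)
    return [
        {
            "from_name": from_name,
            "to_name": to_name,
            "type": "labels",
            "value": {**span, "text": text[span["start"]:span["end"]]},
        }
        for span in spans
    ]


text = "Patient reports severe nausea today"
tags = ["O", "O", "B-symptom", "I-symptom", "O"]
results = bio_to_results(text, tags)
print(results)
```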

Inference

Use raw CSV files with a column containing the clinical note; there is no need to convert them to CoNLL format.

python inference/run_and_predict.py \
  -ipf {location of the input file} \
  -opf {location of dummy output file} \
  -cn {name of the column containing the clinical note}
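
To check that an input file has the expected shape before running inference, the note column can be pulled out with the standard library, as in the sketch below. It is only an illustration of the -cn idea, not the loader used by run_and_predict.py; `read_notes`, the column name `note`, and the sample rows are hypothetical.

```python
import csv
import io


def read_notes(handle, column):
    """Return the text of `column` for every row of a CSV file object."""
    reader = csv.DictReader(handle)
    if column not in (reader.fieldnames or []):
        raise KeyError(f"column {column!r} not found; available: {reader.fieldnames}")
    return [row[column] for row in reader]


# In-memory stand-in for a raw CSV file with a clinical-note column.
sample = io.StringIO('id,note\n1,"Patient reports nausea, no fever."\n2,Denies pain.\n')
notes = read_notes(sample, "note")
print(notes)
```

Using `csv.DictReader` rather than splitting on commas matters here because clinical notes often contain commas and quotes of their own.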

Copyright

All code is modified from

About

GNU General Public License v2.0
