Estimation of cancer patients' survival time based on French medical reports, using Transformers
If you want to try out the model without any installation, visit the following website. It is only a demonstration site; no data is stored after a prediction finishes.
Please make sure you have Python >= 3.7. Every script has to be executed from the root of the project using the `-m` option (see the Python documentation on running modules with `-m`).
The requirements can be installed with the following command:
pip install -r requirements.txt
Before continuing, please make sure the following command runs correctly. If it runs until printing "DONE", you can move on to the next section.
python -m kmembert.training
If you want, you can install the package using pip; otherwise, move on to the next section. At the root of the project, install it via pip:
sudo pip install .
Or, in development mode:
sudo pip install -e .
The main files and folders are briefly described below; files that need no description are omitted.
.
├── data - folder containing the csv data files (details below)
│   └── ...
├── medical_voc - folder containing the medical vocabulary files
│   └── ...
├── results - folder storing results (see its own README)
│   └── ...
├── graphs - folder storing graphs and data visualizations
│   └── ...
└── kmembert - python package
    ├── models
    │   ├── conflation.py - conflation of multiple EHRs
    │   ├── health_bert.py - main transformer model
    │   ├── interface.py - model class interface
    │   ├── sanity_check.py
    │   ├── time2vec.py
    │   └── transformer_aggregator.py - transformer using multiple EHRs
    ├── preprocessing - folder containing the preprocessing scripts
    │   ├── concatenate_files.py - concatenates GR data files
    │   ├── correction.py - corrects misspellings in a dataset
    │   ├── extract_unknown_words.py - builds a medical vocabulary
    │   ├── preprocess_voc.py - cleans up the medical vocabulary
    │   ├── split_dataset.py - splits a dataset (details below)
    │   └── visualize_data.py - visualizes a dataset
    ├── config.py - class containing the config variables
    ├── baseline.py - TF-IDF baseline
    ├── dataset.py - PyTorch Dataset implementation
    ├── history.py - training of models using multiple EHRs
    ├── health_bert.py - CamemBERT implementation
    ├── hyperoptimization.py - Optuna hyperoptimization
    ├── training.py - training and validation of a model
    ├── testing.py - testing of a model
    ├── predict.py - runs a model on a few samples
    ├── preprocesser.py - preprocesses texts
    ├── utils.py - utility functions
    └── visualize_attention.ipynb - visualizes BERT attention heads
The data are stored inside `./data`. Each folder inside `./data` corresponds to a single dataset and has to contain four files. For example, the `./data/ehr` folder has the right format. For more details, see `./data/README.md`.
Several scripts are available under `./kmembert/preprocessing`, and all deal with preprocessing. The preprocessing pipeline is described below.
`concatenate_files.py`
Gets all data files on the VM and creates a uniform concatenated file from them. The final table, stored in `concatenate.txt`, is £-separated and has 9 columns: `["Noigr", "clef", "Date deces", "Date cr", "Code nature", "Nature doct", "Sce", "Contexte", "Texte"]`.
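As a quick sanity check, the resulting table can be loaded with pandas. A minimal sketch, assuming the file lives under `./data/ehr/` and has no header row (both are assumptions; adjust to your setup):

```python
import pandas as pd

cols = ["Noigr", "clef", "Date deces", "Date cr", "Code nature",
        "Nature doct", "Sce", "Contexte", "Texte"]

# The table is £-separated; the python engine handles the
# non-ASCII separator character.
df = pd.read_csv("data/ehr/concatenate.txt", sep="£", engine="python",
                 names=cols, header=None)
assert df.shape[1] == 9
```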
`split_dataset.py`
This is one of the main scripts. Assuming the `concatenate.txt` file already exists, it creates four other files from it: `train.csv`, `test.csv`, `config.json`, and `validation_split.csv`.
The data are split according to the IGR numbers, so that no IGR number from the test set can be found within the train set. We also split the data contained in `train.csv` into a train set and a validation set, again with no common patient. This split is stored in the `validation_split.csv` file and is used during training.
This is also where we compute the mean survival time, which is then stored in `config.json`.
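To make the leakage-free split concrete, here is a minimal sketch of a patient-level split using scikit-learn's `GroupShuffleSplit`, reusing the `df` from the sketch above; the 80/20 ratio is illustrative, and the actual logic lives in `kmembert/preprocessing/split_dataset.py`:

```python
from sklearn.model_selection import GroupShuffleSplit

# Split on patients (Noigr), not on rows, so that no patient
# appears in both the train set and the test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["Noigr"]))

df.iloc[train_idx].to_csv("data/ehr/train.csv", index=False)
df.iloc[test_idx].to_csv("data/ehr/test.csv", index=False)

# Sanity check: the patient sets are disjoint.
assert not set(df.iloc[train_idx]["Noigr"]) & set(df.iloc[test_idx]["Noigr"])
```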
We also perform some preprocessing in this script, notably:
- Filtering: we only keep EHRs of type "C.R. consultation". This step reduces the number of samples from 2,904,066 to 1,347,612.
- Missing values: we drop samples where any of these values is missing: `["Date deces", "Date cr", "Texte", "Noigr"]`. This step removes about 100 samples.
- Negative survival time: some EHRs are signed or completed after the decease; we drop those samples. This step removes about 5k samples.
`visualize_data.py` (optional)
Data visualizations can only be produced after the previous scripts have been executed. Given a dataset, it produces 6 plots highlighting: the number of EHRs per patient, sentences per EHR, mean tokens per sentence, tokens per EHR, known tokens per EHR, and the survival time distribution.
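For instance, the survival time distribution plot could be reproduced along these lines (a sketch assuming a `train.csv` with the columns listed earlier; the real plotting code lives in `visualize_data.py`):

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("data/ehr/train.csv")

# Survival time in days between the report date and the death date.
days = (pd.to_datetime(df["Date deces"]) - pd.to_datetime(df["Date cr"])).dt.days

plt.hist(days, bins=100)
plt.xlabel("Survival time (days)")
plt.ylabel("Number of EHRs")
plt.title("Survival time distribution")
plt.show()
```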
`extract_unknown_words.py`
Extracts unknown words and their occurrences across a whole dataset (unknown according to the CamemBERT vocabulary). It creates a JSON file under `./medical_voc`.
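The underlying check amounts to asking the CamemBERT tokenizer whether it knows a word. A rough sketch (the actual script scans every document and counts occurrences):

```python
from collections import Counter
from transformers import CamembertTokenizer

tokenizer = CamembertTokenizer.from_pretrained("camembert-base")

def is_unknown(word: str) -> bool:
    # A word is considered known if CamemBERT's SentencePiece
    # tokenizer keeps it as a single piece instead of splitting it.
    return len(tokenizer.tokenize(word)) > 1

corpus = ["adénofibrome", "chimiothérapie", "adénofibrome"]  # toy example
unknown_counts = Counter(w for w in corpus if is_unknown(w))
print(unknown_counts)
```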
`preprocess_voc.py`
Modifies a previously created medical_voc JSON file. It removes duplicates (words that exist both with and without a trailing `s`) and misspelled words.
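The duplicate-removal step is conceptually similar to the sketch below, assuming the JSON maps each word to its occurrence count (the file name and exact format are assumptions):

```python
import json

with open("medical_voc/voc.json") as f:  # hypothetical file name
    voc = dict(json.load(f))

# If a word exists both with and without a trailing "s",
# keep the singular form and merge the counts.
plurals = [w for w in voc if w.endswith("s") and w[:-1] in voc]
for word in plurals:
    voc[word[:-1]] += voc.pop(word)

with open("medical_voc/voc_clean.json", "w") as f:
    json.dump(voc, f, ensure_ascii=False)
```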
`correction.py`
Corrects a dataset using SymSpell, fixing the most common misspellings and missing accents.
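A minimal sketch of such a correction with the `symspellpy` package, assuming a frequency dictionary file with one `term count` pair per line (the actual dictionary and parameters may differ):

```python
from symspellpy import SymSpell, Verbosity

sym_spell = SymSpell(max_dictionary_edit_distance=2)
# Hypothetical dictionary file: one "term count" pair per line.
sym_spell.load_dictionary("medical_voc/frequency.txt",
                          term_index=0, count_index=1)

def correct(word: str) -> str:
    suggestions = sym_spell.lookup(word, Verbosity.CLOSEST,
                                   max_edit_distance=2,
                                   include_unknown=True)
    return suggestions[0].term  # best suggestion, or the word itself
```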
Once a clean dataset has been created as described in the previous section, one can train a model. Training fine-tunes a pre-trained CamemBERT model on a given csv dataset for a classification, regression, or density task.
It creates a folder in `./results` where results are saved; see `./results/README.md` for more details.
python -m kmembert.training <command-line-arguments>
Execute `python -m kmembert.training -h` to learn more about all the possible command-line parameters (see below).
-h, --help
    show this help message and exit
-d DATA_FOLDER, --data_folder DATA_FOLDER
    data folder name
-m {regression,density}, --mode {regression,density}
    name of the task
-b BATCH_SIZE, --batch_size BATCH_SIZE
    dataset batch size
-e EPOCHS, --epochs EPOCHS
    number of epochs
-drop DROP_RATE, --drop_rate DROP_RATE
    dropout rate. By default, None uses p=0.1
-nr NROWS, --nrows NROWS
    maximum number of samples for training and validation
-k PRINT_EVERY_K_BATCH, --print_every_k_batch PRINT_EVERY_K_BATCH
    prints the training loss every k batches
-f [FREEZE], --freeze [FREEZE]
    whether or not to freeze the BERT part
-dt DAYS_THRESHOLD, --days_threshold DAYS_THRESHOLD
    days threshold to convert into a classification task
-lr LEARNING_RATE, --learning_rate LEARNING_RATE
    model learning rate
-r_lr RATIO_LR_EMBEDDINGS, --ratio_lr_embeddings RATIO_LR_EMBEDDINGS
    ratio applied to the learning rate of the embeddings layer
-wg WEIGHT_DECAY, --weight_decay WEIGHT_DECAY
    weight decay for L2 regularization
-v VOC_FILE, --voc_file VOC_FILE
    voc file containing the vocabulary added to CamemBERT
-r RESUME, --resume RESUME
    result folder from which the saved checkpoint will be reused
-p PATIENCE, --patience PATIENCE
    number of epochs with decreasing accuracy before stopping the training
For example, the following command uses the csv files inside `./data/ehr`, the first 10,000 rows, and the density mode.
python -m kmembert.training --data_folder ehr --nrows 10000 --mode density
You can also execute the following scripts:
- `kmembert.preprocessing.<every-file>` - runs preprocessing (every file under `preprocessing` can be run)
- `kmembert.history` - runs a multi-EHR model, e.g. kmembert-T2 or kmembert-conflation
- `kmembert.testing` - tests a model
- `kmembert.hyperoptimization` - runs hyperoptimization
- `kmembert.baseline` - runs the baseline
- `kmembert.predict` - runs the model on a few samples
Each of these scripts is run using the following command:
python -m kmembert.<filename> <command-line-arguments>
Execute `python -m kmembert.<filename> -h` to learn more about all the possible command-line parameters.
Models are automatically loaded during training when the `--resume` parameter is provided. In any other script, a saved model is loaded automatically when the model instance is created, e.g.
config = ... # initialization with utils.create_session, make sure config.resume is a result folder name
model = HealthBERT(device, config)
Warning: loading a kmembert-T2 model will not load the kmembert-base model used under the hood.
You can also find details about the saved checkpoints in `./results/README.md`.
There is a bash script at the root of the project: `test.sh`. It checks that the main Python scripts run smoothly by running a series of them and verifying that none returns an error.
To run the script, execute the following command:
bash test.sh
For more exhaustive testing (it takes more time), execute the following command:
bash test.sh long