Source code to train and validate two text classifier models, one based on FastText and one a deep learning model using a 1-D convolutional neural network (CNN) with pretrained word embeddings, on the corpus of cancer clinical trial protocols published on clinicaltrials.gov. The models classify short free-text sentences (describing clinical information such as medical history, concomitant medication, type and features of tumor, cancer therapy, etc.) as eligible or not eligible criteria for volunteering in clinical trials.
Both models are evaluated using 5-fold cross-validation on incremental sample sizes (1K, 10K, 100K, 1M samples) and on the largest balanced set (obtained by undersampling), with 4.01 M classified samples available (from a total of 6 M samples).
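The evaluation scheme described above (undersampling to a balanced set, then 5-fold cross-validation on growing sample sizes) can be illustrated with a minimal standard-library sketch. This is not the code from the repository's scripts: the function names `undersample` and `k_fold_indices`, the `(label, text)` pair layout, and the toy labels are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def undersample(samples, seed=0):
    """Return a class-balanced subset by randomly dropping majority-class rows.

    `samples` is a list of (label, text) pairs; this layout is an
    assumption made for the sketch.
    """
    by_label = defaultdict(list)
    for label, text in samples:
        by_label[label].append((label, text))
    n_min = min(len(rows) for rows in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_label):
        # Keep exactly n_min rows of every class.
        balanced.extend(rng.sample(by_label[label], n_min))
    rng.shuffle(balanced)
    return balanced


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx


if __name__ == "__main__":
    # Toy data; "0"/"1" are placeholder labels, not the dataset's real label strings.
    data = [("0", "no prior chemotherapy")] * 30 + [("1", "active brain metastases")] * 70
    balanced = undersample(data)
    for size in (10, 40, len(balanced)):  # incremental sample sizes
        subset = balanced[:size]
        for train_idx, val_idx in k_fold_indices(len(subset), k=5):
            pass  # train on subset[train_idx], evaluate on subset[val_idx]
        print(size, "samples:", len(train_idx), "train /", len(val_idx), "validation")
```

Each fold serves once as the validation set while the remaining four are used for training, so every sample is validated exactly once per sample size.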
Read original paper: Bustos, A.; Pertusa, A. Learning Eligibility in Cancer Clinical Trials Using Deep Neural Networks. Appl. Sci. 2018, 8, 1206. www.mdpi.com/2076-3417/8/7/1206
gensim 0.13.4.1
h5py 2.6.0
Keras 1.2.2
matplotlib 2.0.0
num2words 0.5.4
numpy 1.12.0
pandas 0.19.2
protobuf 3.2.0
pyparsing 2.1.10
python 3.6.0
readline 6.2
scikit-learn 0.18.1
scipy 0.18.1
sklearn 0.0
tensorflow-gpu 1.0.0 (note: GPU installation is optional, but highly recommended to train the CNN model in less than 5 hours)
Note: A pregenerated subsample file with 1M samples (186 MB) is available for download at https://www.kaggle.com/auriml/eligibilityforcancerclinicaltrials
If the pregenerated subsample dataset is used, rename it, save it as './textData/labeledEligibility.csv', and proceed directly to step 2. Otherwise, to build a new dataset from scratch, proceed to step 0.
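Placing the downloaded file at the expected location could be scripted as in the sketch below. The helper `install_dataset` is hypothetical and not part of this repository; only the target path './textData/labeledEligibility.csv' comes from this README, and the name of the downloaded file is left for the user to supply.

```python
import shutil
from pathlib import Path

# Location named in this README for the labeled dataset.
TARGET = Path("textData") / "labeledEligibility.csv"


def install_dataset(downloaded, target=TARGET):
    """Copy a downloaded CSV to the expected location and return that path."""
    downloaded = Path(downloaded)
    if not downloaded.is_file():
        raise FileNotFoundError(f"Downloaded file not found: {downloaded}")
    target.parent.mkdir(parents=True, exist_ok=True)  # create ./textData if missing
    shutil.copyfile(str(downloaded), str(target))
    return target
```

Calling `install_dataset('<pathTo>/<downloadedFile>.csv')` from the repository root copies the file rather than moving it, so the original download stays untouched.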
https://clinicaltrials.gov/ct2/results?term=neoplasm&type=Intr&show_down=Y
Size of folder "search_result": 1.28 GB
python preprocessor.py -b '<pathTo>/search_result/'
python preprocessor.py -l
python preprocessor.py -w
python fasttext_word_embeddings.py
python gensim_word_embeddings.py -i '<pathTo>/search_result/'
To visualize them in TensorBoard (https://www.tensorflow.org/versions/master/how_tos/embedding_viz/), execute this script to produce the files in tensor format:
python word2vec2tensor.py --input wordEmbeddings/vectorsGensim_cbow.bin --output word2vec2tensor
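The preprocessing and word-embedding commands listed above could be chained in a small driver script. The sketch below is an assumption, not a script shipped with the repository: `build_commands` and `run_pipeline` are hypothetical names, the commands are simply replayed in the order they appear in this README, and whether every step is needed for a given run is not stated here.

```python
import subprocess
import sys


def build_commands(search_result_dir):
    """Return the preprocessing and embedding commands in README order."""
    py = sys.executable  # reuse the interpreter that runs this driver
    return [
        [py, "preprocessor.py", "-b", search_result_dir],
        [py, "preprocessor.py", "-l"],
        [py, "preprocessor.py", "-w"],
        [py, "fasttext_word_embeddings.py"],
        [py, "gensim_word_embeddings.py", "-i", search_result_dir],
    ]


def run_pipeline(search_result_dir, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_commands(search_result_dir):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step


if __name__ == "__main__":
    # Dry run by default: only prints the commands that would be executed.
    run_pipeline("<pathTo>/search_result/")
```

The dry-run default makes it possible to check the command sequence before starting the long-running steps; pass `dry_run=False` to actually execute them.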
Make sure the file ./textData/labeledEligibility.csv exists.
Execute:
python fasttext_text_classifier.py
The learning curves are saved in ./Learning_Curves_FastText_Classifier_plot.png
Make sure the file ./textData/labeledEligibility.csv exists.
With tensorflow-gpu, training takes approximately 5 hours.
python text_classifier.py
The learning curves are saved in ./Learning_Curves_CNN_Classifier_plot.png