BERT-based Biomedical Text Summarizer
- Download version 1 or version 2 of the BERT-based biomedical text summarizer.
  - Version 1 uses Euclidean distance in the clustering step.
  - Version 2 uses cosine similarity in the clustering step.
- Extract the zip file.
- Download the BERT repository from https://github.com/google-research/bert and copy its files into the BERT directory that ships with the summarizer.
- Download a BERT pretrained model from https://github.com/google-research/bert or a BioBERT pretrained model from https://github.com/naver/biobert-pretrained, and copy the model files into the same BERT directory.
- Copy your input document (preferably a .txt file) to the INPUT directory that ships with the summarizer.
- Run the following command:
  - python Summarizer.py -i INPUT_FILE_NAME -o OUTPUT_FILE_NAME -c COMPRESSION_RATE -k NUMBER_OF_CLUSTERS
- Four parameters must be specified when running the script:
  - INPUT_FILE_NAME is the name of the input file already copied to the INPUT directory.
  - OUTPUT_FILE_NAME is the name of the output file, which will be created in the OUTPUT directory and will contain the summary.
  - COMPRESSION_RATE specifies the size of the summary and takes a value in the range (0, 1); for example, 0.3 corresponds to a compression rate of 30 percent.
  - NUMBER_OF_CLUSTERS specifies the number of final clusters in the clustering step.
- When the summarization process finishes, the summary can be found in the OUTPUT directory that ships with the summarizer.
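The four command-line parameters described above can be sketched with argparse as follows. This only illustrates the documented interface and is not the code in Summarizer.py: the function names `build_parser` and `compression_rate` and the long option names are hypothetical.

```python
import argparse


def compression_rate(text):
    """Parse a compression rate and require it to lie strictly between 0 and 1."""
    value = float(text)
    if not 0.0 < value < 1.0:
        raise argparse.ArgumentTypeError("compression rate must be in the range (0, 1)")
    return value


def build_parser():
    # The short flags follow this README; the long names are hypothetical.
    parser = argparse.ArgumentParser(description="Sketch of the summarizer command line")
    parser.add_argument("-i", "--input", required=True,
                        help="file name inside the INPUT directory")
    parser.add_argument("-o", "--output", required=True,
                        help="file name to create inside the OUTPUT directory")
    parser.add_argument("-c", "--compression", required=True, type=compression_rate,
                        help="summary size, a value in the range (0, 1)")
    parser.add_argument("-k", "--clusters", required=True, type=int,
                        help="number of final clusters in the clustering step")
    return parser


# Parse the same arguments as the example command in this README.
args = build_parser().parse_args(
    ["-i", "Input.txt", "-o", "Output.txt", "-c", "0.3", "-k", "4"]
)
print(args.input, args.output, args.compression, args.clusters)
```

Validating the compression rate while parsing means an out-of-range value such as 1.5 is rejected before any model is loaded.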
Example
The following command uses the file Input.txt as input, runs the summarizer with a compression rate of 30 percent and 4 final clusters, and stores the summary in the file Output.txt:
python Summarizer.py -i Input.txt -o Output.txt -c 0.3 -k 4
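To try several cluster counts on the same document, the command above can be generated once per value of K. This is a minimal sketch: the helper `build_command` and the output naming scheme are hypothetical, and the commands are only printed here rather than executed.

```python
import shlex


def build_command(input_name, output_name, compression_rate, clusters):
    """Return the argument list for one summarizer run."""
    return [
        "python", "Summarizer.py",
        "-i", input_name,
        "-o", output_name,
        "-c", str(compression_rate),
        "-k", str(clusters),
    ]


# One command per cluster count from 2 to 12, each with its own output file.
commands = [
    build_command("Input.txt", f"Output_k{k}.txt", 0.3, k)
    for k in range(2, 13)
]

for command in commands:
    print(shlex.join(command))
```

To actually run the sweep, each argument list could be passed to `subprocess.run(command, check=True)` from the directory that contains Summarizer.py.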
Final evaluation results
Method | ROUGE-1 | ROUGE-2 |
BERT-based summarizer (BERT-large) | 0.7504 | 0.3312 |
BERT-based summarizer (BioBERT-pubmed+pmc) | 0.7411 | 0.3228 |
BERT-based summarizer (BioBERT-pubmed) | 0.7376 | 0.3203 |
CIBS biomedical summarizer | 0.7345 | 0.3187 |
BERT-based summarizer (BioBERT-pmc) | 0.7309 | 0.3164 |
Bayesian biomedical summarizer | 0.7288 | 0.3143 |
BERT-based summarizer (BERT-base) | 0.7257 | 0.3110 |
SUMMA | 0.7098 | 0.3022 |
TexLexAn | 0.6982 | 0.2979 |
Lead baseline | 0.6116 | 0.2311 |
Random baseline | 0.5667 | 0.1999 |
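ROUGE-1 and ROUGE-2 measure unigram and bigram overlap between a generated summary and a reference summary. The sketch below computes a simplified ROUGE-N recall against a single reference using whitespace tokenization. It is meant only to illustrate the metric and is not the evaluation toolkit that produced the scores above; the function names and example sentences are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n):
    """Fraction of reference n-grams that also appear in the candidate."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clip each match by the number of times the n-gram occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


reference = "the drug reduced blood pressure in patients"
candidate = "the drug reduced pressure in most patients"

rouge_1 = rouge_n_recall(candidate, reference, 1)  # 6 of 7 reference unigrams match
rouge_2 = rouge_n_recall(candidate, reference, 2)  # 3 of 6 reference bigrams match
print(f"ROUGE-1 recall: {rouge_1:.4f}")
print(f"ROUGE-2 recall: {rouge_2:.4f}")
```

Because bigrams must match in order, ROUGE-2 is always the stricter of the two, which is consistent with the lower ROUGE-2 values in the table.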
Parameterization results (Euclidean distance)
K | BERT-base R-1 | BERT-base R-2 | BERT-large R-1 | BERT-large R-2 | BioBERT-pmc R-1 | BioBERT-pmc R-2 | BioBERT-pubmed R-1 | BioBERT-pubmed R-2 | BioBERT-pubmed+pmc R-1 | BioBERT-pubmed+pmc R-2 |
2 | 0.7221 | 0.3087 | 0.7434 | 0.3264 | 0.7243 | 0.3094 | 0.7269 | 0.3122 | 0.7369 | 0.3195 |
3 | 0.7291 | 0.3133 | 0.7457 | 0.3285 | 0.7308 | 0.3172 | 0.7361 | 0.3186 | 0.7429 | 0.3265 |
4 | 0.7224 | 0.3107 | 0.7507 | 0.3329 | 0.7299 | 0.3189 | 0.7354 | 0.3187 | 0.7399 | 0.3234 |
5 | 0.7205 | 0.3114 | 0.7467 | 0.3302 | 0.7272 | 0.3138 | 0.7293 | 0.3183 | 0.7398 | 0.3229 |
6 | 0.7199 | 0.3099 | 0.7415 | 0.3249 | 0.7239 | 0.3134 | 0.7276 | 0.3146 | 0.7352 | 0.3199 |
7 | 0.7157 | 0.3075 | 0.7366 | 0.3208 | 0.7187 | 0.3097 | 0.7226 | 0.3111 | 0.7313 | 0.3170 |
8 | 0.7179 | 0.3079 | 0.7334 | 0.3183 | 0.7194 | 0.3089 | 0.7198 | 0.3074 | 0.7272 | 0.3122 |
9 | 0.7146 | 0.3084 | 0.7291 | 0.3173 | 0.7183 | 0.3099 | 0.7174 | 0.3062 | 0.7273 | 0.3087 |
10 | 0.7127 | 0.3054 | 0.7284 | 0.3137 | 0.7186 | 0.3102 | 0.7162 | 0.3036 | 0.7196 | 0.3080 |
11 | 0.7063 | 0.2990 | 0.7257 | 0.3089 | 0.7148 | 0.3161 | 0.7113 | 0.2992 | 0.7164 | 0.3027 |
12 | 0.7034 | 0.2968 | 0.7203 | 0.3101 | 0.7094 | 0.3088 | 0.7087 | 0.2995 | 0.7117 | 0.3006 |
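The best-scoring K for each model can be read off the table programmatically. The sketch below copies the R-1 values for three of the five models from the Euclidean-distance table above and reports the K with the highest ROUGE-1; the variable and function names are illustrative.

```python
# Cluster counts in the order of the table rows (K = 2 .. 12).
K_VALUES = list(range(2, 13))

# R-1 values copied from the Euclidean-distance table.
ROUGE_1 = {
    "BERT-base": [0.7221, 0.7291, 0.7224, 0.7205, 0.7199, 0.7157,
                  0.7179, 0.7146, 0.7127, 0.7063, 0.7034],
    "BERT-large": [0.7434, 0.7457, 0.7507, 0.7467, 0.7415, 0.7366,
                   0.7334, 0.7291, 0.7284, 0.7257, 0.7203],
    "BioBERT-pubmed+pmc": [0.7369, 0.7429, 0.7399, 0.7398, 0.7352, 0.7313,
                           0.7272, 0.7273, 0.7196, 0.7164, 0.7117],
}


def best_k(scores):
    """Return (K, score) for the highest score in one table column."""
    index = max(range(len(scores)), key=scores.__getitem__)
    return K_VALUES[index], scores[index]


for model, scores in ROUGE_1.items():
    k, score = best_k(scores)
    print(f"{model}: best K = {k} (R-1 = {score:.4f})")
```

For these three columns the maximum falls at K = 3 or K = 4, and the scores decline steadily as K grows beyond that.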
Parameterization results (Cosine similarity)
K | BERT-base R-1 | BERT-base R-2 | BERT-large R-1 | BERT-large R-2 | BioBERT-pmc R-1 | BioBERT-pmc R-2 | BioBERT-pubmed R-1 | BioBERT-pubmed R-2 | BioBERT-pubmed+pmc R-1 | BioBERT-pubmed+pmc R-2 |
2 | 0.7196 | 0.3092 | 0.7328 | 0.3224 | 0.7242 | 0.3117 | 0.7177 | 0.3095 | 0.7285 | 0.3163 |
3 | 0.7169 | 0.3102 | 0.7377 | 0.3275 | 0.7249 | 0.3131 | 0.7224 | 0.3089 | 0.7328 | 0.3204 |
4 | 0.7212 | 0.3107 | 0.7362 | 0.3249 | 0.7272 | 0.3107 | 0.7268 | 0.3184 | 0.7278 | 0.3202 |
5 | 0.7152 | 0.3068 | 0.7361 | 0.3259 | 0.7212 | 0.3082 | 0.7298 | 0.3165 | 0.7295 | 0.3199 |
6 | 0.7136 | 0.3026 | 0.7299 | 0.3205 | 0.7171 | 0.3071 | 0.7261 | 0.3157 | 0.7272 | 0.3160 |
7 | 0.7107 | 0.2984 | 0.7259 | 0.3162 | 0.7173 | 0.3008 | 0.7221 | 0.3126 | 0.7224 | 0.3136 |
8 | 0.7071 | 0.2988 | 0.7231 | 0.3127 | 0.7176 | 0.3049 | 0.7207 | 0.3102 | 0.7199 | 0.3135 |
9 | 0.7037 | 0.2968 | 0.7194 | 0.3094 | 0.7119 | 0.3001 | 0.7170 | 0.3072 | 0.7182 | 0.3099 |
10 | 0.6989 | 0.2917 | 0.7173 | 0.3068 | 0.7073 | 0.2965 | 0.7143 | 0.3056 | 0.7158 | 0.3074 |
11 | 0.6953 | 0.2905 | 0.7146 | 0.3046 | 0.7035 | 0.2954 | 0.7080 | 0.2986 | 0.7126 | 0.3069 |
12 | 0.6908 | 0.2879 | 0.7142 | 0.3018 | 0.6995 | 0.2882 | 0.7033 | 0.2967 | 0.7106 | 0.3034 |
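The two parameterization tables differ only in the measure used in the clustering step. As a reminder of what each measure captures, the sketch below computes both for small made-up vectors. The helper names are illustrative and are not taken from the summarizer's code, and the vectors the summarizer actually clusters have far more dimensions.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two vectors; 0 means identical."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


u = [1.0, 2.0, 2.0]
v = [2.0, 4.0, 4.0]   # same direction as u, twice the length
w = [2.0, -1.0, 0.0]  # orthogonal to u

print(euclidean_distance(u, v))  # 3.0
print(cosine_similarity(u, v))   # 1.0
print(cosine_similarity(u, w))   # 0.0
```

Euclidean distance is sensitive to vector length, so `u` and `v` are 3.0 apart even though they point the same way, whereas cosine similarity depends only on direction and treats them as identical.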