kaggle nlp text-classification universal-sentence-encoder text tensorflow deep-learning natural-language-processing

Medical Abstract Segmentation

Overview

Abstracts from medical research papers can be hard to read at a glance: they pack complex wording into a single dense paragraph. What if these abstracts could be segmented so that they are easier to skim?

The purpose of this notebook is to explore building a Natural Language Processing (NLP) model with TensorFlow that segments the text lines of abstracts from medical research papers in order to improve their readability. The model is built under two constraints: compute efficiency (CPU-only) and low implementation complexity.

The dataset used to train the NLP model comes from the paper "PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts", published in October 2017.

The NLP model architecture used in this notebook is inspired by the paper "Neural Networks for Joint Sentence Classification in Medical Paper Abstracts" (also cited in the dataset paper), published in December 2016. The model implemented in this notebook aims to reproduce results similar to those reported in that paper.

Dataset paper: PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts

Model architecture paper: Neural Networks for Joint Sentence Classification in Medical Paper Abstracts

Dataset: PubMed 200k RCT

The dataset is available on GitHub at the link below.

GitHub Source: https://github.com/Franck-Dernoncourt/pubmed-rct
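For readers who want to work with the raw text files directly, here is a minimal parsing sketch. It assumes a layout in which each abstract starts with a `###<id>` line, each following line holds a label and a sentence separated by a tab, and a blank line ends the abstract. This layout, the function name `parse_abstracts`, the record keys, and the sample text are all assumptions for illustration and are not taken from the notebook, so check them against the files you download.

```python
from typing import Dict, List, Tuple


def _flush(abstract_id: str,
           lines: List[Tuple[str, str]],
           records: List[Dict[str, object]]) -> None:
    """Append one record per sentence of a finished abstract."""
    for position, (label, sentence) in enumerate(lines):
        records.append({
            "abstract_id": abstract_id,
            "label": label,
            "text": sentence,
            "line_number": position,    # 0-based position in the abstract
            "total_lines": len(lines),  # number of sentences in the abstract
        })


def parse_abstracts(raw_text: str) -> List[Dict[str, object]]:
    """Parse text in the assumed '###id' / 'LABEL<TAB>sentence' layout."""
    records: List[Dict[str, object]] = []
    current: List[Tuple[str, str]] = []
    abstract_id = ""

    for raw_line in raw_text.splitlines():
        line = raw_line.strip()
        if line.startswith("###"):
            # A new abstract begins: store the previous one, if any.
            _flush(abstract_id, current, records)
            current = []
            abstract_id = line[3:]
        elif not line:
            # A blank line closes the current abstract.
            _flush(abstract_id, current, records)
            current = []
        elif "\t" in line:
            label, sentence = line.split("\t", 1)
            current.append((label, sentence))

    # Store the last abstract when the text does not end with a blank line.
    _flush(abstract_id, current, records)
    return records


# Made-up sample in the assumed layout (not real dataset content).
sample = (
    "###0001\n"
    "OBJECTIVE\tTo test a hypothetical treatment .\n"
    "METHODS\tPatients were randomized into two groups .\n"
    "RESULTS\tSymptoms improved in the treatment group .\n"
    "\n"
    "###0002\n"
    "BACKGROUND\tA second made-up abstract .\n"
    "CONCLUSIONS\tIt has two sentences .\n"
)

records = parse_abstracts(sample)
for record in records:
    print(record["abstract_id"], record["line_number"], record["label"])
```

Keeping `line_number` and `total_lines` next to each sentence is a design choice for this sketch: it preserves where a sentence sits in its abstract, which a sentence classifier may be able to use as an extra feature.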

Using the Kaggle version of the dataset

I have uploaded the dataset here on Kaggle to make it more accessible for notebook usage.

Here's the link to the Kaggle dataset: PubMed 200k RCT

Note that the Kaggle version also includes .csv versions of the original dataset.

Requirements

Recommended Python version: Python 3.8.10+ (64-bit)

[Note: This section will be updated in due course.]

Project Structure

.
├── LICENSE
├── README.md
└── nlp_medical_abstract_segmentation.ipynb

LICENSE | project license (MIT)
README.md | project readme file
nlp_medical_abstract_segmentation.ipynb | project notebook

Usage

See nlp_medical_abstract_segmentation.ipynb

License

This project is licensed under the terms and conditions of the MIT license.

About

A Natural Language Processing (NLP) model with TensorFlow to segment text lines of abstracts from medical research papers in order to improve readability.

https://www.kaggle.com/code/matthewjansen/nlp-medical-abstract-segmentation/notebook

