emelalkim/radiology-report-clustering

Radiology Report Clustering

This project aims to cluster de-identified radiology reports based on their similarity. Report similarity is defined in terms of their clinical findings. The main objective is to group similar reports together for further analytics, such as understanding the mix of findings for an individual radiologist or a practice. The project utilizes Natural Language Processing (NLP) techniques to preprocess the reports, compute similarities, and apply clustering algorithms.

Installation
Usage
Project Structure
Approach
Results
Contributing
License

Installation

Clone the Repository:

git clone https://github.com/emelalkim/radiology-report-clustering.git
cd radiology-report-clustering

Install Dependencies:

pip install pandas scikit-learn nltk matplotlib pandas

or (for MAC python 3)

pip3 install pandas scikit-learn nltk matplotlib pandas

Ensure RadLex Terms CSV is Available: Make sure the RADLEX_termsonly.csv file is in the project directory.

Usage

Prepare the Data: Ensure you have the de-identified radiology reports in a CSV file (e.g., reports.csv). Each report should have a clean_report column containing the report text.
Run the Script:
```
python similarity_analysis.py
```
Outputs:
- Clustering results: clustering_results_dbscan.csv
- Cluster summary: cluster_summary_dbscan.txt

Project Structure

similarity_analysis.py: Main script to preprocess reports, compute similarities, find optimal DBSCAN parameters, and perform clustering.
RADLEX_termsonly.csv: CSV file containing RadLex terms and synonyms for medical term normalization.

Approach

Data Preparation:
- Extract the FINDINGS section from each radiology report.
- Normalize medical terms using RadLex dictionary.
- Tag measurements and preprocess text to remove stopwords and special characters.
Feature Extraction:
- Compute TF-IDF vectors for the preprocessed findings sections.
Similarity Calculation:
- Calculate the cosine distance matrix between TF-IDF vectors.
Parameter Optimization:
- Plot the k-distance graph to find the optimal epsilon value for DBSCAN. This method can be used to plot the graph to see where thw knee is.
Clustering:
- Apply DBSCAN using the optimal epsilon value and cosine distance matrix.
Results Analysis:
- Generate and save clustering results and summaries.
- Display samples of the most similar and most dissimilar reports.

Results

The script outputs the clustering results in clustering_results_dbscan.csv, showing which cluster each report belongs to.
A summary of each cluster's top keywords is saved in cluster_summary_dbscan.txt.
The script also prints the most similar and most dissimilar reports based on cosine similarity.

Contributing

Contributions are welcome! Please fork the repository and submit a pull request with your improvements.

License

This project is licensed under the MIT License. See the LICENSE file for details.

emelalkim / radiology-report-clustering