This project segments TED Talks into clusters and performs topic modeling on each cluster to identify its main themes. The project is implemented in a Jupyter notebook.
- Python
- Jupyter
- NLTK
- Gensim
- pyLDAvis
- Seaborn
- Matplotlib
TED_Talks_Segmentation_and_Topics_Extraction.ipynb: This is the Jupyter notebook where the TED Talks are segmented and topic modeling is performed.
- Clone this repository.
- Install the dependencies.
- Open the TED_Talks_Segmentation_and_Topics_Extraction.ipynb notebook in Jupyter.
- Run all cells in the notebook.
The project follows these main steps:
- Preprocessing: The TED Talks are tokenized, filtered to remove non-alphabetic tokens, and stemmed.
- Clustering: The preprocessed TED Talks are segmented into clusters.
- Topic Modeling: Topic modeling is performed on each cluster using the LDA (Latent Dirichlet Allocation) model.
- Visualization: The topics identified by the LDA model are visualized using pyLDAvis.
The results of the topic modeling are displayed as an interactive visualization in the Jupyter notebook. Each bubble on the plot represents a topic, and the size of the bubble indicates how prevalent that topic is in the talks the model was fitted on.
Future work could involve refining the preprocessing steps, experimenting with different clustering algorithms, or trying different parameters for the LDA model.