This project is a textual analysis of the first two books of "The Hunger Games" trilogy: "The Hunger Games" and "Catching Fire". It compares the two books on word frequencies, chapter similarity, topics, and sentiment.
- The text of both "The Hunger Games" and "Catching Fire" was loaded to create a corpus for analysis.
- Dataset Description: The dataset consists of the complete text of each book.
- Normalization: Text data underwent normalization to ensure consistency.
- Tokenization: Sentences were tokenized into individual words.
- Stopword Removal: Common stopwords were removed to focus on meaningful content.
- Stemming: Words were stemmed so that inflected forms of the same word are counted together.
- Word Frequency Count: The occurrences of each word were counted in the corpus.
- Bar charts were created to visualize the frequency distribution of individual words in both books.
- Word clouds were generated to show the words shared by the two books and the words unique to each.
- A document-term matrix was created to represent the frequency of terms in each chapter.
- Distance Metrics: Various distance metrics were calculated for clustering and dendrogram creation.
- Dendrogram: A dendrogram was created to visually group chapters based on their textual similarities.
- Topic Modeling (LDA): Latent Dirichlet Allocation (LDA) was employed to identify and analyze themes in the books.
- Chapters were classified based on word occurrences using weighted binary, logarithmic, and weighted TF-IDF approaches.
- Confusion Matrix: A confusion matrix was generated to compare the performance of the three classifiers.
- Sentiment polarity indicators were created to measure the sentiment of each book.
- Bar Charts and Sentiment Analysis: Bar charts were created to compare sentiment and mood polarity across the two books.
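
The preprocessing and word-count steps listed above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the project's actual code: the stopword list is a tiny placeholder, `crude_stem` is a hypothetical stand-in for a real stemmer such as Porter's, and the sample sentence is invented for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); a real run would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "was", "were", "is", "it", "i"}


def normalize(text):
    """Lowercase the text and replace every non-letter, non-space character with a space."""
    return re.sub(r"[^a-z\s]", " ", text.lower())


def tokenize(text):
    """Split normalized text into word tokens on whitespace."""
    return text.split()


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def crude_stem(token):
    """Strip one common suffix. A hypothetical stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def word_frequencies(text):
    """Run normalize -> tokenize -> stopword removal -> stemming -> count."""
    tokens = remove_stopwords(tokenize(normalize(text)))
    return Counter(crude_stem(t) for t in tokens)


# Invented sample text, used only to exercise the pipeline.
sample = "The arena was burning. I watched the flames and the burning trees."
freqs = word_frequencies(sample)
print(freqs.most_common(3))
```

The resulting `Counter` can feed a bar chart or word cloud directly, since it maps each stem to its count.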
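
For the document-term matrix, the term weightings, and the chapter distances, a small sketch may help. It assumes that "binary" means presence or absence, "logarithmic" means `log(1 + count)`, and TF-IDF uses `idf = log(N / df)`; the project may define these differently. Cosine distance is used here as one example of a distance metric, and the three toy "chapters" are invented token lists.

```python
import math
from collections import Counter


def document_term_matrix(docs):
    """Return (vocab, rows), where rows[i][j] is the count of vocab[j] in docs[i]."""
    vocab = sorted({tok for doc in docs for tok in doc})
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocab])
    return vocab, rows


def binary_weight(rows):
    """1.0 if the term occurs in the document, else 0.0 (assumed definition)."""
    return [[1.0 if c > 0 else 0.0 for c in row] for row in rows]


def log_weight(rows):
    """Dampen raw counts with log(1 + count) (assumed definition)."""
    return [[math.log(1 + c) for c in row] for row in rows]


def tfidf_weight(rows):
    """Raw count times log(N / document frequency) (assumed definition)."""
    n_docs = len(rows)
    n_terms = len(rows[0])
    # Every vocabulary term occurs in at least one document, so df is never zero.
    df = [sum(1 for row in rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) for d in df]
    return [[c * idf[j] for j, c in enumerate(row)] for row in rows]


def cosine_distance(u, v):
    """1 minus cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


# Invented, already-preprocessed toy chapters.
chapters = [
    ["arena", "fire", "fire", "tribute"],
    ["arena", "tribute", "bow"],
    ["victor", "tour", "district"],
]
vocab, counts = document_term_matrix(chapters)
weighted = tfidf_weight(counts)

# Pairwise distance matrix between chapters, computed on raw counts.
distances = [[cosine_distance(a, b) for b in counts] for a in counts]
for row in distances:
    print([round(d, 3) for d in row])
```

A pairwise distance matrix like `distances` is the usual input to hierarchical clustering, from which a dendrogram can be drawn, for example with SciPy's `scipy.cluster.hierarchy` module.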
Feel free to explore the code, adapt it to other books or datasets, and enhance the analysis as needed. Happy exploring!