jasminsternkopf / nlp_with_disaster_tweets

Comparison of several dimension reduction methods aiming at the extraction of latent semantic information

This repository contains code comparing Latent Semantic Indexing (LSI) with a baseline model without dimension reduction and with two supervised learning methods based on LSI and Partial Least Squares: Semantic Indexing based on Partial Least Squares (SIPLS) and Local Semantic Indexing based on Partial Least Squares (LSIPLS). These dimension reduction techniques are compared on a binary classification problem posed on kaggle.com: "Real or Not? NLP with Disaster Tweets" (https://www.kaggle.com/c/nlp-getting-started/overview). Support Vector Machines are used as classifiers (except for SIPLS, which has its own classification method).
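
Below is a minimal sketch of what the LSI baseline with an SVM classifier could look like using scikit-learn; the actual preprocessing and parameters in this repository may differ:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# LSI realized as a truncated SVD of the tf-idf term-document matrix
lsi_svc = Pipeline([
    ("tfidf", TfidfVectorizer()),            # term-document matrix of the tweets
    ("lsi", TruncatedSVD(n_components=10)),  # projection into a low-dimensional latent space
    ("svc", SVC()),                          # classification in the reduced space
])

# texts: list of tweet strings, labels: binary disaster labels
# lsi_svc.fit(texts, labels)
```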

To download the corresponding training and test set, you need a Kaggle account and have to agree to the competition rules. Please do so and download the data into a subdirectory called "data". Some participants discovered that the ground truth of the test set is openly available; you can download it, for example, from notebooks in which the original labels of the test set were used (https://www.kaggle.com/szelee/a-real-disaster-leaked-label). Please place this file in your data folder as well; it should be named submission.csv.
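
Assuming the standard Kaggle file names train.csv and test.csv, the data folder can then be read, for example, like this:

```python
import pandas as pd

# Expected contents of the "data" subdirectory after downloading
train = pd.read_csv("data/train.csv")             # training tweets with labels
test = pd.read_csv("data/test.csv")               # test tweets without labels
test_labels = pd.read_csv("data/submission.csv")  # leaked ground truth of the test set
```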

The output of "main.py" consists of the training and test scores of the respective model as a function of the dimension of the space into which the data is projected by LSI, SIPLS or LSIPLS, the hyperparameters chosen by GridSearchCV for the corresponding model, and a plot of those scores. Feel free to use this code and modify it in any way you need.
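
As a rough illustration (not the repository's actual code), score curves of this kind could be produced as follows, reusing the lsi_svc pipeline sketched above; variable names such as train_texts and test_texts are placeholders:

```python
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV

dims = range(1, 16)
train_scores, test_scores = [], []
for d in dims:
    lsi_svc.set_params(lsi__n_components=d)
    search = GridSearchCV(lsi_svc, param_grid={"svc__C": [0.1, 1, 10]})
    search.fit(train_texts, train_labels)
    train_scores.append(search.score(train_texts, train_labels))
    test_scores.append(search.score(test_texts, test_labels["target"]))
    print(d, search.best_params_)  # hyperparameters chosen by GridSearchCV

plt.plot(dims, train_scores, label="training score")
plt.plot(dims, test_scores, label="test score")
plt.xlabel("projection dimension")
plt.ylabel("score")
plt.legend()
plt.show()
```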

Some constants are defined in global_parameters.py. If, for example, you want to change the maximum dimension of the projection space for which the models are computed, you can do so by changing MAX_DIM. The SVC hyperparameter grid over which GridSearchCV searches for the parameters best suited to every model that uses SVC for classification is also located there. If you want to take a look at the scores and best parameters for the models projecting into spaces of dimensions 1 to 15, you can download the pickle files in which this information is saved from https://www.dropbox.com/s/yescvgmh9hzngcg/Score%20and%20parameter%20files%20for%20dimensions%201%2C...%2C15.rar?dl=0. Save them in your data folder and execute main.py.
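
For orientation, the constants mentioned above might look roughly like the following; names other than MAX_DIM, the values, and the pickle file name are illustrative, so check global_parameters.py for the real definitions:

```python
import pickle

MAX_DIM = 15        # maximum dimension of the projection space
SVC_PARAM_GRID = {  # illustrative hyperparameter grid for GridSearchCV with SVC
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf"],
}

# Loading one of the downloaded score/parameter pickle files
# ("lsi_scores.pkl" is a hypothetical file name):
with open("data/lsi_scores.pkl", "rb") as f:
    lsi_scores = pickle.load(f)
```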
