Ldaxar / DL2020


Deep Learning assignment

Deep learning assignment using text data.
Kaggle source: https://www.kaggle.com/kazanova/sentiment140

Structure

  • data: this folder stores all the external data used by the notebooks
  • models: contains the latest versions of our trained neural network models
  • pre_processing.ipynb: contains the code for pre-processing the raw Twitter data
  • BaseModels.ipynb: contains the code for creating the baseline models
  • NN.ipynb: contains the code for creating, training, and testing the neural network models

Usage

Before running any of the code, download the data files from the following links:

  1. Vectorized tweets
  2. Processed tweets but not vectorized
  3. GloVe pre-trained word embeddings (see the loading sketch below)
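
The GloVe file is plain text with one token per line followed by its vector, so it can be loaded with a few lines of Python. A minimal sketch (the file name glove.6B.100d.txt is an assumption; use whichever GloVe file you downloaded into data/):

```python
import numpy as np

def load_glove(path):
    """Read a GloVe text file into a {word: vector} dictionary."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return embeddings

# Assumed file name: substitute whichever GloVe file you downloaded into data/.
glove = load_glove("data/glove.6B.100d.txt")
print(len(glove), "words of dimension", len(next(iter(glove.values()))))
```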

Data setup

Before running experiments, place the base dataset of tweets at data/training.1600000.processed.noemoticon.csv. Once the data is in place, run all steps in the pre_processing notebook. Be careful with the last two cells of the notebook: they are extremely memory-intensive, so it is advised to run only one of the two vectorization techniques.
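
The exact cleaning steps live in pre_processing.ipynb; the sketch below only illustrates the general shape of such a pipeline (the cleaning rules shown are assumptions, not the notebook's exact code; the column names follow the usual Sentiment140 layout):

```python
import re
import pandas as pd

# Sentiment140 ships without a header row; these are its conventional column names.
cols = ["target", "id", "date", "flag", "user", "text"]
df = pd.read_csv("data/training.1600000.processed.noemoticon.csv",
                 encoding="latin-1", names=cols)

def clean(text):
    """Illustrative cleaning only: lowercase, drop URLs/mentions, keep letters."""
    text = text.lower()
    text = re.sub(r"https?://\S+|@\w+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    return " ".join(text.split())

df["clean_text"] = df["text"].apply(clean)
```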

BaseModels.ipynb

Before this notebook can be run, "vec.csv" must exist in the data folder. "vec.csv" can be downloaded from the link above or generated with pre_processing.ipynb; both hashed and non-hashed data can be obtained by following the instructions in pre_processing.ipynb. Cells should be run sequentially and will generate the result graphs shown in Fig. 1 of the report.
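
For illustration, a baseline cell might look roughly like the sketch below, assuming vec.csv holds one feature vector per tweet plus a label column named target (the column layout and the logistic-regression choice are assumptions; the notebook defines the actual baselines):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

vec = pd.read_csv("data/vec.csv")
X = vec.drop(columns=["target"])   # "target" as the label column is an assumption
y = vec["target"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train, y_train)
print("baseline accuracy:", accuracy_score(y_test, baseline.predict(X_test)))
```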

NN.ipynb

Before you can start creating, training, or evaluating a model, run the first two cells in the notebook. This runs all the imports and loads the Twitter data.

Creating and training a model

  • Run all function definitions in the cells under parts 1 and 2 (these cells contain the function definitions for splitting the data and creating models).
  • Choose the dictionary and padding size and run the preprocessing functions (these cells run the functions that split the data; for training we need the training data and targets).
  • Point 4 is divided into four parts, one per model, which can be run separately; each part contains four code cells covering the steps below (a sketch of these steps follows the list):
    1. creating the model
    2. training the model
    3. testing the model accuracy
    4. saving the model in the "models" folder
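
As an illustration only, the four steps might look roughly like the sketch below, assuming a Keras embedding + LSTM model (the framework, layer sizes, dictionary and padding sizes, and file name are all assumptions, not the notebook's exact configuration):

```python
import numpy as np
from tensorflow import keras

vocab_size, pad_len = 20000, 40   # assumed dictionary and padding sizes

# In the notebook these come from the part 1-2 preprocessing cells;
# random stand-ins are used here so the sketch runs on its own.
X_train = np.random.randint(1, vocab_size, size=(10000, pad_len))
y_train = np.random.randint(0, 2, size=(10000,))
X_test = np.random.randint(1, vocab_size, size=(2000, pad_len))
y_test = np.random.randint(0, 2, size=(2000,))

# 1. creating the model
model = keras.Sequential([
    keras.layers.Embedding(vocab_size, 100),
    keras.layers.LSTM(64),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# 2. training the model
model.fit(X_train, y_train, epochs=3, batch_size=256, validation_split=0.1)

# 3. testing the model accuracy
loss, acc = model.evaluate(X_test, y_test)
print("test accuracy:", acc)

# 4. saving the model in the "models" folder (assumed file name)
model.save("models/lstm_model.h5")
```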

Testing a model

  • Run all function definitions in the cells under parts 1 and 2 (these cells contain the function definitions for splitting the data and creating models).
  • Choose the dictionary and padding size and run the preprocessing functions (these cells run the functions that split the data; for testing we need the test data and targets).
  • Load the models from the files in the "models" folder.
  • Run the get_report function to get a summary of the accuracy, precision, recall, f1-score, and support of the model (see the sketch below).
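
A minimal sketch of that evaluation flow, assuming a Keras model saved as models/lstm_model.h5 and a get_report helper that wraps scikit-learn's classification_report (both names are assumptions):

```python
import numpy as np
from tensorflow import keras
from sklearn.metrics import classification_report

def get_report(model, X_test, y_test, threshold=0.5):
    """Print precision, recall, f1-score and support per class, plus accuracy."""
    preds = (model.predict(X_test) > threshold).astype(int).ravel()
    print(classification_report(y_test, preds))

# Assumed file name; point this at whichever model you saved or downloaded.
model = keras.models.load_model("models/lstm_model.h5")

# In the notebook X_test / y_test come from the part 1-2 preprocessing cells;
# random stand-ins keep the sketch self-contained.
X_test = np.random.randint(1, 20000, size=(2000, 40))
y_test = np.random.randint(0, 2, size=(2000,))

get_report(model, X_test, y_test)
```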
