harshildarji / DataScienceLab

Data Science Lab - SS - 2019

Dataset: https://archive.org/download/archiveteam-twitter-stream-2018-04/archiveteam-twitter-stream-2018-04.tar

  1. filter.py
    This script goes through all the JSON files in the dataset folder and stores a tweet only if it matches all of the following criteria:
    - extended_tweet is NOT null
    - lang is en (English)
    - The tweet contains at least one of the words defined in the keyWords list
    It does not store every detail of a tweet, only the features we need for our purpose:
    - Twitter User Description
    - Tweet
    This information is stored in CSV format (saved as all_data.csv).
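
The three filter criteria above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of filter.py: it assumes each input line holds one JSON tweet object, that the full text lives in `extended_tweet.full_text`, that keyword matching is a case-insensitive substring check, and that the CSV has no header row. The `KEY_WORDS` values and the helper names `select_tweet`, `filter_lines`, and `write_rows` are hypothetical.

```python
import csv
import json

# Hypothetical keywords; the real keyWords list is defined in filter.py.
KEY_WORDS = ["migration", "refugee"]


def select_tweet(tweet, key_words=KEY_WORDS):
    """Return (user description, tweet text) if the tweet passes all checks, else None."""
    extended = tweet.get("extended_tweet")
    if not extended:                      # criterion 1: extended_tweet is not null
        return None
    if tweet.get("lang") != "en":         # criterion 2: English tweets only
        return None
    text = extended.get("full_text", "")
    lowered = text.lower()
    # criterion 3: at least one keyword occurs in the text (substring match, an assumption)
    if not any(word.lower() in lowered for word in key_words):
        return None
    description = (tweet.get("user") or {}).get("description") or ""
    return description, text


def filter_lines(lines, key_words=KEY_WORDS):
    """Parse JSON lines and keep only the rows that pass select_tweet."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue                      # skip malformed lines
        if not isinstance(tweet, dict):
            continue
        row = select_tweet(tweet, key_words)
        if row is not None:
            rows.append(row)
    return rows


def write_rows(rows, path="all_data.csv"):
    """Write (description, tweet) pairs to a CSV file without a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
```

Calling `write_rows(filter_lines(open_file))` for each decompressed file in the dataset folder would then produce the two-column CSV described above.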

  2. label.py
    Since the selected tweets have to be annotated manually, this script provides a simple command-line interface to help with that.
    It presents the user with one tweet at a time (from all_data.csv, line by line), and the user enters 1 or 0, where:
    - 1: Tweet is migration relevant
    - 0: Tweet is NOT migration relevant
    Once the user presses Enter, the label is stored in train_label.csv.
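
A labelling loop of this kind might look like the sketch below. It is an assumption-laden illustration rather than the real label.py: it assumes the tweet text is the last column of each row in all_data.csv, that there is no header row, and that train_label.csv receives one label per line. The function name `annotate` and its `ask` parameter (which defaults to the built-in `input`) are hypothetical.

```python
import csv


def annotate(in_path="all_data.csv", out_path="train_label.csv", ask=input):
    """Show each tweet, read a 0/1 label, and append it to out_path. Returns the number of labels written."""
    count = 0
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "a", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if not row:
                continue                  # skip empty lines
            print(row[-1])                # assumed layout: tweet text is the last column
            label = ask("Migration relevant? (1/0): ").strip()
            while label not in ("0", "1"):
                label = ask("Please enter 1 or 0: ").strip()
            writer.writerow([label])
            dst.flush()                   # keep labels on disk if the session is interrupted
            count += 1
    return count
```

Opening the output file in append mode and flushing after every label means an interrupted session does not lose the labels already entered, although this sketch does not resume from where the previous session stopped.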

  3. annotation.ipynb
    This notebook trains classifiers on the labelled data and evaluates them.
    Pipeline (for now):
    - Import the data and remove rows with null values in any column
    - Balance the dataset using SMOTE
    - Prepare the TF-IDF and Doc2Vec feature extraction techniques
    - Feed the data and labels to both techniques, then train classifiers on the resulting feature vectors
    - Perform classification on a separate validation set
    - Print and plot the results
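
To make the TF-IDF step of the pipeline more concrete, here is a small standard-library sketch of what such a feature extractor computes. It is only an illustration: the notebook presumably relies on a library implementation, and the whitespace tokenization, the smoothed idf formula, and the L2 normalization used here are assumptions. The function name `tfidf` is hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, vectors) with one L2-normalized TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in docs]   # naive whitespace tokenizer (assumption)
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vocab = sorted(df)
    # Smoothed inverse document frequency: rarer terms get larger weights.
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1.0 for term in vocab}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vec = [counts[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(v * v for v in vec))
        vectors.append([v / norm if norm else 0.0 for v in vec])
    return vocab, vectors
```

Each resulting vector has one entry per vocabulary term, so it can be passed to a classifier together with the 0/1 labels from train_label.csv. Terms that occur in many tweets receive a lower weight than terms that occur in only a few.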

Languages

Jupyter Notebook 98.3%, Python 1.7%