Final Project ML

The final project is a group effort (with Chris and Paul) for a Kaggle competition.

NOTE: this is a private competition.

1 Dependencies:

    install.packages("devtools")
    devtools::install_github("DataComputing/DataComputing")

2 COMPLETED April 29: Project Overview

Note: this is a rough outline of our project. It is similar to the report but is not meant to match it exactly.

This project was completed and turned in on April 29, 2016.

2.1 Overview of Final Steps:

@Oliver - write up first complete draft of report
@Chris - check SVM on server
@Paul - check SVM on server
@Chris + @Paul - review Oliver's complete draft once ready - edit!

2.2 Detailed Goals Now:

* describe the use of randomForest + XGBoost for feature selection:

    *see the report-notes-overview.Rmd file for detailed explanation*

2.2.2 SVM: Parameter Tuning

* 5-fold CV for both linear and Gaussian kernel SVM, tuning:

    *Cost* (C) - from <code>10^-4, 10^-3, ..., 10^4</code>
    *Sigma* (only for Gaussian kernel method) - from <code>10^-3, ..., 10^3</code>

    Using *grid-search* methods through the *caret* package in *R*, we found that the optimal <strong>Cost was 0.1 for both the linear and Gaussian kernel SVM models, and the optimal Sigma for the Gaussian kernel was also 0.1</strong>. We reached this conclusion by running both SVM methods independently and by tuning on <code>0.5%, 1%, 5%, and 20%</code> splits of the original labeled data.
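
As a minimal sketch of the search space described above, the candidate grids can be built in base R. The names `cost_values`, `sigma_values`, `linear_grid`, and `gaussian_grid` are illustrative and not taken from our scripts. The commented-out caret call assumes hypothetical objects `X_sub` and `y_sub` holding a tuning subset.

```r
# Candidate values taken from the ranges listed above.
cost_values  <- 10^(-4:4)   # 1e-4, 1e-3, ..., 1e4
sigma_values <- 10^(-3:3)   # 1e-3, ..., 1e3

# Linear kernel: only Cost is tuned.
linear_grid <- expand.grid(C = cost_values)

# Gaussian kernel: every (sigma, Cost) pair is a candidate.
gaussian_grid <- expand.grid(sigma = sigma_values, C = cost_values)

nrow(linear_grid)    # number of linear-kernel candidates
nrow(gaussian_grid)  # number of Gaussian-kernel candidates

# With caret (and kernlab) installed, a grid like this could be handed to
# train() with 5-fold CV. X_sub and y_sub are hypothetical placeholders.
# ctrl <- caret::trainControl(method = "cv", number = 5)
# fit  <- caret::train(x = X_sub, y = y_sub, method = "svmRadial",
#                      tuneGrid = gaussian_grid, trControl = ctrl)
```

The full Gaussian grid is the product of the two ranges, so tuning on small subsets of the labeled data keeps the search affordable.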

2.2.3 XGBoost: Parameter Tuning

* 5-fold CV for all parameter tuning via randomization
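
A rough sketch of what randomized tuning with 5-fold CV can look like, in base R. The parameter ranges, the number of candidates, and the helper `make_folds` are assumptions made for illustration; they are not the values we used. The commented-out `xgb.cv` call assumes a hypothetical `dtrain` object (an `xgb.DMatrix`).

```r
set.seed(42)  # reproducible draws

n_candidates <- 20  # hypothetical number of random parameter sets

# Each row is one randomly drawn candidate; the ranges are illustrative only.
param_draws <- data.frame(
  eta              = runif(n_candidates, min = 0.01, max = 0.3),
  max_depth        = sample(3:10, n_candidates, replace = TRUE),
  subsample        = runif(n_candidates, min = 0.5, max = 1),
  colsample_bytree = runif(n_candidates, min = 0.5, max = 1)
)

# Assign each of n observations to one of k folds, as evenly as possible.
make_folds <- function(n, k = 5) {
  sample(rep(seq_len(k), length.out = n))
}

folds <- make_folds(100)
table(folds)  # fold sizes

# With xgboost installed, each candidate row i could be scored by 5-fold CV.
# dtrain is a hypothetical xgb.DMatrix built from the training data.
# params_i <- c(as.list(param_draws[i, ]),
#               objective = "binary:logistic", eval_metric = "auc")
# cv <- xgboost::xgb.cv(params = params_i, data = dtrain,
#                       nrounds = 100, nfold = 5)
```

The candidate with the best cross-validated score would then be refit on the full training set.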

3 Method Review and Critique

3.1 Valid Data Cleaning and Feature Extraction?

3.2 Increased Tuning and Selection

We noticed several components of our models and extraction methods involved significant assumptions about data distribution and relationships. For example, our term frequency usage involved the assumption that a single word is a significant unit of sentiment measurement in tweets. This, in turn, ignores the possibility that character or word combinations may have equal or more predictive importance. It also ignores the possibility that characters and words are not significant measurements of sentiment - either due to a lack of pattern signal in the noise or a fundamental insignificance in tweets (as mentioned above).

4 Future Work Ideas:

4.1 Ensemble Approach

Although computationally expensive, an ensemble approach to predicting the sentiment of tweets could prove extremely useful. Unfortunately, we ran out of time to attempt an ensemble. However, we sketched several ways to build one that could increase predictive accuracy and AUC and decrease the likelihood of overfitting. Here we describe three routes for next-step ensemble models:

4.1.1 SVM + Recursive Trees

This idea attempts to use SVM and recursive tree models in order to perform regression and classification in an extremely robust and non-linear fashion.

4.1.2 Deep XGBoost Model

This idea involves recursively running boosted and bagged tree models - via XGBoost - to generate probability features based on the original data (i.e. tweets). We can continue to generate second-order "artificial" feature sets to use as input for another layer of boosted and bagged trees (or forests).

4.1.3 Multi-model Combination

In order to account for the strengths of certain model components and behavior, while minimizing the negative effects of others, we can use a linear or quadratic combination method of predicted features and / or probabilities. For example, we can generate a set of XGBoost probabilities through 10 optimal parameter sets while simultaneously obtaining 10 SVM classification predictions. These 20 model results can be linearly or quadratically combined in order to produce a single, robust prediction. It should be noted that the combination process could use a number of approaches, similar to distance and clustering metrics, e.g. average (mean) probability per feature, median probability, or majority rule.
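
The simplest combination rules mentioned here (mean, median, majority rule) can be sketched in base R as follows. The probabilities are made up for illustration, and the column names `xgb_a`, `xgb_b`, and `svm_a` are hypothetical model labels. SVM class predictions could be included as 0/1 columns in the same matrix.

```r
# Rows = tweets, columns = models; values are predicted P(positive).
# These numbers are invented purely for illustration.
probs <- matrix(
  c(0.9, 0.8, 0.6,
    0.2, 0.4, 0.1,
    0.6, 0.4, 0.7),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("tweet", 1:3), c("xgb_a", "xgb_b", "svm_a"))
)

# Mean and median probability per tweet across models.
mean_prob   <- rowMeans(probs)
median_prob <- apply(probs, 1, median)

# Majority rule: each model votes "positive" if its probability exceeds 0.5,
# and the tweet is labeled positive if more than half of the models agree.
votes    <- probs > 0.5
majority <- as.integer(rowSums(votes) > ncol(probs) / 2)
```

A weighted linear combination would replace `rowMeans(probs)` with `probs %*% w` for some weight vector `w`, which would itself need to be tuned on held-out data.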

Everything below (old write-ups and code) is kept for reference / record.

3 To-Do and Predictor Ideas:

3.1 April 20 Agenda

Oliver: create script for testing feature importance and model validity / accuracy / stability (sample-r/validityTests.R)
Chris: put together standard bag-of-words model with 100k Tweets (maybe 80/20 or 60/40 train-test split)
Paul: create function(s) to add features we designed earlier to data (i.e. create predictors)

3.2 April 18

3.2.1 Predictor Ideas:

And more...

Linear combination of normalized positive and normalized negative username / hashtag occurrences
Create a few special columns for occurrence of most-negative or most-positive usernames / hashtags (see below)

For example: the top 3 usernames that appear with negative tweets may each have thousands of occurrences, while the 4th most popular negative username may have only 20. We could create 3 special columns, one per top username of negative tweets, and give each tweet a 1 in a column if that username appears in it.
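
A small sketch of this idea in base R, using a toy set of tweets. The data, the label coding (0 = negative), and the choice of the top 2 rather than the top 3 usernames are all assumptions made to keep the example short.

```r
# Toy data, invented for illustration; label 0 = negative, 1 = positive.
tweets <- c("@alice this is awful",
            "@bob so bad @alice",
            "great day @carol",
            "@bob meh",
            "@alice ugh @carol")
label <- c(0, 0, 1, 0, 0)

# Collect every @username mentioned in the negative tweets.
neg_tweets <- tweets[label == 0]
mentions   <- unlist(regmatches(neg_tweets, gregexpr("@\\w+", neg_tweets)))

# Keep the most frequent usernames (top 2 here, top 3 in the text above).
top_users <- names(sort(table(mentions), decreasing = TRUE))[1:2]

# One 0/1 column per top username: 1 if the tweet mentions that username.
indicators <- sapply(top_users, function(u) {
  as.integer(grepl(paste0(u, "\\b"), tweets, perl = TRUE))
})
indicators
```

The same pattern applies to hashtags by changing the regular expression from `@\\w+` to `#\\w+`.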

4 Organization:

4.1 README.md

This file. Description of this repository. Contains references used (links) and explanations.

4.2 description.pdf

Description of the competition from class (logistics).

4.3 data (directory = folder)

Store data given from competition website in this directory (i.e. folder).

4.4 sample-r

Sample code / script files in R. May contain the basic structure for executing / printing / formatting / etc. various machine learning and data analysis techniques. For example: the sample-cart.R file contains general code for a basic decision tree (including printing).

This directory also contains the TrainTest.RData given from the competition website. This contains a compact dataset that is a subset of the MaskedData.npz file. This compact dataset contains only 50,000 observations for training. To use:

    load("TrainTest.RData")
    # loads the objects X, y, and XTest into the workspace

4.5 data-mod-log

This directory (i.e. folder) contains data files that we have modified. In addition, it contains a markdown file that has a written description of all of the modifications / changes we have made or algorithms we have attempted (with results / insight).

5 References:

Here are various references and tutorials for machine learning algorithms, Twitter data analysis in R, and various sentiment analysis in R (and some in Python).

5.1 Text Analysis Files:

5.2 Twitter R Tutorials:

5.3 TF-IDF Tutorials:

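Term frequency - inverse document frequency (TF-IDF). A word's frequency within a document is offset by how common the word is across all documents, so words that appear everywhere are down-weighted. In effect, it measures the relative frequency of words.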
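
To make the weighting concrete, here is a minimal base R sketch on a toy corpus. It uses one common variant (term frequency normalized by document length, and `log(N / df)` for the inverse document frequency). Other variants exist, and this is not necessarily the one used in the tutorials referenced here.

```r
# Toy corpus, invented for illustration.
docs   <- c("good good day", "bad day", "good movie")
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# Term counts: rows = documents, columns = terms.
counts <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))

# Term frequency: each count divided by the length of its document.
tf <- counts / rowSums(counts)

# Inverse document frequency: log(number of docs / docs containing the term).
df  <- colSums(counts > 0)
idf <- log(nrow(counts) / df)

# TF-IDF: scale each term's column by its idf.
tfidf <- sweep(tf, 2, idf, `*`)
round(tfidf, 3)
```

In this corpus "good" and "day" each appear in two of the three documents, so they get a smaller idf than "bad" and "movie", which appear in only one.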

Thru-Echoes / ml-final-proj