Final Project

44599: Special Topics (Machine Learning)

For this project you will be working with a subset of a dataset generated by an online survey. This dataset will require you to use many of the tools and techniques that you've learned about through the course of the semester.

The data

This data was generated by a survey that recreates an XKCD comic survey (the data from the original survey is believed to be lost). You may find it helpful to refer to the original survey during this project.

The features of this dataset (beyond timestamp) are the questions in the survey; an individual instance is one person's response to the entire survey.

Choose your question, explore your data, and split your data

Included with this project are six CSV files; each file is a training set for one specific feature. Choose one of the following features to predict in your project:

Feature name/question	File Name
Do you have strong opinions about text editors?	`editors.csv`
Do you usually remember your dreams?	`dreams.csv`
Do you get colds often?	`colds.csv`
Do you know your Myers-Briggs type?	`type.csv`
Do you eat condiments directly out of the fridge as a snack?	`condiments.csv`
Do you spend a lot of time in the sun?	`sun.csv`

Your goal will be to train a model to predict the yes or no answer that would be provided in the survey based on the other features.

Create a notebook with a markdown cell at the top that contains your name and states which question you are answering. Open the file that corresponds to your question and explore the data: look at the description of your data, value counts, etc. You must also create a training set and a testing set; stratify the split on the feature you are predicting.
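As a rough sketch of the exploration and stratified split described above: the toy data frame stands in for one of the provided files, and `hypothetical_feature`, the 80/20 split, and the assumption that the question text is the column name are illustrative choices, not requirements.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for a provided file; in your notebook you would load the real
# data with something like pd.read_csv("dreams.csv") instead.
target = "Do you usually remember your dreams?"
data = pd.DataFrame({
    target: ["Yes", "No"] * 10,
    "hypothetical_feature": range(20),  # made-up column for illustration
})

# Quick look at the data: summary statistics and class balance.
print(data.describe(include="all"))
print(data[target].value_counts(dropna=False))

# Stratifying on the target keeps the yes/no ratio the same in both sets.
train_df, test_df = train_test_split(
    data, test_size=0.2, stratify=data[target], random_state=42
)
```

Fixing `random_state` makes the split reproducible each time the notebook is rerun.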

Create/modify your features

Much of the data in this survey is categorical and would benefit from one-hot encoding. We covered how to do this in sklearn in class. Questions that use a radio button, such as "What kind of cell phone do you have?", would benefit from this treatment.
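One possible way to one-hot encode a radio-button question with sklearn is sketched below. The answer values ("Android", "iPhone", "Other") and the `phone_` column prefix are made up for illustration; check the real answers with value counts first.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

phone_q = "What kind of cell phone do you have?"
# Hypothetical answers; the real survey options may differ.
data = pd.DataFrame({phone_q: ["Android", "iPhone", "Other", "Android"]})

# handle_unknown="ignore" encodes unseen categories as all zeros at predict time.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(data[[phone_q]]).toarray()

# Name each new column after the category it represents.
columns = [f"phone_{category}" for category in encoder.categories_[0]]
phone_df = pd.DataFrame(encoded, columns=columns, index=data.index)
data = data.join(phone_df)
```

If a column contains missing answers, you may need to fill them with a placeholder string before encoding.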

Some questions have binary answers (yes/no) but are encoded as strings. You may find it helpful to convert those to 0 and 1 in your dataframe.
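A minimal sketch of that conversion, assuming the answers are stored as the strings "Yes" and "No" (verify this with value counts on the real data); the new column name `colds_often` is only an example.

```python
import pandas as pd

colds_q = "Do you get colds often?"
# Toy data; None stands in for a skipped question.
data = pd.DataFrame({colds_q: ["Yes", "No", None, "Yes"]})

# Map the assumed string answers to integers.
yes_no = {"Yes": 1, "No": 0}
data["colds_often"] = data[colds_q].map(yes_no)
```

Any value that is not a key in the mapping, including a missing answer, becomes NaN, so decide how you want to handle those rows before training.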

Some of the questions are multi-answer questions (similar to multiple choice, but using check boxes instead of radio buttons). If we want to use one-hot encoding for these, we need to take a more manual approach.
The following code looks at the question "Which of these words do you know the meaning of?", checks whether the survey taker selected "Slickle", and creates a feature with that information.

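# Build a boolean feature: did the survey taker select "Slickle"?
slickle_in = []
for s in data['Which of these words do you know the meaning of?']:
    if isinstance(s, str):
        slickle_in.append('Slickle' in s)
    else:
        # Unanswered questions are not strings (e.g. NaN); count them as not selected.
        slickle_in.append(False)
data['slickle_meaning'] = slickle_in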

We would then go through and do the same for all the other options in the question.

You may find it useful to create a function that takes a question and a term to determine whether or not the term appears in the answer. Consult the original survey (linked above) to see the possible answers for these kinds of questions.
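Such a helper might look like the sketch below. The function name `add_term_feature`, the `knows_` column prefix, and the second term "Regolith" are illustrative assumptions; take the real option list from the original survey.

```python
import pandas as pd

def add_term_feature(df, question, term, new_column):
    """Add a boolean column that is True where `term` appears in the answer."""
    df[new_column] = [
        # Skipped questions are not strings, so they count as False.
        isinstance(answer, str) and term in answer
        for answer in df[question]
    ]
    return df

words_q = "Which of these words do you know the meaning of?"
# Toy answers; the exact way multiple selections are joined is an assumption.
data = pd.DataFrame({words_q: ["Slickle, Regolith", "Regolith", None]})

for term in ["Slickle", "Regolith"]:
    add_term_feature(data, words_q, term, f"knows_{term.lower()}")
```

Because this is a substring check, a term that is contained in another option's text would also match, so choose the terms with that in mind.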

You MUST create or modify at least one feature in the data set.

Choose and train your model

After you have modified your data, it is time to choose your algorithm and features for training. Your notebook must contain an explanation of why you chose your algorithm. Additionally, you should explain why you chose your features for training (at a minimum, you should have looked at your data). When you are happy with your model, evaluate its performance on your testing set.
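For illustration only, the fit-then-evaluate flow could look like the following. The synthetic 0/1 features and the choice of a random forest are assumptions made for this sketch, not a recommendation for any particular question.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for encoded survey features.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))
y = X[:, 0]  # the label copies the first feature so the toy problem is learnable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Touch the testing set only once you are done tuning on the training set.
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(test_accuracy)
```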

You are not limited to the models and algorithms we have discussed in class; feel free to explore what is available to you in sklearn. Staying within sklearn is the only limitation, as I don't want to install additional packages.

Explore!

Use the tools we've talked about in class! Tools such as dimensionality reduction and cross-validation (using your training set) can be useful in selecting the important features to train your model with!
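One way to combine the two tools is sketched here with synthetic data. PCA with four components, logistic regression, and five folds are arbitrary choices for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for an encoded training set (100 responses, 8 features).
rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(100, 8)).astype(float)
y_train = np.array([0, 1] * 50)

# Putting PCA inside the pipeline refits it on each training fold,
# so no information leaks from the validation fold.
model = make_pipeline(PCA(n_components=4), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X_train, y_train, cv=5)
print(scores.mean())
```

Comparing the mean score across different feature sets or component counts gives a rough way to choose between them without touching the testing set.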

BONUS

Each of the provided data sets is a training set I have generated from the original data; I have retained a testing set from the original data for each of these features.
After this project is due, I will evaluate each model that follows the required guidelines below on the test set. The person with the highest-performing model for each question will receive 5 bonus points on this project. In the unlikely event of a tie:

If the individuals with the highest-performing models used the same model (what counts as the same depends on the algorithm; to be safe, avoid having both the same algorithm and the same feature choice as someone else), no extra credit will be earned for that question.
If the individuals with the highest-performing models used different models and obtained the same performance, all of those students will get the bonus points.

Bonus guidelines

In order for your model to be considered for extra credit you must either:

provide a function called prepare_df that takes a single data frame and modifies it, adding all the features your trained model needs to predict the values, or
create a Pipeline that can be fed the pandas dataframe and ends with your learner.
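As a hedged sketch of the Pipeline option: a ColumnTransformer one-hot encodes a single question and the pipeline ends with a learner. The answer values, the logistic regression learner, and the step names are illustrative assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

PHONE_Q = "What kind of cell phone do you have?"
TARGET = "Do you get colds often?"

# Toy training data with hypothetical answers.
train_df = pd.DataFrame({
    PHONE_Q: ["Android", "iPhone", "iPhone", "Android", "Other", "Other"],
    TARGET: ["Yes", "No", "No", "Yes", "No", "Yes"],
})

pipeline = Pipeline([
    # Encode the listed column; other columns are dropped by default.
    ("encode", ColumnTransformer(
        [("phone", OneHotEncoder(handle_unknown="ignore"), [PHONE_Q])]
    )),
    ("learner", LogisticRegression()),
])
pipeline.fit(train_df[[PHONE_Q]], train_df[TARGET])

# The pipeline accepts a raw data frame, including an unseen answer.
new_df = pd.DataFrame({PHONE_Q: ["Android", "Flip phone"]})
predictions = pipeline.predict(new_df)
```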

Additionally, you must create an array of strings called prediction_features that lists the features you are using for prediction.

Due to the interesting one-hot encoding possibilities with the multi-answer questions in the survey, you may not be able to easily create a pipeline. Essentially, I must be able to load my data frame and then either predict the values using your pipeline or run:

# data frame is test_df
prepare_df(test_df)
# get the features you use; I'll grab them from your code
your_model.predict(test_df[prediction_features])
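To make the prepare_df route concrete, here is a small end-to-end sketch on toy data. The single engineered feature, the decision tree, and the toy answers are assumptions for illustration; your own prepare_df would add every feature your model needs.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

WORDS_Q = "Which of these words do you know the meaning of?"
TARGET = "Do you get colds often?"

def prepare_df(df):
    """Modify df in place, adding every engineered column the model uses."""
    df["slickle_meaning"] = [
        isinstance(s, str) and "Slickle" in s for s in df[WORDS_Q]
    ]

prediction_features = ["slickle_meaning"]

# Toy training data with hypothetical answers.
train_df = pd.DataFrame({
    WORDS_Q: ["Slickle", "Regolith", None, "Slickle, Regolith"],
    TARGET: ["Yes", "No", "No", "Yes"],
})
prepare_df(train_df)
your_model = DecisionTreeClassifier(random_state=0)
your_model.fit(train_df[prediction_features], train_df[TARGET])

# The same two calls shown above, run against an unseen data frame.
test_df = pd.DataFrame({WORDS_Q: ["Slickle", None]})
prepare_df(test_df)
predictions = your_model.predict(test_df[prediction_features])
```

Keeping all feature engineering inside prepare_df means the training and testing frames are guaranteed to go through identical steps.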
