crgl / starbucks

Take-home project for Starbucks Data Science position application (from Udacity)

Starbucks Data

Regressions on Binary Variables

Here is a fairly common problem: You have available to you some intervention (say, an email or a coupon) with a small fixed cost per person, which will (ostensibly) increase the probability of a customer making a one-time purchase. There is strong evidence of the efficacy of the intervention (significantly more people make the purchase in the group that receives it), but without any targeting you're making a net loss. This works out pretty neatly for financial interventions because everything shakes out to dollars, but targeting an intervention to a subgroup where the positives outweigh the negatives is a pretty general problem. On the face of it, this is a binary classification problem. You want to classify customers into two groups: those who should receive the intervention and those who should not. However, let's look a little more at how you might make that decision.
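
To make the tradeoff concrete, here's a minimal sketch of the decision rule. The $10 purchase value and $0.15 cost per intervention are stand-ins for illustration, not numbers from the data: the intervention is only worth sending when the expected lift in purchase probability covers its cost.

```python
# Illustrative decision rule: send the intervention only when the expected
# lift in purchase probability pays for the fixed per-person cost.
# The dollar figures below are stand-ins, not numbers from the data.

PURCHASE_VALUE = 10.00      # assumed revenue from a one-time purchase
INTERVENTION_COST = 0.15    # assumed fixed cost per person contacted

def should_target(p_with, p_without,
                  value=PURCHASE_VALUE, cost=INTERVENTION_COST):
    """True if the expected incremental revenue exceeds the cost of intervening."""
    expected_lift = p_with - p_without   # change in purchase probability
    return expected_lift * value > cost

print(should_target(0.05, 0.03))  # a 2% lift pays for itself here -> True
print(should_target(0.02, 0.01))  # a 1% lift does not -> False
```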

In this case study, we are given a variety of features, a binary variable representing whether or not a promotion was introduced to the customer, and another binary variable representing whether or not that customer made a purchase. Since the features have been abstracted away to numbered columns, first we have to look at the data we have available, so some combination of the describe method and all the marginal histograms is helpful. It looks like V2 is normally distributed (so it can be scaled and centered), V3 is uniformly distributed but in a close enough range that I'm not immediately worried about scaling, and V4 through V7 ought to be categorical. I'm not as clear on V1, since it (uniquely) starts at 0, and if it's some sort of count it would be inappropriate to one-hot encode it. Luckily that's something that can be tested in a grid search later on. Unfortunately I'm (still) not able to speculate on what these features are meant to represent. Because we don't have control over how the data was collected, the next step is testing for randomization in the group selection.
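
Roughly what that first look amounts to, as a sketch. The file name and the exact Promotion/purchase column names are assumptions on my part; the V1 through V7 labels come from the data description above.

```python
import pandas as pd
import matplotlib.pyplot as plt

# File and column names are assumptions for illustration.
train = pd.read_csv("training.csv")

# Summary statistics: ranges, means, and anything that looks categorical.
print(train.describe())

# Marginal histograms make it easy to spot which features look continuous
# (V2, V3) and which look categorical (V4 through V7).
feature_cols = [f"V{i}" for i in range(1, 8)]
train[feature_cols].hist(bins=30, figsize=(12, 8))
plt.tight_layout()
plt.show()
```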

This is in general difficult, but if nothing particularly devious has been set up I'm willing to just look at all the pairwise relationships and confirm they're the same. Correlation is low between all the variables in both the promotion and no promotion groups, but one number isn't really going to cut it. There are few enough features that it's feasible to look at 2D histograms for every pair for the total population as well as the promotion and no promotion groups. No differences are apparent, so I'm ready to move on and start predicting.
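
A sketch of that check, under the same assumed file and column names as above:

```python
import itertools
import pandas as pd
import matplotlib.pyplot as plt

# Assumed file/column names, as in the sketch above.
train = pd.read_csv("training.csv")
feature_cols = [f"V{i}" for i in range(1, 8)]

promo = train[train["Promotion"] == "Yes"]
no_promo = train[train["Promotion"] == "No"]

# Pairwise correlations within each group should match under randomization.
print(promo[feature_cols].corr() - no_promo[feature_cols].corr())

# 2D histograms for every pair of features, per group, catch structure
# that a single correlation number would miss.
for label, group in [("promotion", promo), ("no promotion", no_promo)]:
    for a, b in itertools.combinations(feature_cols, 2):
        plt.hist2d(group[a], group[b], bins=20)
        plt.title(f"{a} vs {b} ({label})")
        plt.xlabel(a)
        plt.ylabel(b)
        plt.show()
```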

For a naive model, I want to select a couple of features and just filter on them to see whether it's possible to get positive net incremental revenue and to have something to compare a more advanced model to. As an easy way to pick the features, I train two linear regression models and compare the coefficients for the different features. Selecting positively on one feature and negatively on another cuts the targeted population by 2/3 and brings net incremental revenue into the black.
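
In code, that naive pass looks roughly like the following. The specific features and cutoffs in the filter, along with the $10 purchase value and $0.15 promotion cost used for net incremental revenue, are placeholders rather than the actual numbers from the notebook.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

train = pd.read_csv("training.csv")   # assumed file/column names
feature_cols = [f"V{i}" for i in range(1, 8)]
promo = train[train["Promotion"] == "Yes"]
no_promo = train[train["Promotion"] == "No"]

# One linear regression per group on the 0/1 purchase flag; comparing
# coefficients shows which features matter more when a promotion is present.
model_promo = LinearRegression().fit(promo[feature_cols], promo["purchase"])
model_ctrl = LinearRegression().fit(no_promo[feature_cols], no_promo["purchase"])
print(pd.DataFrame({"promotion": model_promo.coef_,
                    "no promotion": model_ctrl.coef_},
                   index=feature_cols))

# Naive filter on two features (the features and cutoffs are placeholders).
targeted = train[(train["V4"] > 1) & (train["V5"] < 3)]

# Net incremental revenue on the targeted subgroup: purchases in the treated
# group at an assumed $10 each, minus an assumed $0.15 per person treated,
# minus the purchases the control group made anyway.
treat = targeted[targeted["Promotion"] == "Yes"]
ctrl = targeted[targeted["Promotion"] == "No"]
nir = 10 * treat["purchase"].sum() - 0.15 * len(treat) - 10 * ctrl["purchase"].sum()
print(f"Net incremental revenue: ${nir:.2f}")
```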

Logistic regression has some real benefits when it comes to talking about probability. For one, it produces values that are valid probabilities, distributed between 0 and 1. For another, people are often not pleased with you when you try to use OLS on data where you know the residuals can't be normally distributed. However, there's something to be said for moving ahead with linear regression. We want an estimate of the percentage chance of purchase, which is equivalent to the fraction of 1s, which is equivalent to the mean of the categorical variable represented as 1s and 0s. In this case, we're actually interested in the difference in these values between two groups. Minimizing the mean squared error provides an estimate of this mean, just as minimizing log loss in logistic regression provides an estimate of the probability. Each relies on a different model of the data. I initially used linear regression more on a lark than anything, but it performs well in cross validation and ultimately on testing data. It also performs roughly comparably to selecting on predict_proba from logistic regression, so it seems like both approaches are behaving similarly.
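
Concretely, the quantity of interest is the difference between the two group models' predictions for the same customer. Here's a sketch (same assumed file and column names as above) showing the OLS version alongside the predict_proba version:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression

train = pd.read_csv("training.csv")   # assumed file/column names
feature_cols = [f"V{i}" for i in range(1, 8)]
promo = train[train["Promotion"] == "Yes"]
no_promo = train[train["Promotion"] == "No"]

# OLS on the 0/1 purchase flag: the prediction estimates the purchase rate,
# and the difference between the two group models estimates the lift.
ols_promo = LinearRegression().fit(promo[feature_cols], promo["purchase"])
ols_ctrl = LinearRegression().fit(no_promo[feature_cols], no_promo["purchase"])
ols_lift = (ols_promo.predict(train[feature_cols])
            - ols_ctrl.predict(train[feature_cols]))

# The same construction with logistic regression and predict_proba.
log_promo = LogisticRegression(max_iter=1000).fit(promo[feature_cols], promo["purchase"])
log_ctrl = LogisticRegression(max_iter=1000).fit(no_promo[feature_cols], no_promo["purchase"])
log_lift = (log_promo.predict_proba(train[feature_cols])[:, 1]
            - log_ctrl.predict_proba(train[feature_cols])[:, 1])

# Target anyone whose estimated lift covers the (assumed) $0.15 cost at $10 a purchase.
send_promotion = ols_lift > 0.15 / 10
```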

Now, some people will point out that OLS is obviously wrong, in that the underlying model can't be true. These accusations, if true, are concerning! After all, if, say, income has a linear effect on the probability of making a one-time purchase, then our model may predict that some millionaire or billionaire might make two one-time purchases! To that I can only say: ask me about my second home. Seriously, I once modeled the rate of soccer goals as a constant plus elapsed time multiplied by a constant factor. That's simplistic and clearly wrong: in reality there is a trivial maximum rate of scoring goals, while the model treats that rate as unbounded. Similarly, if you're talking about the speed of a baseball you can usually get away with ignoring relativity (and even the rotation of the Earth).

More succinctly, all models are wrong. I particularly like OLS for this because whatever the individual predictions, the difference comes out comfortably on the order of probabilities. In addition, I'm just cutting things off at a threshold! I chose my threshold based on the interpretation as a probability, but it's a flat area of the curve. I could choose it instead based on properties of the training data, and then I'm just building a small decision tree on a linear combination of features. I could do LinearSVC with a soft margin in 2D using the two models, but if a relatively simple model works well, I'm all for it. A threshold based on vaguely sound theory on the one-dimensional output of a linear model is just right.
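
If I instead wanted the threshold chosen from properties of the training data, the straightforward version is a sweep: score a held-out set with the uplift estimate, try a grid of cutoffs, and keep whichever maximizes net incremental revenue. A sketch, reusing the assumed $10/$0.15 figures and an `ols_lift`-style score like the one above:

```python
import numpy as np

def net_incremental_revenue(df, value=10.0, cost=0.15):
    """NIR of promoting to everyone in df (assumed column names and dollar figures)."""
    treat = df[df["Promotion"] == "Yes"]
    ctrl = df[df["Promotion"] == "No"]
    return value * treat["purchase"].sum() - cost * len(treat) - value * ctrl["purchase"].sum()

def best_threshold(valid, valid_score, n_grid=101):
    """Pick the cutoff on an uplift score that maximizes NIR on held-out data."""
    thresholds = np.linspace(valid_score.min(), valid_score.max(), n_grid)
    nirs = [net_incremental_revenue(valid[valid_score > t]) for t in thresholds]
    return thresholds[int(np.argmax(nirs))]
```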

In the brief time that this is up while the notebook is being cleaned, I hope no one is too bothered by this hot take on OLS. I don't really have a dog in that fight yet; I've just been surprised and pleased by how well it performed here.
