---
title: "Chapter 1 and 2: Introduction and Statistical Learning"
author: "Andrew Andrade"
date: "2/19/2017"
---
Let's start off with an analogy:
Probability is starting with an animal and figuring out what footprints it will make.
Statistics is seeing a footprint and guessing the animal.
Statistical learning is the study of seeing footprints, guessing animals, verifying the guesses and understanding why they made those types of footprints.
This guide will focus just on statistical learning: modeling real-world phenomena, both predicting "what" is going to happen based on past observations and using inference to explain the "how" and the "why". Let's start with a toy example.
For a simple example, let's say we want to predict the mark someone would get on a test in a specific course. To do this we use the data available to us: the number of visits each student made to the course website, how long they spent on the site, and their mark on the test. How do we represent this problem?
The measured value we are trying to estimate, in this case the mark on the test, is a continuous value from 0-100%. We are using 2 types of recorded measurements/inputs: the number of visits to the website (a count) and the duration of time spent studying (a continuous value). We have 10 students in our class who are taking the test.

There is also going to be some error in the prediction. For example, we won't be able to estimate a student's mark perfectly just by measuring logins: people logged onto the site might not have been studying the whole time (they might have had the TV on in the background, for example). If the measurements don't represent reality, there is a flaw in the measurement, which leads to error. Who knows, the system running the course website might even have a bug and incorrectly count logins or time spent browsing. Potentially more important, there are many other factors which impact a student's mark that are not (and possibly cannot be) measured: their skill level, how much sleep they got, how much they dislike their teacher, etc. The point is, there is going to be some error irrespective of what data you collect, how you collect it, or what function you choose.
Now, let's formulate this information in math notation.
We want to predict $Y$ using a function, which we will call $f()$, of inputs $X$:

$$Y = f(X) + \epsilon$$

In this representation of the world, the thing we are trying to predict ($Y$) is a function ($f()$) of some measured variables ($X$) plus some random error ($\epsilon$).
Now we can think of $X$ as the matrix of observations, with $n$ as the number of observations and $p$ as the number of types of inputs/measurements (commonly known as feature variables or features). In our example $n = 10$, since we have 10 students, and $p = 2$, since we have 2 predictors/inputs/features (number of logins and time spent studying).
For readers who are unfamiliar with matrices, it is useful to visualize X as a spreadsheet of numbers with n rows and p columns.
$Y$, the response, can be thought of as a vector (a matrix with 1 column) with a measured response for each observation.
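To make this concrete, here is a minimal sketch in R of what $X$ and $Y$ could look like for the toy example (all the numbers below are invented for illustration):

```r
# n = 10 students (rows), p = 2 features (columns); values are made up
logins <- c(5, 12, 8, 3, 15, 7, 9, 2, 11, 6)                  # visits to the site
hours  <- c(1.5, 4.0, 2.5, 0.5, 5.0, 2.0, 3.0, 0.2, 3.5, 1.8) # time spent studying
X <- cbind(logins, hours)                       # the n x p input matrix
Y <- c(62, 85, 70, 48, 91, 66, 75, 40, 82, 64)  # response vector: marks in %

dim(X)     # 10 2  -> n = 10 observations, p = 2 features
length(Y)  # 10    -> one measured response per observation
```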
Statistical learning is focused on trying to estimate that function $f()$ which maps the inputs to the output. We will soon learn how to do this, but first: why estimate $f()$?
We want to estimate $f()$ for two reasons: (1) prediction and (2) inference.

Since we don't have information on the full population of students, we have to estimate what $f()$ looks like based on our observations. Our goal is to estimate the mark that anyone would get on the test, given information from only the 10 students who took the class. You can imagine that in the real world we would want a larger sample size (a larger number of observations $n$) and more features/predictors ($p$) which better represent what the mark on the test will be.
Since the function we are estimating is based on a sample rather than the actual full population of people who could take the class (we don't have data on future enrollment and performance), we use a hat ($\hat{f}$) to indicate that it is an estimate. Our prediction is then $\hat{Y} = \hat{f}(X)$.

Generally, most people treat $\hat{f}()$ as a black box: we are not typically concerned with its exact form, provided that it yields accurate predictions for $Y$.
"Goodness" and "better" are relative terms, so the first thing to do in prediction after determining what is being measured as the response, our inputs or feature variables (commonly known as features)
For numerical prediction (called regression), to test how well the model fits observations, we can take the difference between the predicted response ($\hat{Y}$) and what actually happened ($Y$): $\text{error} = \hat{Y} - Y$.
This isn't the best way to measure incorrectness. Here is an example: let's say we guessed that a student would get 100% on a test but they got 90%. We were wrong by 10% (+10% according to the error formula). Similarly, we guessed a student would get 50% but they actually got 60% (-10% according to the error formula). Here too we were wrong by 10%, yet the average error would be 0% ((+10% − 10%) / 2 = 0%), making it look as if our predictions were perfect.
To fix this we want to use the absolute error (make the error positive whenever we are wrong). Another thing we want to do is penalize larger errors: for example, estimating a mark of 50% when the student gets a 60% (a difference of 10%) is worse than estimating 55% (a difference of 5%). Let's call the function that tells us how "bad" our model's estimation is, on average, the expected value function ($E()$). A simple way to make the expected value of the error (the difference between our prediction and the actual measurement) positive, and to penalize larger errors more, is to take the square of the error (also called quadratic loss).
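A quick sketch in R of why the raw (signed) error is misleading, using the two guesses from the example above:

```r
predicted <- c(100, 50)       # our guesses
actual    <- c(90, 60)        # the marks the students actually got
errors <- predicted - actual  # +10 and -10, as in the example

mean(errors)      # 0   -> signed errors cancel, falsely suggesting a perfect model
mean(abs(errors)) # 10  -> absolute error exposes that we were off by 10 on average
mean(errors^2)    # 100 -> squared error is positive AND penalizes big misses more
```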
Let's describe that in math speak. Since we know the formula for $Y$ ($Y = f(X) + \epsilon$) and our prediction is $\hat{Y} = \hat{f}(X)$, the expected squared error can be split into two parts:

$$E(Y - \hat{Y})^2 = E[f(X) + \epsilon - \hat{f}(X)]^2 = \underbrace{[f(X) - \hat{f}(X)]^2}_{\text{reducible}} + \underbrace{\mathrm{Var}(\epsilon)}_{\text{irreducible}}$$

The accuracy of our estimation model $\hat{Y}$ as a prediction for $Y$ therefore depends on two quantities: the reducible error and the irreducible error. The reducible error is reduced by choosing a more appropriate model for $\hat{f}$; the irreducible error comes from $\epsilon$, which by definition cannot be predicted using $X$, no matter how well we estimate $f$.
The focus of statistical learning is on techniques for estimating f with the aim of minimizing the reducible error. It is important to keep in mind that the irreducible error will always provide an upper bound on the accuracy of our prediction for Y. This bound is almost always unknown in practice.
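We can see this bound in a small simulation. Here is a sketch in R (the true $f$ and the noise level are invented for illustration): even when $\hat{f}$ recovers the true function almost exactly, the error on fresh data cannot drop below $\mathrm{Var}(\epsilon)$.

```r
set.seed(1)
n   <- 1000
x   <- runif(n, 0, 10)
eps <- rnorm(n, sd = 2)  # irreducible error: Var(eps) = 4
y   <- 3 + 2 * x + eps   # true f(x) = 3 + 2x

fit <- lm(y ~ x)         # f_hat: nearly recovers the true coefficients

# Fresh data drawn from the same process:
x_new <- runif(n, 0, 10)
y_new <- 3 + 2 * x_new + rnorm(n, sd = 2)
mean((y_new - predict(fit, data.frame(x = x_new)))^2)  # ~4: the Var(eps) floor
```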
The second reason we are interested in estimating $f()$ is inference: understanding how the response we are measuring ($Y$) is affected by the values of our inputs/features ($X_1, \ldots, X_p$).
We want to answer:
- Which predictors are associated with the response?
- What is the relationship between the response and each predictor?
- Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?
For example, in advertising:
- Which media contribute to sales?
- Which media generate the biggest boost in sales? or
- How much increase in sales is associated with a given increase in TV advertising?
Finally, some modeling could be conducted both for prediction and inference. For example, in a real estate setting, one may seek to relate values of homes to inputs such as crime rate, zoning, distance from a river, air quality, schools, income level of the community, the size of houses, and so forth. In this case, one might be interested in how the individual input variables affect the prices: that is, how much extra will a house be worth if it has a view of the river? This is an inference problem. Alternatively, one may simply be interested in predicting the value of a home given its characteristics: is this house under- or over-valued? This is a prediction problem.
The first step in estimating $f()$ is setting up a set of training data. These observations are called the training data because we will use them to train, or teach, our method how to estimate $f()$. Our goal is to apply a statistical learning method to the training data in order to estimate the unknown function $f()$. In other words, we want to find a function estimate $\hat{f}$ such that $Y \approx \hat{f}(X)$ for any observation $(X, Y)$.
From Professor Pedro Domingos's *A Few Useful Things to Know about Machine Learning*:
> This is because, no matter how much data we have, it is very unlikely that we will see those exact examples again in the future. The most common mistake among beginners is to test on the training data and have the illusion of success. If the chosen model is then tested on new data, it is often no better than random guessing. So, if you hire someone to estimate a model, be sure to keep some of the data to yourself and test the model they give you on it. Conversely, if you've been hired to build a model, set some of the data aside from the beginning, and only use it to test your chosen model at the very end, followed by learning your final model on the whole data.
We will learn the details of splitting training and test data later, but it is very important that you never train and evaluate models on the same full dataset, as this can lead to models which do not reflect reality. Since there will always be some error, it is good to keep in mind that there will be error on both the training and the testing of the models. The important thing is that the error on the training set and the testing set be similar: it is better to have a model which is correct 75% of the time on both the training set and the testing set than a model which is 100% correct on the training set and 60% correct on the test set.
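Here is a minimal sketch of such a split in base R (the data frame below is simulated just to keep the example self-contained):

```r
set.seed(42)
# Simulated stand-in for a larger version of our student dataset
students <- data.frame(logins = rpois(100, lambda = 8),
                       hours  = runif(100, 0, 5))
students$mark <- 40 + 2 * students$logins + 6 * students$hours + rnorm(100, sd = 5)

train_idx <- sample(nrow(students), size = 70)  # hold out 30% for testing
train <- students[train_idx, ]
test  <- students[-train_idx, ]

fit <- lm(mark ~ logins + hours, data = train)  # fit ONLY on the training data

train_mse <- mean((train$mark - predict(fit, train))^2)
test_mse  <- mean((test$mark  - predict(fit, test))^2)
train_mse; test_mse  # similar values suggest the model generalizes
```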
The general approach to model estimation is:
- Representation: the model must be represented in some formal language that the computer can handle.
- Evaluation: an evaluation function (also called an objective function or scoring function) is needed to distinguish "good" models from bad ones.
- Selection: Finally, we need a method to search among the models for the highest-scoring one based on the test data (held out during training).
Let's tackle representation first. Broadly speaking, most statistical learning methods for this task can be represented as either a parametric or non-parametric model.
Parametric methods involve a two-step model-based approach.
- Functional form: We make an assumption about the functional form, or shape, of f().
The simplest form of a model is a linear one (described extensively in Chapter 3):

$$f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p$$

This model is defined by the parameters $\beta_0, \beta_1, \ldots, \beta_p$.
- Fitting: After a model has been selected, we need a procedure that uses the training data to fit or train the model. In the case of the linear model, we need to estimate the parameters $\beta_0, \beta_1, \ldots, \beta_p$ such that the linear model is approximately equal to our measured output: $Y \approx \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p$.
The most common approach to fitting the linear model is referred to as (ordinary) least squares, which we discuss in Chapter 3. However, least squares is only one of many possible ways to fit the linear model.
For example, let's say we are statistical consultants working for a firm that is trying to improve sales by running an advertising campaign. In this case, we know historically how much money was spent on Radio and TV advertising, and what the sales were during those times. Now we want to estimate a model to determine Sales based on Radio and TV advertising. A simple parametric model would fit a linear plane to the data. In 2 dimensions we would fit a line, and in 3 dimensions (sales, radio, and TV) we would fit a plane. This linear model would be of the form:

$$\text{Sales} \approx \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{Radio}$$
Fitting it to actual data from the Advertising dataset, we get the following:
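A sketch of this fit in R, assuming Advertising.csv from the book's website has been downloaded into the working directory and has columns named Sales, TV, and Radio (adjust the names to match the actual file):

```r
advertising <- read.csv("Advertising.csv")

# Least squares fit of the plane Sales ~ beta_0 + beta_1*TV + beta_2*Radio
fit <- lm(Sales ~ TV + Radio, data = advertising)

coef(fit)     # estimated beta_0 (intercept), beta_1 (TV), beta_2 (Radio)
summary(fit)  # also answers the inference questions above: which predictors
              # matter, and by how much
```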
We can think about fitting a piece of paper such that the paper is as close to as many points as possible. It's hard to visualize what this looks like in 4 or more dimensions (as we add features), but planes work the same way.
Since we have assumed a linear relationship between the response and the two predictors, the entire fitting problem reduces to estimating the coefficients $\beta_0$, $\beta_1$, and $\beta_2$.
The final step in modeling is evaluating the fit and selecting the best model. Looking at the fitted plane, the linear model is not quite right: the true $f()$ has some curvature that is not captured in the linear fit. However, the linear fit still appears to do a reasonable job of capturing the positive relationship between Radio and TV advertising and Sales. Let's move on to non-parametric models and see if we can get a better fit.