In this lesson, we'll review what you learned previously, introduce integrated models (the "I" in ARIMA, which stands for integrated ARMA), and extend them to models that can cope with seasonality in time series.
In this lab you will:
- Preprocess a dataset to meet ARIMA based forecasting assumptions
- Identify best model parameters using grid search for p, d, q and seasonal p, d, q parameters
- Describe the components of an ARIMA model
- Create visualizations of future values as well as confidence intervals for the future predictions
- Evaluate an ARIMA model with validation testing
- Explain how validation testing is different with time series data than normal data
Time series provide the opportunity to predict or forecast future values based on previous values. Such analyses can be used to forecast trends in economics, weather, capacity planning, and more. The specific properties of time series data mean that specialized statistical methods are usually required.
So far, we have seen different techniques to make time series stationary, as well as white noise, moving average, AR, MA, and ARMA models. Recall that your data needs to be detrended (or made stationary) before you can go ahead and use ARMA models; it is easier to add trends and seasonality back in after you have modeled your data. This leaves ARMA with two important limitations:
- ARMA models assume that the detrending already happened
- ARMA neglects that seasonality can happen
Let's summarize the three situations we can encounter with time series:

- A strictly stationary series with no dependence among the values. This is the easy case wherein we can model the residuals as white noise. But this is very rare.
- A non-stationary series with significant dependence among values, but no seasonality. In this case we can use ARMA models after we have detrended, or we can use an integrated ARMA model that detrends for us.
- A non-stationary series with significant dependence among values, and seasonality. In this case we can use a seasonal ARIMA, or SARIMA, model.
In this tutorial, we aim to produce reliable forecasts of a given time series by applying one of the most commonly used methods for time series forecasting: ARIMA. After that, we'll talk about seasonality and how to cope with it.

One of the methods available in Python to model and predict future points of a time series is known as SARIMAX, which stands for Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors. Here, we will primarily focus on the ARIMA component, which is used to fit time series data in order to better understand and forecast future points in the series.
For this lab you will use the dataset that we have seen before: "Atmospheric CO2 from Continuous Air Samples at Mauna Loa Observatory, Hawaii, U.S.A.," which contains CO2 samples from March 1958 to December 2001. Let's bring in this data and plot it as demonstrated earlier. You will need to perform the following tasks (a possible solution is sketched after the scaffold cell below):
- Import necessary libraries
- Import the CO2 dataset from `statsmodels`
- Ensure that the type of the column representing the dates is correct and set it as the index of the DataFrame
- Resample the data into monthly groups and take the monthly average
- Fill in the missing values using the `.fillna()` and `.bfill()` methods
- Inspect the first few rows of the data
- Plot the time series of the data
# Import necessary libraries
import warnings
warnings.filterwarnings('ignore')
import itertools
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from matplotlib.pylab import rcParams
plt.style.use('ggplot')
# Load the dataset
dataset = None
# Convert into DataFrame
df = None
# Update to datetime type
# Set as index
# The 'MS' string groups the data in buckets by start of the month
# The term 'bfill' means we fill missing values with the next valid observation (backward fill)
# Plot the time series
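If you get stuck, here is a minimal sketch of one possible way to complete the cell above, using the `co2` dataset shipped with `statsmodels` (note that `df` ends up as a monthly Series after resampling):

```python
# Load the dataset
dataset = sm.datasets.co2.load_pandas()

# Convert into DataFrame (the index is already a DatetimeIndex)
df = dataset.data

# The 'MS' string groups the data into buckets by start of the month
df = df['co2'].resample('MS').mean()

# Backfill: fill each missing value with the next valid observation
df = df.fillna(df.bfill())

# Inspect the first few rows and plot the series
print(df.head())
df.plot(figsize=(15, 6))
plt.show()
```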
As noted earlier, the time series has spikes reflecting an obvious seasonal pattern, as well as an overall increasing trend.
One of the most common methods used in time series forecasting is known as the ARIMA model, which stands for AutoRegressive Integrated Moving Average. ARIMA is a model that can be fitted to time series data in order to better understand or predict future points in the series.

Let's have a quick introduction to ARIMA. The ARIMA forecast for a stationary time series is nothing but a linear equation (much like linear regression). The predictors depend on the parameters (p, d, q) of the ARIMA model:
- `p` is the auto-regressive part of the model. It allows us to incorporate the effect of past values into our model. Intuitively, this would be similar to stating that it is likely to rain tomorrow if it has been raining for the past 3 days. AR terms are just lags of the dependent variable. For instance, if p is 5, the predictors for x(t) will be x(t-1), …, x(t-5).
- `d` is the integrated component of an ARIMA model. This value determines the amount of differencing: the number of times consecutive observations are subtracted from each other to remove trend. Intuitively, this would be similar to stating that it is likely to rain tomorrow if the difference in the amount of rain over the last n days is small.
- `q` is the moving average part of the model, which expresses the error of the model as a linear combination of the error values observed at previous time points. MA terms are lagged forecast errors in the prediction equation. For instance, if q is 5, the predictors for x(t) will be e(t-1), …, e(t-5), where e(i) is the difference between the moving average at the i-th instant and the actual value.
These three distinct integer values, (p, d, q), are used to parametrize ARIMA models; because of that, ARIMA models are denoted with the notation ARIMA(p, d, q). A seasonal ARIMA model adds a second set of parameters for the seasonal component and is written ARIMA(p, d, q)(P, D, Q)s. Together these parameters account for seasonality, trend, and noise in datasets:

- `(p, d, q)` are the non-seasonal parameters described above.
- `(P, D, Q)` follow the same definition but are applied to the seasonal component of the time series.
- The term `s` is the periodicity of the time series (4 for quarterly data, 12 for monthly data with a yearly seasonal cycle, etc.).
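To make the roles of these parameters concrete, here is the non-seasonal forecast equation written out in the x(t)/e(t) notation used above (a sketch, with c a constant and a(i), b(i) the fitted coefficients, applied after the series has been differenced d times):

```
x(t) = c + a(1)*x(t-1) + ... + a(p)*x(t-p)          # AR part: p lags of the series
         + e(t) + b(1)*e(t-1) + ... + b(q)*e(t-q)   # MA part: q lagged errors
```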
A detailed article on these parameters is available here.
The seasonal ARIMA method can appear daunting because of the multiple tuning parameters involved. We will now describe how to automate the process of identifying the optimal set of parameters for the seasonal ARIMA time series model.
The first step towards fitting an ARIMA model is to find the values of `ARIMA(p,d,q)(P,D,Q)s` that produce the desired output. Selecting these parameters by hand requires domain expertise and time. We shall instead generate small ranges for these parameters and use a "grid search" to iteratively explore different combinations. For each combination of parameters, we fit a new seasonal ARIMA model with the `SARIMAX()` function from the `statsmodels` library and assess its overall quality.

The detailed `SARIMAX` documentation can be viewed here.
Let's begin by generating the example combinations of parameters that we wish to use (a possible solution is sketched after the scaffold cell below):

- Define the p, d, and q parameters to take the value 0 or 1 using the `range()` function. (Note: larger values would make our model computationally expensive to run; you can try this as an additional experiment)
- Generate combinations for `(p, d, q)` using `itertools.product`
- Similarly, generate seasonal combinations `(P, D, Q, s)`, with s = 12 (constant)
- Print some example combinations for seasonal ARIMA
# Define the p, d and q parameters to take any value between 0 and 2
p = d = q = None
# Generate all different combinations of p, d and q triplets
pdq = None
# Generate all different combinations of seasonal p, d and q triplets (use 12 for frequency)
pdqs = None
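A minimal sketch of one way to complete this cell (names match the scaffold above):

```python
# Define the p, d and q parameters to take any value between 0 and 2
p = d = q = range(0, 2)

# Generate all different combinations of p, d and q triplets
pdq = list(itertools.product(p, d, q))

# Generate all different combinations of seasonal p, d and q triplets, with s = 12
pdqs = [(x[0], x[1], x[2], 12) for x in itertools.product(p, d, q)]

# Print some example combinations
print('Examples of parameter combinations for Seasonal ARIMA:')
print('SARIMAX: {} x {}'.format(pdq[1], pdqs[1]))
print('SARIMAX: {} x {}'.format(pdq[2], pdqs[2]))
```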
For evaluating the model, we shall use the AIC (Akaike Information Criterion) value, which is provided by ARIMA models fitted using the `statsmodels` library. The Akaike Information Criterion is an estimator of the relative quality of statistical models for a given set of data: given a collection of models for the data, AIC estimates the quality of each model relative to each of the other models. Thus, AIC provides a means for model selection.

A model that fits the data very well while using lots of features will be assigned a larger AIC score than a model that uses fewer features to achieve the same goodness-of-fit. Therefore, we are interested in finding the model that yields the lowest AIC value. To achieve this, perform the following tasks (a possible implementation is sketched after the cell below):
- Initialize an empty list to store results
- Iterate through all the parameters in `pdq` with the parameters in seasonal `pdqs` (nested loop) to create a grid
- Run `SARIMAX` from `statsmodels` for each iteration. Details can be found here. Set `enforce_stationarity` and `enforce_invertibility` to False
- Get the results in each iteration with `model.fit()` and store the AIC values
- Find the lowest AIC and select its parameters for further analysis

NOTE:

- Integrate exception handling with `continue`
- An overview of the Akaike Information Criterion can be viewed here
# Run a grid with pdq and seasonal pdq parameters calculated above and get the best AIC value
# Find the parameters with minimal AIC value
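One possible implementation of this grid search, assuming the `df`, `pdq`, and `pdqs` names from the sketches above:

```python
# Collect (order, seasonal order, AIC) for every parameter combination
results_list = []
for comb in pdq:
    for seasonal_comb in pdqs:
        try:
            model = sm.tsa.statespace.SARIMAX(df,
                                              order=comb,
                                              seasonal_order=seasonal_comb,
                                              enforce_stationarity=False,
                                              enforce_invertibility=False)
            output = model.fit()
            results_list.append([comb, seasonal_comb, output.aic])
        except Exception:
            # Some combinations fail to converge; skip them and continue
            continue

# Find the parameters with the minimal AIC value
results_df = pd.DataFrame(results_list, columns=['pdq', 'pdqs', 'aic'])
print(results_df.loc[results_df['aic'].idxmin()])
```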
The output of our code suggests that `ARIMA(1, 1, 1)x(1, 1, 1, 12)` yields the lowest AIC value of 277.78. We should therefore consider this the optimal choice out of all the models we have considered.
Using grid search, we have identified the set of parameters that produces the best fitting model to our time series data. We can proceed to analyze this particular model in more depth.
We'll start by plugging the optimal parameter values into a new SARIMAX model.
# Plug the optimal parameter values into a new SARIMAX model
# Fit the model and print results
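A sketch of this step, plugging in the (1, 1, 1)x(1, 1, 1, 12) parameters found above:

```python
# Plug the optimal parameter values into a new SARIMAX model
ARIMA_MODEL = sm.tsa.statespace.SARIMAX(df,
                                        order=(1, 1, 1),
                                        seasonal_order=(1, 1, 1, 12),
                                        enforce_stationarity=False,
                                        enforce_invertibility=False)

# Fit the model and print the table of coefficients
output = ARIMA_MODEL.fit()
print(output.summary().tables[1])
```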
The model returns a lot of information, but we'll focus only on the table of coefficients. The `coef` column shows the weight (i.e. importance) of each feature and how each one impacts the time series patterns, while the `P>|z|` column informs us of the significance of each weight.

For our time series, we see that each weight has a p-value lower than or close to 0.05, so it is reasonable to retain all of them in our model.
Next, we shall run model diagnostics to ensure that none of the assumptions made by the model have been violated.
Call the `.plot_diagnostics()` method on the ARIMA output below:
# Call plot_diagnostics() on the results calculated above
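Assuming the fitted results are stored in `output` as in the sketch above, this is a one-liner:

```python
# Generate the four standard residual diagnostic plots
output.plot_diagnostics(figsize=(16, 8))
plt.show()
```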
The purpose here is to ensure that the residuals are uncorrelated and normally distributed with zero mean. If these assumptions do not hold, we cannot move forward and the model needs further tweaking.
Let's check for these assumptions from diagnostics plots.
- In the top right plot, we see that the red KDE line follows closely with the N(0,1) line (where N(0,1) is the standard notation for a normal distribution with mean 0 and standard deviation 1). This is a good indication that the residuals are normally distributed.
- The qq-plot on the bottom left shows that the ordered distribution of residuals (blue dots) follows the linear trend of the samples taken from a standard normal distribution N(0,1). Again, this is a strong indication that the residuals are normally distributed.
- The residuals over time (top left plot) don't display any obvious seasonality and appear to be white noise. This is confirmed by the autocorrelation (i.e. correlogram) plot on the bottom right, which shows that the time series residuals have low correlation with lagged versions of themselves.

These observations lead us to conclude that the residuals of our model are essentially uncorrelated and that it provides a satisfactory fit to help forecast future values.
In order to validate the model, we start by comparing predicted values to real values of the time series, which will help us understand the accuracy of our forecasts.
The `.get_prediction()` and `.conf_int()` methods allow us to obtain the values and associated confidence intervals for forecasts of the time series.
In the cell below (a sketch follows the cell):

- Get the predictions from 1 January 1998 through the end of the time series (December 2001)
- Get the confidence intervals for all predictions
- For `get_prediction()`, set the `dynamic` parameter to False to ensure that we produce one-step ahead forecasts, meaning that the forecast at each point is generated using the full history up to that point
# Get predictions starting from 01-01-1998 and calculate confidence intervals
pred = None
pred_conf = None
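A minimal sketch, reusing the `output` results object from above:

```python
# One-step ahead forecasts from January 1998 onwards
pred = output.get_prediction(start=pd.to_datetime('1998-01-01'), dynamic=False)

# Confidence intervals for those forecasts
pred_conf = pred.conf_int()
```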
We shall now plot the real and forecasted values of the CO2 time series to assess how well we did (one possible implementation follows the cell below):

- Plot the observed values from the dataset, starting at 1990
- Use the `.predicted_mean.plot()` method to plot the predictions
- Plot the confidence intervals overlapping the predicted values
# Plot real vs predicted values along with confidence interval
# Plot observed values
# Plot predicted values
# Plot the range for confidence intervals
# Set axes labels
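One possible implementation, assuming `df` is the monthly CO2 series from earlier:

```python
# Plot observed values from 1990 onwards
ax = df['1990':].plot(label='observed', figsize=(15, 6))

# Plot the one-step ahead forecasts on the same axes
pred.predicted_mean.plot(ax=ax, label='One-step ahead forecast', alpha=0.7)

# Shade the confidence interval around the forecasts
ax.fill_between(pred_conf.index,
                pred_conf.iloc[:, 0],
                pred_conf.iloc[:, 1],
                color='k', alpha=0.2)

# Set axes labels
ax.set_xlabel('Date')
ax.set_ylabel('CO2 Levels')
plt.legend()
plt.show()
```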
The forecasts align with the true values, as seen above, following the overall increasing trend. We shall also check the accuracy of our forecasts using MSE (Mean Squared Error), which gives us the average error of our forecasts. For each predicted value, we compute its distance to the true value and square the result. The results need to be squared so that positive and negative differences do not cancel each other out when we compute the overall mean.
# Get the real and predicted values
CO2_forecasted = None
CO2_truth = None
# Compute the mean square error
mse = None
print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)))
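A sketch of this computation:

```python
# Get the real and predicted values over the forecast period
CO2_forecasted = pred.predicted_mean
CO2_truth = df['1998-01-01':]

# Compute the mean squared error by hand
mse = ((CO2_forecasted - CO2_truth) ** 2).mean()
print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)))
```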
The MSE of our one-step ahead forecasts yields a value of 0.07, which is very low. An MSE of 0 would indicate that the estimator predicts the observations with perfect accuracy, which would be an ideal scenario but is not typically possible.

We can achieve a deeper insight into the model's predictive power using dynamic forecasts. In this case, we only use information from the time series up to a certain point; after that, forecasts are generated using values from previously forecasted time points.
Repeat the calculation above for predictions from 1998 onwards, this time using dynamic forecasting by setting `dynamic` to True (a sketch follows the cell below).
# Get dynamic predictions with confidence intervals as above
pred_dynamic = None
pred_dynamic_conf = None
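A minimal sketch, again reusing `output`:

```python
# Dynamic forecasts: after 1998-01-01, forecasts are built on earlier forecasts
pred_dynamic = output.get_prediction(start=pd.to_datetime('1998-01-01'), dynamic=True)
pred_dynamic_conf = pred_dynamic.conf_int()
```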
Plotting the observed and forecasted values of the time series, we see that the overall forecasts are accurate even when using dynamic forecasts. All forecasted values (red line) match pretty closely to the ground truth (blue line), and are well within the confidence intervals of our forecast.
# Plot the dynamic forecast with confidence intervals as above
Once again, we quantify the predictive performance of our forecasts by computing the MSE.
# Extract the predicted and true values of our time series
CO2_forecasted = None
CO2_truth = None
# Compute the mean square error
mse = None
print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)))
The predicted values obtained from the dynamic forecasts yield an MSE of 1.01. This is slightly higher than for the one-step ahead forecasts, which is to be expected given that we are relying on less historical data from the time series.

Both the one-step ahead and dynamic forecasts confirm that this time series model is valid. However, much of the interest in time series forecasting lies in the ability to forecast future values far ahead in time.

We will now describe how to leverage our seasonal ARIMA time series model to forecast future values. The `.get_forecast()` method of our results object can compute forecasted values for a specified number of steps ahead.
# Get forecast 500 steps ahead in future
prediction = None
# Get confidence intervals of forecasts
pred_conf = None
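A sketch of this step:

```python
# Get forecast 500 steps (months) ahead in the future
prediction = output.get_forecast(steps=500)

# Get confidence intervals of the forecasts
pred_conf = prediction.conf_int()
```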
We can use the output of this code to plot the time series and forecasts of its future values.
# Plot future predictions with confidence intervals
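One way to produce this plot, reusing the names above:

```python
# Plot the full history and the future forecasts together
ax = df.plot(label='observed', figsize=(15, 6))
prediction.predicted_mean.plot(ax=ax, label='Forecast')

# The shaded band widens as we forecast further into the future
ax.fill_between(pred_conf.index,
                pred_conf.iloc[:, 0],
                pred_conf.iloc[:, 1],
                color='k', alpha=0.25)

ax.set_xlabel('Date')
ax.set_ylabel('CO2 Levels')
plt.legend()
plt.show()
```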
Both the forecasts and associated confidence interval that we have generated can now be used to further understand the time series and foresee what to expect. Our forecasts show that the time series is expected to continue increasing at a steady pace.
As we forecast further out into the future, it is natural for us to become less confident in our values. This is reflected by the confidence intervals generated by our model, which grow larger as we move further out into the future.
- Change the start date of your dynamic forecasts to see how this affects the overall quality of your forecasts.
- Try more combinations of parameters to see if you can improve the goodness-of-fit of your model.
- Use a different metric to select the best model. For example, we used the AIC measure to find the best model, but you could seek to optimize the out-of-sample mean squared error instead.
In this lab, we described how to implement a seasonal ARIMA model in Python. We made extensive use of the `pandas` and `statsmodels` libraries and showed how to run model diagnostics, as well as how to produce forecasts of time series.