wendyminai / APPROACHES-TO-MISSING-DATA-IN-TIME-SERIES-

I introduce the basic idea and implementation of five imputation approaches. In short, filling with a single value works well for short periods of missing values. MICE should be one of your first choices if the missing period is relatively long. It is explicitly designed for imputation tasks and can effectively learn data patterns.

APPROACHES-TO-MISSING-DATA-IN-TIME-SERIES-

Missing data is a common problem in real-world datasets. If you’ve ever wondered how to handle missing values in time series data effectively, this post is for you! I will introduce five approaches for imputing missing values in time series data.

I’ll focus on univariate time series most of the time. To handle missing values in multivariate time series, we can impute each series individually. The last approach also considers the interactions among multiple time series, so it handles multivariate time series well.

Example dataset

I’ll use a time series from the public world energy consumption dataset. Figure 1 shows the aggregated daily energy consumption. I then randomly pick four periods (two of 10 days and two of 30 days) and remove the data. We will try to impute those four periods.

[Figure 1: aggregated daily energy consumption]
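To make the setup concrete, here is a minimal sketch of how such gaps could be created. The synthetic series below is only a stand-in for the energy data, and `remove_periods` is a hypothetical helper written for this illustration, not code from the original notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for the daily energy series: weekly pattern plus noise.
idx = pd.date_range("2007-01-01", periods=365, freq="D")
values = 100 + 10 * np.sin(2 * np.pi * np.arange(365) / 7) + rng.normal(0, 2, 365)
series = pd.Series(values, index=idx, name="energy")


def remove_periods(s, lengths, rng, margin=30):
    """Return a copy of s with one NaN gap per entry in lengths, plus the gap mask."""
    out = s.copy()
    mask = np.zeros(len(s), dtype=bool)
    for n in lengths:
        while True:
            start = int(rng.integers(margin, len(s) - margin - n))
            # Require one clean day before and after, so gaps never touch or overlap.
            if not mask[start - 1 : start + n + 1].any():
                mask[start : start + n] = True
                break
    out[mask] = np.nan
    return out, mask


# Two 10-day gaps and two 30-day gaps, as in the post.
gapped, mask = remove_periods(series, [10, 10, 30, 30], rng)
print(gapped.isna().sum())
```

Keeping the untouched `series` around is what later lets us score each imputation against the true values.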

1. Imputation with a single value

This is the most straightforward approach: we fill every missing value with the same single value. Typical choices are the last available observation or the mean or median of the observation window. The figure below shows this approach.

[Figure: imputation with a single value]

As you can see, this approach is easy to implement and understand. The disadvantage is that it doesn’t learn any patterns from the available data, and the filled values have no variance.
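A minimal sketch of the three typical choices with pandas, on a tiny made-up series (the numbers are illustrative only):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2007-01-01", periods=10, freq="D")
s = pd.Series([5.0, 6.0, 7.0, np.nan, np.nan, np.nan, 9.0, 8.0, 7.0, 6.0], index=idx)

filled_last = s.ffill()                # last observation carried forward
filled_mean = s.fillna(s.mean())       # mean of the observed values
filled_median = s.fillna(s.median())   # median of the observed values

print(pd.DataFrame({"last": filled_last, "mean": filled_mean, "median": filled_median}))
```

Here the mean and median are taken over the whole series; restricting them to a window around each gap would be a small variation on the same idea.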

2. Imputation by interpolation

Interpolation is a standard technique from numerical analysis and statistics. The idea is that if we can fit a function to the known data points, we can construct new data points on the fitted curve. In terms of implementation, pandas provides the interpolate() method on Series and DataFrame objects. The figure below demonstrates the results of linear interpolation.

[Figure: imputation by linear interpolation]
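A minimal sketch of linear interpolation with pandas on a small made-up series. `method="time"` is shown as an alternative that weights by the actual timestamps, which matters when the index is irregular:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2007-01-01", periods=7, freq="D")
s = pd.Series([10.0, 12.0, np.nan, np.nan, np.nan, 20.0, 19.0], index=idx)

linear = s.interpolate(method="linear")    # straight line between 12.0 and 20.0
time_based = s.interpolate(method="time")  # same result here, since the index is regular

print(linear)
```

The three missing days are filled with evenly spaced values between the two neighbouring observations.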

3. Imputation by Gaussian processes

This approach is like a more general version of interpolation. The idea is that many candidate functions could have generated the existing data points; some are plausible and some are unlikely. A Gaussian process tries to learn the distribution over those functions. The figure below gives one example of imputation by a Gaussian process. It estimates the mean and variance of each missing observation, so we get confidence intervals for the imputed values. As you can see, the longer the missing period, the wider the confidence interval.

[Figure: imputation by Gaussian process with confidence intervals]
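Here is a minimal sketch using scikit-learn's GaussianProcessRegressor on synthetic data. The kernel (RBF plus white noise), its starting hyperparameters, and the 1.96 multiplier for an approximate 95% interval are my assumptions for illustration; the original notebook may use a different library or kernel.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
t = np.arange(120, dtype=float)
y = np.sin(2 * np.pi * t / 30) + rng.normal(0, 0.1, t.size)

missing = np.zeros(t.size, dtype=bool)
missing[40:45] = True   # short gap (5 points)
missing[70:95] = True   # long gap (25 points)

# Smooth signal plus observation noise; hyperparameters are tuned during fit().
kernel = 1.0 * RBF(length_scale=10.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)
gp.fit(t[~missing].reshape(-1, 1), y[~missing])

# Posterior mean and standard deviation at the missing time steps.
mean, std = gp.predict(t[missing].reshape(-1, 1), return_std=True)
lower, upper = mean - 1.96 * std, mean + 1.96 * std

print("max std, short gap:", std[:5].max())
print("max std, long gap: ", std[5:].max())
```

The standard deviation grows with the distance to the nearest observed point, which is why the band is widest in the middle of the long gap.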

4. Imputation by Forecasting

This approach forecasts the values for the missing period as if they were future values. Time series forecasting is well studied, and various models can be applied for this purpose. I used the Theta model and got the results shown in the figure below.

[Figure: imputation by forecasting with the Theta model]

5. Imputation by Multivariate Imputation by Chained Equations (MICE)

This approach uses a specialized imputation algorithm that yields the best results for the example dataset.

The MICE method involves iteratively imputing the missing values in a dataset using a series of regression models, with the imputed values being used to update the estimates of the regression parameters.

I use scikit-learn’s IterativeImputer to implement this approach (note: this estimator is still experimental as of January 2023). The original MICE approach returns multiple imputations, whereas the scikit-learn implementation returns only a single imputation.

I still obtain multiple imputations with a trick: first, I segment the raw signal into rolling windows and add month information to each window. Then MICE imputes the missing values in each window. Since one data point can appear in multiple windows, we end up with several imputations for the same data point.
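The rolling-window trick can be sketched as follows. This is my reading of the description, not the original notebook: the window length of 14 days, the synthetic series, and summarising the per-window estimates by their mean and standard deviation are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (unlocks IterativeImputer)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n, window = 200, 14
idx = pd.date_range("2007-01-01", periods=n, freq="D")
values = 100 + 10 * np.sin(2 * np.pi * np.arange(n) / 7) + rng.normal(0, 1, n)
series = pd.Series(values, index=idx)
gapped = series.copy()
gapped.iloc[90:100] = np.nan

# Each row is one rolling window of consecutive days, plus the month of its first day.
starts = np.arange(n - window + 1)
positions = starts[:, None] + np.arange(window)[None, :]
windows = gapped.to_numpy()[positions]
month = idx.month.to_numpy()[starts].reshape(-1, 1)
X = np.hstack([windows, month])

imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)[:, :window]  # drop the month column again

# A day appears in up to `window` rows, so it gets several estimates.
estimates = pd.DataFrame({"pos": positions.ravel(), "value": X_imputed.ravel()})
stats = estimates.groupby("pos")["value"].agg(["mean", "std"])

missing_pos = np.flatnonzero(gapped.isna().to_numpy())
imputed = gapped.copy()
imputed.iloc[missing_pos] = stats.loc[missing_pos, "mean"].to_numpy()
print(stats.loc[missing_pos])
```

The spread of the per-window estimates gives a rough uncertainty measure for each imputed day, in the spirit of the multiple imputations that the original MICE procedure produces.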

The figure below shows the imputation results.

[Figure: imputation by MICE]

CONCLUSION

Compared to the other approaches above, MICE gives the best results. Of the four missing sections, sections 1, 2, and 4 have smaller RMSE, and most of the filled values are within the 99% confidence interval. Section 3 has obvious outliers (around 2007-02-21 to 2007-03-04) that do not follow the existing patterns.
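For reference, here is a minimal sketch of the two checks mentioned above: RMSE against the held-out true values, and the share of true values falling inside an interval. The numbers are made up, and building the 99% interval as mean ± 2.576 standard deviations assumes normally distributed errors, which the post does not state.

```python
import numpy as np


def rmse(truth, imputed):
    """Root mean squared error between true and imputed values."""
    truth, imputed = np.asarray(truth, float), np.asarray(imputed, float)
    return float(np.sqrt(np.mean((truth - imputed) ** 2)))


def coverage(truth, lower, upper):
    """Fraction of true values that fall inside [lower, upper]."""
    truth = np.asarray(truth, float)
    return float(np.mean((truth >= lower) & (truth <= upper)))


# Made-up values for one missing section.
truth = np.array([10.0, 12.0, 11.0, 13.0])
mean = np.array([11.0, 12.0, 10.0, 15.0])
std = np.array([1.0, 1.0, 1.0, 0.5])

z99 = 2.576  # two-sided 99% quantile of the standard normal (assumption)
score = rmse(truth, mean)
inside = coverage(truth, mean - z99 * std, mean + z99 * std)
print(score, inside)
```

Computing both numbers per missing section makes it easy to see which gaps a method handles well and which it does not.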

MICE not only provides confidence intervals, but it also works for multivariate time series. Because it is built on chained regressions, it can learn the interactions among multiple time series and use the learned patterns to fill the missing values. I may write a detailed article on MICE for time series in the future.
