D.A.P.S. is an automated pipeline that loads a time-series-style dataset, performs data-drift analysis, visualizes EDA/ETL insights, engineers time-series features, trains a scikit-learn machine learning model, and forecasts data.
Time series data is a collection of observations obtained through repeated measurements over time. It generally looks like this:
Time Value Region_Demarcation Channel_Type
0 2020-01 112 North Agent
1 2020-02 118 South International
2 2020-03 132 South Retail
3 2020-04 129 West Retail
4 2020-05 121 North HQ
.. ... ...
139 2022-08 606 East Military
140 2022-09 508 South Agent
141 2022-10 461 North Retail
142 2022-11 390 East Local
143 2022-12 ' ' North NaN
This pipeline forecasts data and is divided into six steps that are all automated.
First, we clean the data so that only the values of interest remain, and we index it by datetime. For our example, the data now looks like this:
Value
Datetime Index
2020-01 112
2020-02 118
2020-03 132
2020-04 129
2020-05 121
... ...
2022-08 606
2022-09 508
2022-10 461
2022-11 390
2̶0̶2̶2̶-̶1̶2̶ ̶ ̶ ̶ ̶ ̶ ̶ ̶ ̶ ̶ ̶N̶a̶N ... note that NaN removal depends on your end goal + packages you may use
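The cleaning step above can be sketched in plain pandas. This is a minimal illustration, not the pipeline's actual code: the helper name `to_voi_series`, its parameters, and the toy rows are assumptions made for the example, and whether missing values are dropped is left as a switch because it depends on your end goal.

```python
import numpy as np
import pandas as pd

# Toy rows shaped like the example table; the last Value is blank on purpose.
raw = pd.DataFrame({
    "Time": ["2020-01", "2020-02", "2020-03", "2022-11", "2022-12"],
    "Value": ["112", "118", "132", "390", " "],
    "Region_Demarcation": ["North", "South", "South", "East", "North"],
    "Channel_Type": ["Agent", "International", "Retail", "Local", np.nan],
})

def to_voi_series(df, time_col="Time", value_col="Value", drop_missing=True):
    """Keep one value of interest and index it by datetime (hypothetical helper)."""
    out = df[[time_col, value_col]].copy()
    out[time_col] = pd.to_datetime(out[time_col], format="%Y-%m")
    # Anything that cannot be parsed as a number (such as ' ') becomes NaN.
    out[value_col] = pd.to_numeric(out[value_col], errors="coerce")
    out = out.set_index(time_col).sort_index()
    out.index.name = "Datetime Index"
    if drop_missing:
        out = out.dropna(subset=[value_col])
    return out

clean = to_voi_series(raw)
print(clean)
```

Passing `drop_missing=False` keeps the 2022-12 row with a NaN value, which some downstream packages may prefer.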
With the ETL process complete and the time series converted into a DateTime-indexed VoI (value of interest) series, we next perform Exploratory Data Analysis. For this, the automated pipeline uses two tools: DTale and Pandas-Profiling. Both have graphical interfaces that make data visualization easy and intuitive. With them, you can automatically create reports covering correlations (for multi-column data), distributions, interactions, and missing values across all the VoIs in a dataset. DTale removes the need to hand-code each visualization, making the process about 10-20x faster than the 25-30 minutes it typically takes to code each visual. These tools are state-of-the-art free alternatives to expensive tools such as Tableau. All of this is automated in the D.A.P.S. pipeline as shown below:
dtale.mp4
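For a sense of what such reports contain, the core numbers (distributions, missing values, correlations) can be approximated in plain pandas. The sketch below does not call DTale or Pandas-Profiling; the helper `quick_eda`, the second column `Units`, and the synthetic data are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic monthly data with two numeric columns (illustrative only).
rng = np.random.default_rng(0)
idx = pd.date_range("2020-01-01", periods=36, freq="MS")
df = pd.DataFrame({
    "Value": np.linspace(100, 600, 36) + rng.normal(0, 20, 36),
    "Units": np.linspace(10, 60, 36) + rng.normal(0, 5, 36),
}, index=idx)
df.iloc[-1, 0] = np.nan  # one missing Value, like the last row of the example

def quick_eda(frame):
    """Collect the basic numbers an EDA report would show (hypothetical helper)."""
    numeric = frame.select_dtypes("number")
    return {
        "distribution": numeric.describe(),  # count, mean, std, quartiles
        "missing": frame.isna().sum(),       # missing values per column
        "correlation": numeric.corr(),       # pairwise Pearson correlation
    }

report = quick_eda(df)
for name, table in report.items():
    print(f"--- {name} ---")
    print(table)
```

The dedicated tools add interactive charts and drill-downs on top of these summaries, which is where most of the time saving comes from.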
When we analyze data, it is important to understand how stable that data is. If we don't account for shifts over time, our forecasting models will not capture variance from new market factors and will eventually degrade in performance. The traditional way of checking for data drift is tedious and time consuming, so we make the process faster and more efficient with Popmon, a package that generates an interactive report analyzing shifts in data over time. Creating these plots automatically within the D.A.P.S. pipeline cuts the time needed by over 90% compared with the 30 minutes it typically takes to generate them manually. Population monitoring to ensure stability is automated in the D.A.P.S. pipeline as shown below:
popmon.mp4
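The idea behind a stability check can be illustrated without Popmon: compare each time window against a reference window and flag large deviations. The sketch below is a deliberately simple stand-in written in plain pandas, not Popmon's method or API. The helper `drift_by_year`, the z-score threshold of 3, and the deterministic toy data are all assumptions.

```python
import numpy as np
import pandas as pd

# Deterministic toy data: two stable years around 100, then a shift to 150.
pattern = np.tile([-2.0, -1.0, 0.0, 1.0, 2.0, 0.0], 2)  # 12 monthly offsets
values = np.concatenate([100 + pattern, 100 + pattern, 150 + pattern])
idx = pd.date_range("2020-01-01", periods=36, freq="MS")
series = pd.Series(values, index=idx, name="Value")

def drift_by_year(s, reference_year, threshold=3.0):
    """Flag years whose mean is far from the reference year's mean (hypothetical helper)."""
    ref = s[s.index.year == reference_year]
    rows = []
    for year, chunk in s.groupby(s.index.year):
        # Standard error of the chunk mean, using the reference year's spread.
        se = ref.std(ddof=1) / np.sqrt(len(chunk))
        z = (chunk.mean() - ref.mean()) / se
        rows.append({"year": year, "mean": chunk.mean(), "z": z,
                     "drift": abs(z) > threshold})
    return pd.DataFrame(rows).set_index("year")

drift = drift_by_year(series, reference_year=2020)
print(drift)
```

Here only the final year is flagged. A full monitoring report tracks many more statistics per window, but the comparison against a reference period is the same basic idea.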
With stable data and the EDA completed, we next turn to automating feature engineering for a time series. Basic features such as the mean or the median are normally derived in code, which works well when looking at a data set as a whole. For time series and forecasting, however, we care about features over multiple windows of time, and recalculating even something as simple as the mean again and again by hand is inefficient. To solve this and automate it entirely, we use TSFresh. With TSFresh, feature engineering runs almost instantly, making the process about 9 times faster than the 45 minutes it took me to extract similar features manually.
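To make "features over multiple windows" concrete, here is a small sketch of the manual approach, using pandas rolling windows rather than TSFresh. The helper `window_features`, the window sizes, and the sample values are illustrative assumptions. TSFresh automates a far larger set of features than the three shown here.

```python
import pandas as pd

# Small illustrative series (values are made up for the example).
idx = pd.date_range("2020-01-01", periods=8, freq="MS")
series = pd.Series([10, 12, 14, 13, 15, 18, 20, 19],
                   index=idx, name="Value", dtype="float64")

def window_features(s, windows=(3, 6)):
    """Rolling mean, median and standard deviation per window size (hypothetical helper)."""
    feats = pd.DataFrame(index=s.index)
    for w in windows:
        roll = s.rolling(window=w)
        feats[f"mean_{w}"] = roll.mean()
        feats[f"median_{w}"] = roll.median()
        feats[f"std_{w}"] = roll.std()
    # Rows without a full window for every size contain NaN and are dropped.
    return feats.dropna()

features = window_features(series)
print(features)
```

Each row describes the series as seen from that date looking back 3 and 6 months, which is the kind of input a forecasting model can learn from.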
Now that we have performed feature engineering, we can use the features to train models for predictive analysis. An automatic process gathers the features calculated earlier in the pipeline and passes them to an automated function, which splits the data into train and test sets. This means we no longer have to manually upload or download data.
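A minimal sketch of such a split is shown below, assuming lagged values as the features and scikit-learn's `train_test_split` with shuffling turned off so that the test set stays strictly later in time than the training set. The helper `make_supervised`, the lag count, and the test fraction are assumptions for illustration and may differ from what the pipeline does.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative monthly series.
idx = pd.date_range("2020-01-01", periods=24, freq="MS")
series = pd.Series(np.arange(100.0, 124.0), index=idx, name="Value")

def make_supervised(s, n_lags=3):
    """Build (X, y): each value is predicted from its n_lags predecessors (hypothetical helper)."""
    frame = pd.DataFrame({f"lag_{k}": s.shift(k) for k in range(1, n_lags + 1)})
    frame["target"] = s
    frame = frame.dropna()  # the first n_lags rows have incomplete history
    return frame.drop(columns="target"), frame["target"]

X, y = make_supervised(series)

# shuffle=False keeps chronological order: train on the past, test on the most recent rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)
print(X_train.index.max(), "<", X_test.index.min())
```

Shuffling is disabled because a random split would leak future observations into training, which tends to make a forecasting model look better than it is.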
Lastly, we use the trained machine learning model for predictive analysis. Since D.A.P.S. is open source and free (and always will be), it uses scikit-learn's modelling features to predict future values from previous data. A rough high-level overview of training and forecasting is as follows:
Train Data: Test Data: End Result
Value Value Value
Datetime Index Datetime Index Datetime Index
2020-01 112 | 2023-01 ' ' | 2020-01 112
2020-02 118 | | 2020-02 118
2020-03 132 | | 2020-03 132
2020-04 129 | | 2020-04 129
2020-05 121 | | 2020-05 121
... ... | | -------> ...
2022-08 606 | | 2022-08 606
2022-09 508 | | 2022-09 508
2022-10 461 | | 2022-10 461
2022-11 390 | | 2022-11 390
2023-01 xxx (Predicted Value)
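The train-then-forecast flow in the diagram can be sketched with scikit-learn as follows. This is one possible instance, not necessarily the model the pipeline uses: `LinearRegression` on lagged values, the helper `forecast_next`, and the linear toy series are assumptions chosen so the result is easy to check by eye.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Illustrative series with a steady trend of +5 per month.
idx = pd.date_range("2020-01-01", periods=24, freq="MS")
series = pd.Series(100.0 + 5.0 * np.arange(24), index=idx, name="Value")

def forecast_next(s, n_lags=3):
    """Fit on lagged values and append a one-step-ahead prediction (hypothetical helper)."""
    lagged = pd.DataFrame({f"lag_{k}": s.shift(k) for k in range(1, n_lags + 1)})
    data = lagged.assign(target=s).dropna()
    model = LinearRegression().fit(data.drop(columns="target"), data["target"])

    # Features for the unseen month: the most recent n_lags observations.
    latest = pd.DataFrame([[s.iloc[-k] for k in range(1, n_lags + 1)]],
                          columns=lagged.columns)
    next_index = s.index[-1] + pd.DateOffset(months=1)
    prediction = model.predict(latest)[0]

    # End result: the original series plus the predicted value.
    return pd.concat([s, pd.Series([prediction], index=[next_index], name=s.name)])

result = forecast_next(series)
print(result.tail(3))
```

To forecast several steps ahead with this approach, the function can be called repeatedly on its own output, at the cost of errors accumulating with each step.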
The pipeline also plots each predicted data point, using a different color to distinguish past values from future values, and outputs the result as a GIF as well: