AutoViML / Auto_TS

Automatically build ARIMA, SARIMAX, VAR, FB Prophet and XGBoost Models on Time Series data sets with a Single Line of Code. Created by Ram Seshadri. Collaborators welcome.

feature selection in time series model

kennis222 opened this issue · comments

Hello contributors,

When I fit a time series model such as the VAR or SARIMAX model on a multivariate time series and print model.summary(), the output suggests that the model performed feature selection automatically. However, after reading the source code (for example build_var and init), I could not figure out which method the package uses to select features. For example, there are 6 time series as inputs, but the model summary only shows the relationship between one input and the dependent variable (that is, the output).

Does the package implement any feature selection when using time series models such as VAR or SARIMAX? If the answer is "Yes", where is it implemented? Thank you.

Hi @kennis222 👍
Could you please illustrate by means of a couple of screenshots as to what you are inputting and what you are getting back? That would be really helpful to understand what you mean here.
AutoViML

For example, I used the AirQualityUCI dataset and set the target variable to CO(GT). After fitting the model, I printed the model summary, shown in the screenshots below. The summary only lists 'CO(GT)' and 'NO2(GT)' as dependent variables. It seems that the model performed feature selection automatically, but I am not sure. I read the source code (for example build_var and init), but I could not figure out which method the package uses to select features.
Screen Shot 2022-01-14 at 3 57 29 PM
Screen Shot 2022-01-14 at 3 57 38 PM
Screen Shot 2022-01-14 at 3 58 14 PM

Hi @kennis222 👍
You are correct - the VAR model does feature selection automatically - it is not something that is encoded. It is part of the VAR modeling process. I hope that this screenshot clarifies how it works.
image

You can see that it selects the best variable automatically here
image

I hope this answers your question. If so, please close the issue.
Thanks
AutoViML

Hi contributors,
I tried using the statsmodels VARMAX model directly, following the package's source code. However, the results are not the same as the results from the package.
Screen Shot 2022-01-17 at 1 53 34 PM

Here is the code I used for this example.
import statsmodels.api as sm
from auto_ts import auto_timeseries
import pandas as pd
import numpy as np

data = pd.read_excel("AirQualityUCI.xlsx")
data.drop(['Unnamed: 15','Unnamed: 16'],axis=1,inplace=True)
df = data.groupby(['Date']).mean().reset_index()
df['Date'] = df['Date'].astype('str')
length = int(len(df)*0.9)
train_data = df[:length]
test_data = df[length:]
print(train_data.shape)
print(test_data.shape)

# Auto_TS

ts_column = 'Date'
target ='CO(GT)'
model = auto_timeseries(score_type='rmse',model_type=['VAR'],verbose=2)
model.fit( traindata=train_data, ts_column=ts_column, target=target, cv=3,sep = ',')
var_model = model.get_best_model()
var_model.summary()
#varmax
train_data_test = train_data.copy()
train_data_test.index= train_data_test['Date']
train_data_test.drop(['Date'],axis=1,inplace=True)
endog = train_data_test.loc[train_data_test.index, list(train_data_test.columns)]  # select all variables as endog
mod = sm.tsa.VARMAX(endog=endog,order=(1,1))
res = mod.fit(maxiter=1000, disp=False)
print(res.summary())

P.S. When developing the package, did you implement any procedure for finding the optimal parameters or features?

I think I have figured out the issue. No offense intended; I am just curious whether this could be improved.
In build_var.py, under the find_best_parameters function:
'''
for d_val in range(1, dmax):
# Takes the target column and one other endogenous column at a time
# and makes a prediction based on that. Then selects the best
# exogenous column at the end.
'''
The comment "Takes the target column and one other endogenous column at a time" explains why only one variable (AH) was selected as the best variable for the VAR model. However, the best VAR model may include more than one other endogenous variable.
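
The one-partner-at-a-time search described in that comment can be sketched roughly as follows. This is a minimal illustration, not the actual code in build_var.py: the names `select_best_partner` and `make_varmax_aic_scorer` are hypothetical, and scoring by AIC is an assumption (the package may rank candidates by a different criterion). The toy scores at the bottom are made-up numbers used only to show the control flow, not real results.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def select_best_partner(
    target: str,
    candidates: Iterable[str],
    score_fn: Callable[[List[str]], float],
) -> Tuple[Optional[str], float]:
    """Pair the target with one other column at a time and keep the
    candidate whose two-column model gets the lowest score."""
    best_col, best_score = None, float("inf")
    for col in candidates:
        if col == target:
            continue  # the target is never paired with itself
        score = score_fn([target, col])
        if score < best_score:
            best_col, best_score = col, score
    return best_col, best_score


def make_varmax_aic_scorer(df, order=(1, 1)):
    """Hypothetical scorer: fit a statsmodels VARMAX on the given columns
    of a pandas DataFrame and return its AIC (lower is better)."""
    def score(columns: List[str]) -> float:
        import statsmodels.api as sm  # lazy import, only needed for real fits
        res = sm.tsa.VARMAX(endog=df[columns], order=order).fit(
            maxiter=1000, disp=False
        )
        return res.aic
    return score


# Toy demonstration with made-up scores (no model is fitted here).
toy_scores = {
    ("CO(GT)", "NO2(GT)"): 120.0,
    ("CO(GT)", "AH"): 95.5,
    ("CO(GT)", "T"): 130.2,
}
best_col, best_score = select_best_partner(
    "CO(GT)",
    ["CO(GT)", "NO2(GT)", "AH", "T"],
    lambda cols: toy_scores[tuple(cols)],
)
print(best_col, best_score)
```

With real data, one could pass `make_varmax_aic_scorer(train_data_test)` as `score_fn`. Because each fit only ever sees two columns, a model that needs several endogenous variables together is never evaluated, which is the limitation raised above.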

Hi @kennis222 👍
You are correct that we choose at most one variable. The reason is that VARMAX is very slow even for small datasets. If we were to try every possible combination of variables, the search could run for a very long time even on tiny datasets. Hence we chose to limit it to one.
If you have suggestions for making it faster and better, let us know.
AutoViML
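
To make the cost argument concrete, here is a small sketch that counts how many model fits each strategy needs for a single lag setting. The function name `count_var_fits` is hypothetical and the count ignores the additional loop over lag orders, so the real number of fits would be a multiple of these figures.

```python
from math import comb


def count_var_fits(n_other: int) -> dict:
    """Count model fits for n_other candidate columns besides the target.

    pairwise:    target plus exactly one other column per fit
    all_subsets: target plus every non-empty subset of the other columns
    """
    pairwise = n_other
    all_subsets = sum(comb(n_other, k) for k in range(1, n_other + 1))
    return {"pairwise": pairwise, "all_subsets": all_subsets}


# Six input series as in the original question: the target plus five others.
print(count_var_fits(5))   # {'pairwise': 5, 'all_subsets': 31}
print(count_var_fits(12))  # {'pairwise': 12, 'all_subsets': 4095}
```

The pairwise count grows linearly with the number of columns, while the exhaustive count grows as 2^n - 1, which is why an exhaustive search quickly becomes impractical when each VARMAX fit is slow.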

Thank you.