feature selection in time series model

Question

kennis222 opened this issue 3 years ago · comments

Hello contributors,

When I use a multivariate time series model such as VAR or SARIMAX, fit it, and print model.summary(), the output suggests that the model performed feature selection automatically. However, after reading the source code (for example build_var and init), I could not work out which method the package uses to select features. For example, with 6 time series as inputs, the model summary only shows the relationship between one input and the dependent variable (that is, the output).

Does the package implement any feature selection when using time series models such as VAR or SARIMAX? If so, where is the relevant code? Thank you.

Home of AutoViz, AutoViML and featurewiz · Answer 1 · Wed Jan 12 2022 23:12:39 GMT+0800 (China Standard Time)

Hi @kennis222 👍
Could you please illustrate by means of a couple of screenshots as to what you are inputting and what you are getting back? That would be really helpful to understand what you mean here.
AutoViML

kennis · Answer 2 · Fri Jan 14 2022 13:02:38 GMT+0800 (China Standard Time)

For example, I used the AirQualityUCI dataset and set the target variable to CO(GT). After fitting the model, I printed the model summary, shown in the screenshots below. The summary only listed 'CO(GT)' and 'NO2(GT)' as dependent variables, so it seemed that the model had performed feature selection automatically, but I was not sure. I read the source code (for example build_var and init) but could not work out which method the package uses to select features.

Home of AutoViz, AutoViML and featurewiz · Answer 3 · Sat Jan 15 2022 09:38:00 GMT+0800 (China Standard Time)

Hi @kennis222 👍
You are correct - the VAR model does feature selection automatically - it is not something that is encoded. It is part of the VAR modeling process. I hope that this screenshot clarifies how it works.

You can see that it selects the best variable automatically here

I hope this answers your question. If so, please close the issue.
Thanks
AutoViML

kennis · Answer 4 · Mon Jan 17 2022 14:03:35 GMT+0800 (China Standard Time)

Hi contributors,
I tried calling the statsmodels VARMAX model directly, following the source code. However, the results are not the same as those from the package.

Here is the code I used for this example.
import statsmodels.api as sm
from auto_ts import auto_timeseries
import pandas as pd
import numpy as np

data = pd.read_excel("AirQualityUCI.xlsx")
data.drop(['Unnamed: 15','Unnamed: 16'],axis=1,inplace=True)
df = data.groupby(['Date']).mean().reset_index()
df['Date'] = df['Date'].astype('str')
length = int(len(df)*0.9)
train_data = df[:length]
test_data = df[length:]
print(train_data.shape)
print(test_data.shape)

# Auto_TS
ts_column = 'Date'
target = 'CO(GT)'
model = auto_timeseries(score_type='rmse', model_type=['VAR'], verbose=2)
model.fit(traindata=train_data, ts_column=ts_column, target=target, cv=3, sep=',')
var_model = model.get_best_model()
var_model.summary()

# VARMAX
train_data_test = train_data.copy()
train_data_test.index = train_data_test['Date']
train_data_test.drop(['Date'], axis=1, inplace=True)
endog = train_data_test.loc[train_data_test.index, list(train_data_test.columns)]  # select all variables as endog
mod = sm.tsa.VARMAX(endog=endog, order=(1, 1))
res = mod.fit(maxiter=1000, disp=False)
print(res.summary())

P.S. When developing the package, did you use any procedure to find the optimal parameters or features?

kennis · Answer 5 · Mon Jan 17 2022 14:34:16 GMT+0800 (China Standard Time)

I think I have figured out the issue. No offense intended; I am just curious whether it can be improved.
In build_var.py, in the find_best_parameters function:
'''
for d_val in range(1, dmax):
# Takes the target column and one other endogenous column at a time
# and makes a prediction based on that. Then selects the best
# exogenous column at the end.
'''
"Takes the target column and one other endogenous column at a time" is why only one best variable, AH, was selected for the VAR model. However, the best VAR model may include more than one other endogenous variable.
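
The search described in that code comment can be sketched roughly as follows. This is not the package's code: `select_best_partner`, `toy_score`, and the RMSE values are hypothetical stand-ins, and a real run would replace `toy_score` with "fit a two-variable model on the target and one other column, then return its error".

```python
from typing import Callable, Iterable, Optional, Tuple


def select_best_partner(
    target: str,
    candidates: Iterable[str],
    score_pair: Callable[[str, str], float],
) -> Tuple[Optional[str], float]:
    """Return the candidate whose (target, candidate) pair scores lowest."""
    best_col, best_score = None, float("inf")
    for col in candidates:
        if col == target:
            continue  # the target is always included, so skip it as a partner
        score = score_pair(target, col)  # e.g. RMSE of a two-variable model
        if score < best_score:
            best_col, best_score = col, score
    return best_col, best_score


# Made-up scores for illustration only; they are not results from the thread.
toy_rmse = {"NO2(GT)": 0.42, "AH": 0.31}


def toy_score(target: str, col: str) -> float:
    # Hypothetical stand-in for fitting a model and measuring its error.
    return toy_rmse[col]


best_col, best_score = select_best_partner(
    "CO(GT)", ["CO(GT)", "NO2(GT)", "AH"], toy_score
)
print(best_col, best_score)
```

Because each candidate is only ever evaluated alongside the target, a search of this shape can never return a model with two or more additional variables.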

Home of AutoViz, AutoViML and featurewiz · Answer 6 · Tue Jan 18 2022 07:02:48 GMT+0800 (China Standard Time)

hi @kennis222 👍
You are correct that we choose at most one variable. The reason is that VARMAX is very slow even for small datasets. If we tried every possible combination of variables, the search could run for a very long time even on tiny datasets. Hence we chose to limit it to one.
If you have suggestions for making it faster and better, let us know.
AutoViML
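
To put rough numbers on the trade-off described in this answer, the sketch below counts model fits per parameter setting for three search strategies. The function names are hypothetical, and greedy forward selection is included only as one possible middle ground; nothing in this thread says the package implements it.

```python
def pairwise_fits(n: int) -> int:
    """Target plus one other column at a time: one fit per candidate."""
    return n


def exhaustive_fits(n: int) -> int:
    """Every non-empty subset of the n candidate columns."""
    return 2 ** n - 1


def greedy_forward_fits(n: int) -> int:
    """Worst case for adding one column per round: n + (n - 1) + ... + 1."""
    return n * (n + 1) // 2


# n = 5 matches the example of 6 input series (1 target + 5 others).
for n in (5, 10, 12):
    print(n, pairwise_fits(n), exhaustive_fits(n), greedy_forward_fits(n))
```

These counts apply to each parameter setting tried, so the totals grow further when the search also loops over model orders.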

kennis · Answer 7 · Tue Jan 18 2022 09:45:09 GMT+0800 (China Standard Time)

Thank you.