unit8co / darts

A Python library for user-friendly forecasting and anomaly detection on time series.

Home Page: https://unit8co.github.io/darts/


Static Covariates In Darts Global Model

ETTAN93 opened this issue · comments

How do you define static covariates appropriately in Darts? I was referring to the example notebook in Darts about static covariates but am unable to make it work in the context of a historical forecast.

I am trying to run a historical forecast for a global LightGBM model. I constructed my full and training series and added static covariates to them:

for park_name, park_data_df in park_data_dict.items():
    target_series = TimeSeries.from_dataframe(park_data_df[[config.data.target_col]])[start_date:end_date]
    future_cov_series = TimeSeries.from_dataframe(park_data_df[config.data.future_cov])[start_date:end_date]
    past_cov_series = TimeSeries.from_dataframe(park_data_df[config.data.past_cov])[start_date:end_date]
  
    target_series_train = target_series[start_date:split_date]
    future_cov_train = future_cov_series[start_date:split_date]
    past_cov_train = past_cov_series[start_date:split_date]
  
    target_series_st = target_series.with_static_covariates(pd.DataFrame(data = {"park_name": [park_name]}))                                                          
    target_series_train_st = target_series_st[start_date:split_date]

    target_all_parks.append(target_series)
    future_cov_all_parks.append(future_cov_series)
    past_cov_all_parks.append(past_cov_series)
    target_train_all_parks.append(target_series_train)
    future_cov_train_all_parks.append(future_cov_train)
    past_cov_train_all_parks.append(past_cov_train)
                                                            
    target_all_parks_st.append(target_series_st)
    target_train_all_parks_st.append(target_series_train_st)  

Then I transformed the static covariate series:

from darts.dataprocessing.transformers import StaticCovariatesTransformer
scaler = StaticCovariatesTransformer()
target_train_all_parks_st_transformed = scaler.fit_transform(target_train_all_parks_st)
target_all_parks_st_transformed = scaler.transform(target_all_parks_st)

Then I passed it to my model:

model_estimator.fit(
    series = target_train_all_parks_st_transformed,
    past_covariates= past_cov_train_all_parks,
    future_covariates= future_cov_train_all_parks
)

hf_results = model_estimator.historical_forecasts(
    series=target_all_parks_st_transformed,
    past_covariates=past_cov_all_parks,
    future_covariates=future_cov_all_parks,
    ....
)

However, I am getting the error:
[screenshot of the error traceback]

Hi @ETTAN93, at first glance I don't see why it shouldn't work.

Could you add a minimal reproducible example with some toy data?

Maybe just create 2 series and add the names of the parks as static covariates.

Also, note that the static covariates transformer uses ordinal encoding for strings by default, which might not be what you want for the parks. Maybe consider using a OneHotEncoder (sklearn) as transformer_cat instead (if the number of parks is not too large).
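
To illustrate the difference, here is a minimal sklearn-only sketch (the park names are made up, and this uses sklearn directly rather than Darts):

```python
# Minimal sketch (made-up park names) of the two encoding options
# mentioned above, using sklearn directly rather than Darts.
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

parks = np.array([["park_a"], ["park_b"], ["park_c"]])

# Ordinal encoding maps each park to a single integer, which implies
# an (arbitrary) ordering between parks.
ordinal = OrdinalEncoder().fit_transform(parks)
print(ordinal.ravel())  # [0. 1. 2.]

# One-hot encoding adds one binary column per park: no implied order,
# but the number of columns grows with the number of parks.
onehot = OneHotEncoder().fit_transform(parks).toarray()
print(onehot.shape)  # (3, 3)
```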

Hi @dennisbader, I created a toy sample set here and replicated the rest of the steps, getting the same error. I also added the OneHotEncoder as you suggested, although I'm not sure whether one-hot encoding will cause problems if the number of parks grows to around 50.

import pandas as pd 
import numpy as np
from darts import TimeSeries
from dateutil.relativedelta import relativedelta
from darts.dataprocessing.transformers import StaticCovariatesTransformer
from darts.models import LightGBMModel
from sklearn.preprocessing import OneHotEncoder

park_list = ['a', 'b']
df_dict = {}

start_date = pd.Timestamp("2019-09-01 00:00:00", tz="utc").tz_convert(None)
split_date = pd.Timestamp("2020-01-31 23:59:00", tz="utc").tz_convert(None)  
end_date =  pd.Timestamp("2020-02-28 23:59:00", tz="utc").tz_convert(None) 

date_range = pd.date_range(start=start_date, end=end_date, freq='15T') 
random_floats = np.random.uniform(low=1.0, high=100.0, size=(len(date_range), 3))

# Create a DataFrame using the random floats
df_a = pd.DataFrame(random_floats, index=date_range, columns=['target', 'past_cov', 'future_cov'])
df_b = pd.DataFrame(random_floats, index=date_range, columns=['target', 'past_cov', 'future_cov'])
df_dict['a'] = df_a
df_dict['b'] = df_b

target_all = []
future_cov_all = []
past_cov_all = []
target_train_all = []
future_cov_train_all = []
past_cov_train_all = []
target_all_st = []
target_train_all_st = []

for name, df in df_dict.items():
    target_series = TimeSeries.from_dataframe(df[['target']])[start_date:end_date]
    future_cov_series = TimeSeries.from_dataframe(df[['future_cov']])[start_date:end_date]
    past_cov_series = TimeSeries.from_dataframe(df[['past_cov']])[start_date:end_date]
    
    target_series_train = target_series[start_date:split_date]
    future_cov_train = future_cov_series[start_date:split_date]
    past_cov_train = past_cov_series[start_date:split_date]
         
    target_series_st = target_series.with_static_covariates(pd.DataFrame(data = {"park_name": [name]}))
    target_series_train_st = target_series_st[start_date:split_date]                                        

    test_set_start_date = split_date + relativedelta(minutes = 1)
    print(f"Train Set Start: {target_series_train.time_index[0]}, End: {target_series_train.time_index[-1]}")
    print(f"Test Set Start: {test_set_start_date}, End: {end_date}")
    
    target_all.append(target_series)
    future_cov_all.append(future_cov_series)
    past_cov_all.append(past_cov_series)
    
    target_train_all.append(target_series_train)
    future_cov_train_all.append(future_cov_train)
    past_cov_train_all.append(past_cov_train)
    
    target_all_st.append(target_series_st)
    target_train_all_st.append(target_series_train_st)  
    
# pass an encoder instance as transformer_cat (the first positional argument is transformer_num)
scaler = StaticCovariatesTransformer(transformer_cat=OneHotEncoder())
target_train_all_st_transformed = scaler.fit_transform(target_train_all_st)
target_all_st_transformed = scaler.transform(target_all_st)

forecast_horizon = 3
target_lags = None
past_cov_lags = list(range(-8, 0))
future_cov_lags = list(range(-22, 21))

lgbm_model = LightGBMModel(
    lags=target_lags,
    lags_past_covariates=past_cov_lags, 
    lags_future_covariates=future_cov_lags, 
    output_chunk_length=forecast_horizon,
    n_jobs=-1,
    random_state=42,
    multi_models=True,
)

lgbm_model.fit(
    series = target_train_all_st_transformed,
    past_covariates= past_cov_train_all,
    future_covariates= future_cov_train_all
)

hf_results = lgbm_model.historical_forecasts(
    series=target_all_st_transformed, 
    past_covariates=past_cov_all,
    future_covariates=future_cov_all,
    start=test_set_start_date, 
    retrain=False,
    forecast_horizon=3,
    stride=1,
    train_length = None,
    verbose=False,
    last_points_only=False,
) 

Error message I get:
[screenshot of the error traceback]

If I pass target_train_all and target_all to fit and historical_forecasts instead, it works fine. The problem only occurs when a static covariate is added.

Hi @ETTAN93, thanks for adding the example and reporting the issue. It is indeed a bug. It happens when using a regression model with lags=None but some static covariates. In historical_forecasts, the target series were not passed to the tabularization, and hence there was a mismatch between static covariates from the training and prediction sets.

#2426 will fix this.

@dennisbader got it, thanks. If I am training a global LGBM model with the goal of using it to predict new parks that do not have sufficient data for training, how would that work if a static covariate is passed when training the initial model? What would you need to pass as a 'static covariate' to the model to predict a new park?

@dennisbader I saw the fix is now merged to main. Will it be released in a minor version anytime soon?

Also, regarding the previous question, if I train a global model including static covariates, how would that work when predicting a new park directly without training on that new park (assuming insufficient data)?

@ETTAN93, with categorical data, you should have all the possible values in your training set already. So if you use a OneHotEncoder(), it will not work with a new park ID. If you use an OrdinalEncoder() and rely on lightgbm's built-in categorical support (with categorical_static_covariates at model creation), it will also not work.
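
The unseen-category problem can be reproduced with sklearn alone (the park IDs here are made up):

```python
# Sketch (made-up park IDs): an encoder fitted only on the training parks
# rejects a park it has never seen, which is why a brand-new park ID
# cannot simply be passed through at prediction time.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()  # default handle_unknown="error"
enc.fit(np.array([["park_a"], ["park_b"]]))

try:
    enc.transform(np.array([["park_new"]]))  # not in the training set
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```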

So something that you could do if you have new parks for prediction:

  • instead of using a unique park ID per park, define groups of similar parks and assign the same park ID to all parks within a group
  • for the new park, find the group that the park belongs to and then assign it the same ID
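
A minimal sketch of that grouping idea (the group names and the fallback rule are hypothetical, chosen for illustration only):

```python
# Hypothetical mapping from individual parks to coarser groups; the
# group label, not the unique park ID, is used as the static covariate.
park_to_group = {
    "park_a": "coastal",
    "park_b": "coastal",
    "park_c": "inland",
}

def group_for(park_name: str) -> str:
    """Return the static-covariate group for a park.

    A park unseen during training is assigned to an existing group
    (hard-coded to "coastal" here purely for illustration; in practice
    you would pick the group of the most similar known parks).
    """
    return park_to_group.get(park_name, "coastal")

print(group_for("park_b"))    # coastal
print(group_for("park_new"))  # coastal -> reuses a known encoding
```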

And about the release. We've just released version 0.30.0. So, if there are no other immediate issues that we have to fix, we'll probably not release within the next 2-3 weeks.

You could install directly from master with pip install "darts @ git+https://github.com/unit8co/darts.git@master".