Nixtla / mlforecast

Scalable machine 🤖 learning for time series forecasting.

Home Page: https://nixtlaverse.nixtla.io/mlforecast

[Distributed] Adding external variable for distributed

iamyihwa opened this issue · comments

What happened + What you expected to happen

fcst = DistributedMLForecast()
fcst.predict(h = 12, X_df = X_sf_test).collect()
-> TypeError: predict() got an unexpected keyword argument 'X_df'
fcst.predict(h = 12, new_df = X_sf_test).collect()
-> PythonException: An exception was thrown from a UDF: 'ValueError: The following columns are missing: ['y']'. Full traceback below:

Neither new_df (https://github.com/Nixtla/mlforecast/blob/main/mlforecast/distributed/forecast.py) nor X_df (https://nixtla.github.io/mlforecast/docs/how-to-guides/exogenous_features.html) seems to work for passing external variables to the distributed version of the forecast.

Versions / Dependencies

0.10.0

Reproduction script

# from mlforecast.distributed.models.spark.lgb import SparkLGBMForecast
# models = [SparkLGBMForecast()]
from mlforecast.distributed import DistributedMLForecast
from mlforecast.distributed.models.spark.xgb import SparkXGBForecast
from window_ops.expanding import expanding_mean  # assumed source of expanding_mean

models = [SparkXGBForecast()]  # SparkXGBRegressor and LightGBMRegressor were also tried
fcst = DistributedMLForecast(
    models,
    freq='D',
    lags=[1],
    lag_transforms={
        1: [expanding_mean],
    },
    date_features=['dayofweek'],
)
fcst.fit(
    Y_sf_train,  # Spark dataframe with the training series
    static_features=['x', 'y'],
)

fcst.predict(h=12, X_df=X_sf_test)

Issue Severity

None

Hey @iamyihwa, thanks for using mlforecast. The 0.10.0 version takes dynamic_dfs, which is a list of pandas dataframes; can you try using that?
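In code, the suggestion amounts to something like this sketch; the toPandas() conversion is an assumption here, since X_sf_test appears to be a Spark dataframe:

future_exog = X_sf_test.toPandas()  # the dynamic dfs must be pandas, not Spark
preds = fcst.predict(h=12, dynamic_dfs=[future_exog]).collect()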

@jmoralez Thanks for your suggestion.

However, I'm still getting an error after using dynamic_dfs.
Also, one question: which version does the documentation correspond to?

2023-11-06 17:02:14,311 INFO XGBoost-PySpark: _fit Running xgboost-2.0.1 on 1 workers with
booster params: {'objective': 'reg:squarederror', 'device': 'cpu', 'nthread': 1}
train_call_kwargs_params: {'verbose_eval': True, 'num_boost_round': 100}
dmatrix_kwargs: {'nthread': 1, 'missing': nan}
2023-11-06 17:02:25,233 INFO XGBoost-PySpark: _fit Finished xgboost training!
TypeError: cannot pickle '_thread.RLock' object

Details of the error are:
/local_disk0/.ephemeral_nfs/envs/pythonEnv-14049207-8c62-4864-93a7-4bc1fb414ecc/lib/python3.8/site-packages/mlforecast/utils.py in inner(*args, **kwargs)
162 new_args.append(kwargs.pop(arg_names[i]))
163 new_args.append(kwargs.pop(old_name))
--> 164 return f(*new_args, **kwargs)
165
166 return inner

/local_disk0/.ephemeral_nfs/envs/pythonEnv-14049207-8c62-4864-93a7-4bc1fb414ecc/lib/python3.8/site-packages/mlforecast/distributed/forecast.py in predict(self, h, dynamic_dfs, before_predict_callback, after_predict_callback, new_df, horizon, new_data)
521 partition_results = self.partition_results
522 schema = self._get_predict_schema()
--> 523 res = fa.transform(
524 partition_results,
525 DistributedMLForecast._predict,

/local_disk0/.ephemeral_nfs/envs/pythonEnv-14049207-8c62-4864-93a7-4bc1fb414ecc/lib/python3.8/site-packages/fugue/workflow/api.py in transform(df, using, schema, params, partition, callback, ignore_errors, persist, as_local, save_path, checkpoint, engine, engine_conf, as_fugue)
140 else:
141 raise
--> 142 tdf = src.transform(
143 using=using,
144 schema=schema,

/local_disk0/.ephemeral_nfs/envs/pythonEnv-14049207-8c62-4864-93a7-4bc1fb414ecc/lib/python3.8/site-packages/fugue/workflow/workflow.py in transform(self, using, schema, params, pre_partition, ignore_errors, callback)
557 if pre_partition is None:
558 pre_partition = self.partition_spec
--> 559 df = self.workflow.transform(
560 self,
561 using=using,

/local_disk0/.ephemeral_nfs/envs/pythonEnv-14049207-8c62-4864-93a7-4bc1fb414ecc/lib/python3.8/site-packages/fugue/workflow/workflow.py in transform(self, using, schema, params, pre_partition, ignore_errors, callback, *dfs)
2036 tf._has_rpc_client = not isinstance(callback, EmptyRPCHandler) # type: ignore
2037 tf.validate_on_compile()
-> 2038 return self.process(
2039 *dfs,
2040 using=RunTransformer,

/local_disk0/.ephemeral_nfs/envs/pythonEnv-14049207-8c62-4864-93a7-4bc1fb414ecc/lib/python3.8/site-packages/fugue/workflow/workflow.py in process(self, using, schema, params, pre_partition, *dfs)
1698 """
1699 _dfs = self._to_dfs(*dfs)
-> 1700 task = Process(
1701 len(_dfs),
1702 processor=using,

/local_disk0/.ephemeral_nfs/envs/pythonEnv-14049207-8c62-4864-93a7-4bc1fb414ecc/lib/python3.8/site-packages/fugue/workflow/_tasks.py in __init__(self, input_n, processor, schema, params, pre_partition, deterministic, lazy, input_names)
255 ):
256 self._processor = _to_processor(processor, schema)
--> 257 self._processor._params = ParamDict(params)
258 self._processor._partition_spec = PartitionSpec(pre_partition)
259 self._processor.validate_on_compile()

/local_disk0/.ephemeral_nfs/envs/pythonEnv-14049207-8c62-4864-93a7-4bc1fb414ecc/lib/python3.8/site-packages/triad/collections/dict.py in __init__(self, data, deep)
175 def __init__(self, data: Any = None, deep: bool = True):
176 super().__init__()
--> 177 self.update(data, deep=deep)
178
179 def __setitem__( # type: ignore

/local_disk0/.ephemeral_nfs/envs/pythonEnv-14049207-8c62-4864-93a7-4bc1fb414ecc/lib/python3.8/site-packages/triad/collections/dict.py in update(self, other, on_dup, deep)
262 for k, v in to_kv_iterable(other):
263 if on_dup == ParamDict.OVERWRITE or k not in self:
--> 264 self[k] = copy.deepcopy(v) if deep else v
265 elif on_dup == ParamDict.THROW:
266 raise KeyError(f"{k} exists in dict")

/usr/lib/python3.8/copy.py in deepcopy(x, memo, _nil)
144 copier = _deepcopy_dispatch.get(cls)
145 if copier is not None:
--> 146 y = copier(x, memo)
147 else:
148 if issubclass(cls, type):

/usr/lib/python3.8/copy.py in _deepcopy_dict(x, memo, deepcopy)
228 memo[id(x)] = y
229 for key, value in x.items():
--> 230 y[deepcopy(key, memo)] = deepcopy(value, memo)
231 return y
232 d[dict] = _deepcopy_dict

/usr/lib/python3.8/copy.py in deepcopy(x, memo, _nil)
170 y = x
171 else:
--> 172 y = _reconstruct(x, memo, *rv)
173
174 # If is its own copy, don't memoize.

/usr/lib/python3.8/copy.py in _reconstruct(x, memo, func, args, state, listiter, dictiter, deepcopy)
268 if state is not None:
269 if deep:
--> 270 state = deepcopy(state, memo)
271 if hasattr(y, '__setstate__'):
272 y.__setstate__(state)

[... the deepcopy → _reconstruct → _deepcopy_dict frames above repeat several more times while copying nested state ...]

/usr/lib/python3.8/copy.py in deepcopy(x, memo, _nil)
159 reductor = getattr(x, "__reduce_ex__", None)
160 if reductor is not None:
--> 161 rv = reductor(4)
162 else:
163 reductor = getattr(x, "__reduce__", None)

Can you provide the command you're running? Please note that the dynamic dfs should be pandas dataframes, not Spark.

Also, this possibility will be removed in 0.11.0, which will be released in the next couple of days and is the version the documentation corresponds to. Can you describe your use case? The distributed interface is meant for training only, with the assumption that once you've trained the model you can convert it to a local version and compute predictions with that. Would that work for you?
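As a minimal sketch of that workflow (the to_local() method name is an assumption about the conversion step; verify it against your installed version):

fcst.fit(Y_sf_train, static_features=['x', 'y'])  # train on the cluster
local_fcst = fcst.to_local()       # assumed conversion to a regular MLForecast
preds = local_fcst.predict(h=12)   # pandas dataframe with the forecasts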

I am using the command fcst.predict(h = 12, dynamic_dfs = X_df).collect()

Thanks for explaining what it will be in the future.
For my use case, it would be fine to pass in a pandas dataframe.
I am expecting much faster compute with the distributed Spark version.
(By the way, the non-Spark version also seems to be pretty fast, but I guess with a much larger dataset there will be a big difference?)

Yes, initially I passed the Spark dataframe, but now I've changed to pandas and am getting a different error.

from mlforecast.distributed import DistributedMLForecast
from mlforecast.distributed.models.spark.xgb import SparkXGBForecast
from window_ops.expanding import expanding_mean  # assumed source of expanding_mean

models = [SparkXGBForecast()]
fcst = DistributedMLForecast(
    models,
    freq='W-SAT',
    lags=[1],
    lag_transforms={
        1: [expanding_mean],
    },
    target_transforms=[Mean_Scaler()],  # Mean_Scaler is a user-defined transform
    date_features=['dayofweek'],
)
fcst.fit(
    Y_sf_train_cur_level,
    static_features=['embedding_x', 'embedding_y'],
)
fcst.predict(h=12, dynamic_dfs=X_df).collect()

X_df is a pandas dataframe.

predict() got an unexpected keyword argument 'dynamic_dfs'

2023-11-07 08:15:33,616 INFO XGBoost-PySpark: _fit Finished xgboost training!
TypeError: predict() got an unexpected keyword argument 'dynamic_dfs'

TypeError Traceback (most recent call last)
in
22 static_features=['embedding_x', 'embedding_y'],
23 )
---> 24 fcst.predict(h = 12, dynamic_dfs = X_df).collect()
25 ## Currently not possible to get the X variable to work for the spark version

TypeError: predict() got an unexpected keyword argument 'dynamic_dfs'

Are you still on 0.10.0? We released 0.11.0 yesterday and that removed the dynamic_dfs argument. Also, it should be a list, e.g. fcst.predict(h = 12, dynamic_dfs = [X_df]).collect().

About the speed: the models use multithreading when predicting, and we also have multithreading for the feature updates (if you set num_threads>1 in the forecast constructor), so if the data fits on one machine it's probably better to just use the regular interface. Also, 0.11.0 includes support for lag transformations implemented in C++, which should be way faster for the predict step (guide).
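In code, that local setup could look like the following sketch; the ExpandingMean import path and the choice of LinearRegression are illustrative assumptions, not taken from this thread:

from sklearn.linear_model import LinearRegression
from mlforecast import MLForecast
from mlforecast.lag_transforms import ExpandingMean  # assumed module path for the C++-backed transforms

fcst = MLForecast(
    models=[LinearRegression()],            # any scikit-learn style regressor
    freq='D',
    lags=[1],
    lag_transforms={1: [ExpandingMean()]},  # built-in, C++-backed expanding mean
    num_threads=4,                          # parallelizes the feature updates
)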

Please let us know if this works, sorry for the troubles.

@jmoralez Yes, I noticed the version has been upgraded!
Thanks for the great work!!

If dynamic_dfs has been removed, how can I add non-static external variables in the distributed version?

Sorry to mix another library into this post, but are external variables not supported in neuralforecast either?

It's not possible anymore because supporting distributed dataframes as X_df is very hard, and passing pandas seemed like overkill. But if that works for you, I think we could add the possibility of passing X_df as a pandas dataframe to the distributed version; it would then be broadcast to all workers and each would use its corresponding series.
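Purely as an illustration of that proposal (hypothetical at the time of writing, not an existing API):

# Hypothetical: X_df passed as a pandas dataframe, broadcast to every worker,
# each of which keeps the rows for the series it holds.
preds = fcst.predict(h=12, X_df=future_exog_pdf)  # future_exog_pdf: pandas dataframe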

Neuralforecast does support external variables (guide).

@jmoralez Sure, it would be great to have the option to add external variables to the distributed version!
What would it mean, though, to pass in a pandas dataframe? Would I need to add extra steps to make the distributed version of the model work?

Neuralforecast does support external variables (guide).

This link doesn't seem to work anymore... I saw lots of updates on the webpage.
Perhaps it's this one, in the newer version?
Cool that there are lots of different exogenous variable types one can add!

@jmoralez
We encountered a specific scenario where we actively utilized the 'dynamic_dfs' argument in MLForecast. Our use case involves building multi-regressor time-series models, specifically a recursive XGBoost model. Our data contains one primary y feature to predict, a date column, and two additional regressor columns.

In our approach, we apply lags and lag-transforms to both the primary y column and the additional regressor columns, along with various date features. However, we observed that MLForecast doesn't recursively predict the additional regressors. Consequently, we opted to predict them separately, utilized preprocessing to obtain lagged/transformed values of these newly predicted columns, and then incorporated them into the main forecast model using 'dynamic_dfs'. This ensures their use as time-based external regressors when predicting the primary y.

Given the removal of the 'dynamic_dfs' argument, are there alternative features or methods within MLForecast that could still support our specific use case? Any insights you can provide would be greatly appreciated.

Edit: Digging around the docstrings of the .fit and .predict functions, I've discovered that when fitting you can set static_features=[], and when predicting you can now provide an X_df. The docstring reads as if this can be used to provide future regressor values to the predict function. Is this now the correct way to go about this?

Hey @Markpajr. For the local interface, the dynamic_dfs argument was deprecated in favor of X_df, which takes a single dataframe with the ids, dates, and future values of the exogenous features. So if you were using a single dataframe, like dynamic_dfs=[df], you can now use X_df=df. Also, we recently incorporated a function to compute transformations on exogenous features; you may find this guide useful, since it seems to be what you're currently doing.
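A minimal sketch of that X_df workflow; train_df, future_df, and the regressor column names are illustrative:

# Fit with static_features=[] so the extra columns count as dynamic features,
# then supply their future values over the horizon at predict time.
fcst.fit(train_df, static_features=[])      # 'reg1', 'reg2' become dynamic
preds = fcst.predict(h=12, X_df=future_df)  # future_df: unique_id, ds, reg1, reg2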