ERROR tune_controller.py:1331 -- Trial task failed for trial _train_tune_793c7e93

Question

LeonTing1010 opened this issue 3 months ago · comments

What happened + What you expected to happen

(_train_tune pid=59932) /Users/leo/web3/LLM/langchain/venv/lib/python3.10/site-packages/ray/tune/integration/pytorch_lightning.py:198: ray.tune.integration.pytorch_lightning.TuneReportCallback is deprecated. Use ray.tune.integration.pytorch_lightning.TuneReportCheckpointCallback instead.
(_train_tune pid=59932) Seed set to 1
2024-05-01 01:27:11,649 ERROR tune_controller.py:1331 -- Trial task failed for trial _train_tune_793c7e93
Traceback (most recent call last):
File "/Users/leo/web3/LLM/langchain/venv/lib/python3.10/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
result = ray.get(future)
File "/Users/leo/web3/LLM/langchain/venv/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
File "/Users/leo/web3/LLM/langchain/venv/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/Users/leo/web3/LLM/langchain/venv/lib/python3.10/site-packages/ray/_private/worker.py", line 2623, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/Users/leo/web3/LLM/langchain/venv/lib/python3.10/site-packages/ray/_private/worker.py", line 861, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::ImplicitFunc.train() (pid=59932, ip=127.0.0.1, actor_id=b48464a8f9278052285d8c3c01000000, repr=_train_tune)
File "/Users/leo/web3/LLM/langchain/venv/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 330, in train
raise skipped from exception_cause(skipped)
File "/Users/leo/web3/LLM/langchain/venv/lib/python3.10/site-packages/ray/air/_internal/util.py", line 98, in run
self._ret = self._target(*self._args, **self._kwargs)
File "/Users/leo/web3/LLM/langchain/venv/lib/python3.10/site-packages/ray/tune/trainable/function_trainable.py", line 45, in <lambda>
training_func=lambda: self._trainable_func(self.config),
File "/Users/leo/web3/LLM/langchain/venv/lib/python3.10/site-packages/ray/tune/trainable/function_trainable.py", line 253, in _trainable_func
output = fn()
File "/Users/leo/web3/LLM/langchain/venv/lib/python3.10/site-packages/ray/tune/trainable/util.py", line 130, in inner
return trainable(config, **fn_kwargs)
File "/Users/leo/web3/LLM/langchain/neuralforecast/neuralforecast/common/_base_auto.py", line 209, in _train_tune
_ = self._fit_model(
File "/Users/leo/web3/LLM/langchain/neuralforecast/neuralforecast/common/_base_auto.py", line 357, in _fit_model
model = model.fit(
File "/Users/leo/web3/LLM/langchain/neuralforecast/neuralforecast/common/_base_multivariate.py", line 537, in fit
return self._fit(
File "/Users/leo/web3/LLM/langchain/neuralforecast/neuralforecast/common/_base_model.py", line 218, in _fit
trainer = pl.Trainer(**model.trainer_kwargs)
File "/Users/leo/web3/LLM/langchain/venv/lib/python3.10/site-packages/pytorch_lightning/utilities/argparse.py", line 70, in insert_env_defaults
return fn(self, **kwargs)
File "/Users/leo/web3/LLM/langchain/venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 431, in __init__
self._callback_connector.on_trainer_init(
File "/Users/leo/web3/LLM/langchain/venv/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/callback_connector.py", line 79, in on_trainer_init
_validate_callbacks_list(self.trainer.callbacks)
File "/Users/leo/web3/LLM/langchain/venv/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/callback_connector.py", line 227, in _validate_callbacks_list
stateful_callbacks = [cb for cb in callbacks if is_overridden("state_dict", instance=cb)]
File "/Users/leo/web3/LLM/langchain/venv/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/callback_connector.py", line 227, in <listcomp>
stateful_callbacks = [cb for cb in callbacks if is_overridden("state_dict", instance=cb)]
File "/Users/leo/web3/LLM/langchain/venv/lib/python3.10/site-packages/pytorch_lightning/utilities/model_helpers.py", line 42, in is_overridden
raise ValueError("Expected a parent")
ValueError: Expected a parent

Versions / Dependencies

Name: neuralforecast
Version: 1.7.1
Summary: Time series forecasting suite using deep learning models
Home-page: https://github.com/Nixtla/neuralforecast/
Author: Nixtla
Author-email: business@nixtla.io
License: Apache Software License 2.0

Reproduction script

Y_hat_df = nf.cross_validation(
    df=Y_train_df,
    val_size=val_size,
    test_size=test_size,
    n_windows=None,
)

Issue Severity

High: It blocks me from completing my task.

Leon · Answer 1 · Wed May 01 2024 01:52:42 GMT+0800 (China Standard Time)

from neuralforecast.auto import AutoTSMixer, AutoTSMixerx
from ray.tune.search.hyperopt import HyperOptSearch
from ray import tune
from neuralforecast.losses.numpy import mse, mae
import matplotlib.pyplot as plt
import pandas as pd

from datasetsforecast.long_horizon import LongHorizon
from neuralforecast.core import NeuralForecast
from neuralforecast.models import TSMixer, TSMixerx, NHITS, MLPMultivariate, iTransformer
from neuralforecast.losses.pytorch import MSE, MAE

# Change this to your own data to try the model

Y_df, X_df, _ = LongHorizon.load(directory='./', group='ETTm2')
Y_df['ds'] = pd.to_datetime(Y_df['ds'])

# X_df contains the exogenous features, which we add to Y_df

X_df['ds'] = pd.to_datetime(X_df['ds'])
Y_df = Y_df.merge(X_df, on=['unique_id', 'ds'], how='left')

# We make validation and test splits

n_time = len(Y_df.ds.unique())
val_size = int(.2 * n_time)
test_size = int(.2 * n_time)
horizon = 96
input_size = 512

tsmixer_config = {
    "input_size": input_size,                       # Size of input window
    "max_steps": tune.choice([500, 1000, 2000]),    # Number of training iterations
    "val_check_steps": 100,                         # Compute validation every x steps
    "early_stop_patience_steps": 5,                 # Early stopping steps
    "learning_rate": tune.loguniform(1e-4, 1e-2),   # Initial learning rate
    "n_block": tune.choice([1, 2, 4, 6, 8]),        # Number of mixing layers
    "dropout": tune.uniform(0.0, 0.99),             # Dropout
    "ff_dim": tune.choice([32, 64, 128]),           # Dimension of the feature linear layer
    "scaler_type": 'identity',
}

tsmixerx_config = tsmixer_config.copy()
tsmixerx_config['futr_exog_list'] = ['ex_1', 'ex_2', 'ex_3', 'ex_4']
modelx = AutoTSMixerx(h=horizon,
                      n_series=7,
                      loss=MAE(),
                      config=tsmixerx_config,
                      num_samples=10,
                      search_alg=HyperOptSearch(),
                      backend='ray',
                      valid_loss=MAE())

nf = NeuralForecast(models=[modelx], freq='15min')
Y_hat_df = nf.cross_validation(df=Y_df, val_size=val_size,
                               test_size=test_size, n_windows=None)
print(nf.models[0].results.get_best_result().config)
y_true = Y_hat_df.y.values
y_hat_tsmixerx = Y_hat_df['AutoTSMixerx'].values

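# Arguments ordered as (actual, predicted); both metrics are symmetric, so the values are unchanged
print(f'MAE TSMixerx: {mae(y_true, y_hat_tsmixerx):.3f}')
print(f'MSE TSMixerx: {mse(y_true, y_hat_tsmixerx):.3f}')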

Olivier Sprangers · Answer 2 · Tue May 07 2024 01:49:42 GMT+0800 (China Standard Time)

Thanks. This is odd: when I run your code, it runs without any issue.

Can you give more details about the machine config (OS, Python) you are using? How are you running this script?
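A short script along these lines is one way to collect those details. It is only a sketch using the standard library: `collect_env_report` is a hypothetical helper name, and the package list is an assumption based on the modules that show up in the traceback, so adjust it as needed.

```python
import platform
import sys
from importlib import metadata

# Distribution names assumed relevant because they appear in the traceback
# or the script; this list is illustrative, not exhaustive.
PACKAGES = ["neuralforecast", "ray", "pytorch-lightning", "lightning", "torch", "hyperopt"]


def collect_env_report(packages=PACKAGES):
    """Return a dict describing the OS, Python, and installed package versions."""
    report = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
        "executable": sys.executable,  # shows which interpreter/venv is in use
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for key, value in collect_env_report().items():
        print(f"{key}: {value}")
```

If both `pytorch-lightning` and `lightning` turn up as installed, that may be worth mentioning in the reply: the traceback ends in a check that decides whether the callback is a `pytorch_lightning` callback, and a callback built on a different package's base class would plausibly fail it.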

If I had to guess, it's a package conflict, so I would create a new virtual environment, install neuralforecast in that environment, and try rerunning the script.
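As a rough sketch of that suggestion, the steps could be scripted with the standard-library `venv` module. The helper names `env_python` and `create_fresh_env`, the directory `.venv-nf-clean`, and the script name `repro.py` are all placeholders, not anything from the neuralforecast docs.

```python
import subprocess
import sys
import venv
from pathlib import Path


def env_python(env_dir):
    """Return the path of the interpreter inside a virtual environment."""
    env_dir = Path(env_dir)
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def create_fresh_env(env_dir, package="neuralforecast", install=True):
    """Create a clean virtual environment and optionally pip-install a package."""
    # clear=True wipes any existing directory so no stale packages survive.
    venv.create(env_dir, with_pip=True, clear=True)
    python = env_python(env_dir)
    if install:
        subprocess.run([str(python), "-m", "pip", "install", package], check=True)
    return python


if __name__ == "__main__":
    python = create_fresh_env(".venv-nf-clean")
    # "repro.py" is a placeholder for the script that triggers the error.
    print(f"Now run: {python} repro.py")
```

Installing only `neuralforecast` and letting pip resolve its dependencies keeps the environment free of anything left over from other projects, which is the point of the exercise.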

github-actions · Answer 3 · Fri Jun 07 2024 12:01:11 GMT+0800 (China Standard Time)

This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one.