train/val/test datasets splitting

Question

simona-0 opened this issue 2 months ago

First off, thank you for your proposed model. I am working on a uni project based on your work; the goal is to fine-tune Lag-Llama on a custom dataset. I have some questions about the train/val/test splitting. The dataset I'm working with contains ~1000 trajectories from the same sensor (stored in the columns of the df). Their lengths vary, so the shorter ones are padded with NaN.

The splitting used in the zero-shot notebook:

train_data = [{"start": df.index[0], "target": df[i].values[:-prediction_length]} for i in df.columns] 
test_data = [{"start": df.index[0], "target": df[i].values} for i in df.columns]

I don't fully understand how the train and test datasets can come from the same data frame. What I have in mind is a split like the following, where 70% of the time series are used for training and 30% for testing:

columns = df.columns  # one column per trajectory
train_df = df[columns[:int(0.7 * len(columns))]]
test_df = df[columns[int(0.7 * len(columns)):]]
train_data = [{"start": train_df.index[0], "target": train_df[i].values} for i in train_df.columns]
test_data = [{"start": test_df.index[0], "target": test_df[i].values} for i in test_df.columns]

Also, would K-fold cross-validation be applicable to the model? Thank you.

Arjun Ashok · Answer 1 · Mon Jun 03 2024 02:12:00 GMT+0800 (China Standard Time)

Hi,

Thanks for the kind words.

By ~1000 trajectories of the same sensor, do you mean you have just 1 time series in your dataset?

As for the splitting strategy, what we propose is just one way to split the data. In typical time series prediction, the splits are made across time. You could also make several such splits at slightly different points in time, which is called backtesting. That is a well-established cross-validation technique for time series.
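To make the backtesting idea concrete, here is a minimal sketch of splitting one series at several cut points in time. The helper name `backtest_splits` and its parameters are hypothetical and not part of Lag-Llama or the notebook. It only assumes the series supports `len` and slicing, so it works on a plain list or on `df[i].values`.

```python
def backtest_splits(target, prediction_length, num_windows):
    """Return (history, future) pairs for rolling-origin backtesting.

    Window 0 holds out the last `prediction_length` points, window 1 the
    `prediction_length` points before those, and so on.
    """
    splits = []
    for k in range(num_windows):
        end = len(target) - k * prediction_length
        cut = end - prediction_length
        if cut <= 0:
            break  # not enough history left for another window
        splits.append((target[:cut], target[cut:end]))
    return splits


series = list(range(10))
for history, future in backtest_splits(series, prediction_length=2, num_windows=3):
    # The model would be given `history` and scored against `future`.
    print(len(history), future)
```

Each window is evaluated only on points that come after its history, so the model is never scored on values it could have seen in that window.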

You could also split your data across series as you propose.

Ultimately, your split should reflect the goal of training the model: if you want it to generalize to new series, you should use the split you propose. If you want it to generalize to unseen (future) dates of the same series, you should use the one we use.
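The two strategies can be compared side by side in a small sketch. All helper names here (`strip_trailing_nans`, `split_across_time`, `split_across_series`) are hypothetical. The sketch also assumes the NaN padding mentioned in the question sits at the end of each column. If so, slicing off the last `prediction_length` values of a padded column would remove padding rather than real observations, so the padding is stripped first.

```python
import math


def strip_trailing_nans(values):
    """Drop NaN padding at the end of a series (assumes padding is trailing)."""
    end = len(values)
    while end > 0 and math.isnan(values[end - 1]):
        end -= 1
    return values[:end]


def split_across_time(series_by_name, prediction_length):
    """Same series in both sets; only the test set contains the held-out tail."""
    train, test = [], []
    for values in series_by_name.values():
        values = strip_trailing_nans(values)
        train.append(values[:-prediction_length])
        test.append(values)
    return train, test


def split_across_series(series_by_name, train_fraction=0.7):
    """Disjoint sets of series; each series keeps its full length."""
    names = list(series_by_name)
    n_train = int(train_fraction * len(names))
    train = [strip_trailing_nans(series_by_name[n]) for n in names[:n_train]]
    test = [strip_trailing_nans(series_by_name[n]) for n in names[n_train:]]
    return train, test


nan = float("nan")
data = {
    "a": [1.0, 2.0, 3.0, nan],
    "b": [4.0, 5.0, nan, nan],
    "c": [6.0, 7.0, 8.0, 9.0],
}
time_train, time_test = split_across_time(data, prediction_length=1)
series_train, series_test = split_across_series(data, train_fraction=0.7)
```

Each resulting list entry could then be wrapped as `{"start": ..., "target": ...}` in the same way as in the notebook snippet quoted in the question.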

junbo.l · Answer 2 · Mon Jun 03 2024 05:16:40 GMT+0800 (China Standard Time)

Hi, thank you for your quick yet detailed reply, I really appreciate it. The dataset at hand consists of ~1000 trajectories of this kind, all from the same sensor, so it's not really one single time series throughout. Also, as you can see, there is hardly any cyclic pattern; only upward or downward trends can be spotted.

I first tried the split you suggested in the zero-shot notebook, but the results show that the model failed to predict the trends. I guess this is because the model never got to see the later stage of the signal, since it was hidden during training. I will now experiment with other splitting strategies. Thanks again!