rajatsen91 / deepglo

This repository contains code for the paper: https://arxiv.org/abs/1905.03806. It also contains scripts to reproduce the results in the paper.

how to properly preprocess the raw data?

shane-huang opened this issue · comments

Hey guys, really impressive work and thanks for sharing the code.

We're trying to use DeepGLO on datasets other than the four used in the paper and got stuck at the preprocessing stage. It would be great if you could share a specification or scripts describing how the raw data from the public datasets used in the paper was preprocessed.

It seems there is a large difference between the original data and the processed data (e.g. electricity.npy). For example, I downloaded the raw electricity data from https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014 and resampled it and filled missing values as follows.

# raw_data: DataFrame loaded from the UCI LD2011_2014.txt file, indexed by timestamp
df = raw_data.resample('1H', label='left', closed='right').sum()
df.fillna(0, inplace=True)

The last 10 data points of the first series, "MT_001", in the original dataset look like this:

2014-09-07 14:00:00    63.451777
2014-09-07 15:00:00    60.913706
2014-09-07 16:00:00    58.375635
2014-09-07 17:00:00    62.182741
2014-09-07 18:00:00    77.411168
2014-09-07 19:00:00    36.802030
2014-09-07 20:00:00    13.959391
2014-09-07 21:00:00    46.954315
2014-09-07 22:00:00    65.989848
2014-09-07 23:00:00    65.989848

On the other hand, the last 10 data points of the first series in electricity.npy look like this. Apparently the values are very different from the original time-series values.

array([3.8071, 3.8071, 5.0761, 6.3452, 6.3452, 7.6142, 7.6142, 7.6142,
       7.6142, 7.6142])

Maybe I've missed something here...
It would be really helpful if you could share how electricity.npy is produced from the raw data above.

Thanks for the question. For the electricity and traffic datasets, we use the processed versions of the datasets from the TRMF paper: https://www.cs.utexas.edu/~rofuyu/papers/tr-mf-nips.pdf

You can look at Appendix A of the above paper to get an idea of the preprocessing used. In general, if you represent a multivariate time-series dataset as an n x t matrix, with rows as different time series and columns as time points, you should be able to use the code. Please set the "freq" parameter to reflect whether the data is daily, hourly, etc.

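To make the expected input layout concrete, here is a minimal sketch (not the authors' exact pipeline; the output filename my_dataset.npy is just a placeholder) that turns the hourly-resampled UCI electricity data into the n x t matrix described above and saves it as a .npy file:

import numpy as np
import pandas as pd

# Load the raw UCI file (semicolon-separated, comma as decimal mark),
# resample to hourly totals and fill gaps with zeros, as in the snippet above.
df = pd.read_csv("LD2011_2014.txt", sep=";", decimal=",",
                 index_col=0, parse_dates=True)
df = df.resample("1H", label="left", closed="right").sum()
df.fillna(0, inplace=True)

# Transpose so that rows are the different time series and columns are time points.
Y = df.to_numpy().T          # shape: (n_series, n_timepoints)
np.save("my_dataset.npy", Y)

# When running the code on this file, set the "freq" parameter to match
# the sampling interval (hourly here).

Note that this sketch will not reproduce the exact values in electricity.npy, since those follow the TRMF preprocessing described in Appendix A of the paper linked above.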

Thanks so much for the guidance. :) Again, it's really great work.