Where can I download the dataset?

datong-new opened this issue 4 years ago · comments

datong-new commented 4 years ago

Hello, thanks for the wonderful work!

Can you give more details about the dataset? And where can I download the dataset?

Thank you!

Max Cohen · Answer 1 · Sat Jan 18 2020 01:36:31 GMT+0800 (China Standard Time)

Hi,

A sample of the dataset, along with a brief description, is available in csv format as an open data challenge. To use the notebooks, you will have to convert it to npz format or use a custom PyTorch dataset (see the challenge demo repo).

Bests

HuskyLens · Answer 2 · Mon Feb 03 2020 13:10:26 GMT+0800 (China Standard Time)

Hi,
Could you share the npz file? The data structure from the Open Data Challenge seems different from yours.
See the difference:
[Two screenshots, captioned "Yours" and "Origin", were not preserved.]

Max Cohen · Answer 3 · Thu Feb 06 2020 23:07:33 GMT+0800 (China Standard Time)

Hi, I can't share an npz file containing any data other than what was uploaded to the data challenge, as that would go against the rules of the challenge.
The structure of the labels is different, but that shouldn't be an issue if you just want to convert the csv dataset to npz, as the code was written with these possible modifications in mind. Just load the csv with the OzeDataset class, and export R, Z and X using np.savez. You're aiming at this kind of data structure.
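
The export step mentioned above can be sketched as follows. This is only an illustration with placeholder arrays, not the author's conversion code: the sizes are hypothetical, the shapes follow the description of R, Z and X given later in this thread, and an in-memory buffer stands in for a file path.

```python
import io

import numpy as np

# Hypothetical sizes, chosen only for illustration.
n_samples, time_steps = 4, 6
R = np.zeros((n_samples, 3), dtype=np.float32)              # static building characteristics
Z = np.zeros((n_samples, time_steps, 5), dtype=np.float32)  # input time series
X = np.zeros((n_samples, time_steps, 2), dtype=np.float32)  # output time series

# np.savez stores each keyword argument as a named array in a single archive.
# A path such as "dataset.npz" could be passed instead of the buffer.
buffer = io.BytesIO()
np.savez(buffer, R=R, Z=Z, X=X)

# Reading the archive back gives access to the arrays by name.
buffer.seek(0)
archive = np.load(buffer)
shapes = {name: archive[name].shape for name in ("R", "Z", "X")}
print(shapes)
```

In a real conversion, R, Z and X would come from the loaded csv rather than from np.zeros.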

Francis Duan · Answer 4 · Thu Mar 26 2020 06:24:34 GMT+0800 (China Standard Time)

Hi, do you have any code that could convert the csv to npz? I am not sure what we should include in the npz.

Max Cohen · Answer 5 · Fri Apr 03 2020 19:24:39 GMT+0800 (China Standard Time)

Once again, all the needed information is present in the challenge benchmark repo, but to prevent further questions on the dataset I have drafted a function to convert csv to npz.

Daniel @ Krypton · Answer 6 · Fri May 01 2020 15:37:26 GMT+0800 (China Standard Time)

Dear @maxjcohen, I joined challenge 28 and downloaded the following files:

x_train_LsAZgHU.csv
y_train_EFo1WyE.csv
x_test_QK7dVsy.csv

Then I copied the csv2npz script to the utils folder within the project.
Then I created and ran the following Python script in the project's root folder:

from src.utils.csv2npz import csv2npz

csv2npz('datasets/x_train_LsAZgHU.csv', 'datasets/y_train_EFo1WyE.csv')

But unfortunately it errored as can be seen below.

Traceback (most recent call last):
  File "/home/<username>/Workspaces/Python/transformer/generateNpz.py", line 3, in <module>
    csv2npz('datasets/x_train_LsAZgHU.csv', 'datasets/y_train_EFo1WyE.csv')
  File "/home/<username>/Workspaces/Python/transformer/src/utils/csv2npz.py", line 21, in csv2npz
    R = x[labels["R"]].values
  File "/home/<username>/.virtualenvs/.env/lib/python3.7/site-packages/pandas/core/frame.py", line 2806, in __getitem__
    indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
  File "/home/<username>/.virtualenvs/.env/lib/python3.7/site-packages/pandas/core/indexing.py", line 1553, in _get_listlike_indexer
    keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
  File "/home/<username>/.virtualenvs/.env/lib/python3.7/site-packages/pandas/core/indexing.py", line 1646, in _validate_read_indexer
    raise KeyError(f"{not_found} not in index")
KeyError: "['initial_temperature', 'roof_1_thickness_3'] not in index"

Max Cohen · Answer 7 · Sat May 02 2020 16:01:29 GMT+0800 (China Standard Time)

Hi, this error means that the columns "initial_temperature" and "roof_1_thickness_3" are not present in the challenge dataset. Indeed, if you take the original labels.json, these values are not present, because they were not intended to be used in the challenge.

In order to solve your error, I recommend using the original labels file from the benchmark repo.
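
A quick way to spot this kind of mismatch before running the conversion is to compare the labels with the csv header. The sketch below uses only the standard library and assumes that each label maps directly to a column name, as the `x[labels["R"]]` lookup in the traceback suggests. The csv content, the `feature_a` and `feature_b` names, and the `missing_labels` helper are hypothetical.

```python
import csv
import io
import json

# Hypothetical miniature inputs standing in for the challenge csv and a labels file.
csv_text = "index,feature_a,feature_b\n0,12.5,0.3\n"
labels_text = '{"R": ["feature_a", "initial_temperature", "roof_1_thickness_3"]}'

header = next(csv.reader(io.StringIO(csv_text)))
labels = json.loads(labels_text)


def missing_labels(labels, columns):
    """Return, for each label group, the names absent from the csv header."""
    available = set(columns)
    return {
        group: [name for name in names if name not in available]
        for group, names in labels.items()
    }


missing = missing_labels(labels, header)
print(missing)
```

Any name reported here would trigger the same KeyError when used to index the DataFrame.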

Daniel @ Krypton · Answer 8 · Sun May 03 2020 14:39:06 GMT+0800 (China Standard Time)

I created pull request #6 with some improvements I have come up with so far. It might be useful to merge, @maxjcohen; please advise.

Jiangsheng You · Answer 9 · Fri Jan 08 2021 04:58:46 GMT+0800 (China Standard Time)

I am looking at your project and trying to process a different dataset. If convenient, please describe the data format so that I can process data other than the challenge dataset. Thanks.

Max Cohen · Answer 10 · Mon Jan 18 2021 16:32:30 GMT+0800 (China Standard Time)

Hi, there is no particular data format to use with the Transformer besides the input shape specified in the documentation.

We currently handle our data using the OzeDataset class, inherited from PyTorch's Dataset class. As the format here is a bit specific, I encourage you to write your own Dataset inherited class fitting your data, and feed it to the Transformer.
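
A minimal sketch of what such a class could look like is given below. The name `ArrayPairDataset` and the array sizes are hypothetical, and this is not the OzeDataset implementation. In a PyTorch project the class would inherit from torch.utils.data.Dataset, as suggested above; that import is left out so the sketch runs with NumPy alone.

```python
import numpy as np


class ArrayPairDataset:
    """Indexable collection of (input, target) time series pairs.

    Hypothetical class for illustration; a PyTorch version would subclass
    torch.utils.data.Dataset and keep the same two methods.
    """

    def __init__(self, inputs, targets):
        if len(inputs) != len(targets):
            raise ValueError("inputs and targets must contain the same number of samples")
        self._inputs = np.asarray(inputs, dtype=np.float32)
        self._targets = np.asarray(targets, dtype=np.float32)

    def __len__(self):
        # Number of samples in the dataset.
        return len(self._inputs)

    def __getitem__(self, idx):
        # One sample: arrays of shape (time_steps, n_features) and (time_steps, n_outputs).
        return self._inputs[idx], self._targets[idx]


# Placeholder data: 10 samples, 24 time steps, 5 input and 2 output variables.
dataset = ArrayPairDataset(np.zeros((10, 24, 5)), np.ones((10, 24, 2)))
x, y = dataset[3]
print(len(dataset), x.shape, y.shape)
```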

Zijian · Answer 11 · Thu Feb 25 2021 15:30:33 GMT+0800 (China Standard Time)

Hi, thanks for the reference for the helpful data loading function. Just one minor tip here.

The original data loader uses X.values.reshape((m, -1, K)), where m is the number of observations and K is the length of the time series. However, a typical LSTM or Transformer model accepts an input of shape (batch, time series length, num_features), so reshaping to (m, K, -1) is recommended. The same goes for the variable "Z" (I have to point out that the naming is quite confusing at first glance).
X = X.values.reshape((m, K, -1))
Z = Z.values.reshape((m, K, -1))

For the labels.json, I deleted "week" and "light_blabla_mask" (I can't remember the exact name, but the error message alerted me that this index was not found). You can also refer to the data specification on the challenge website, https://challengedata.ens.fr/participants/challenges/28/, to modify your labels.json.

My final input vector size is (8, 672, 18) (8 batches, 672 time-series, 18 features ignoring room-paras.) - 2021 / 2 / 25
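
Whether reshaping directly to (m, K, -1) keeps each time series intact depends on how the csv columns are ordered, which this thread does not state. The sketch below assumes the columns are grouped by variable (all K steps of the first variable, then all K steps of the second). Under that assumption, a reshape to (m, -1, K) followed by a transpose preserves the series, while a direct reshape to (m, K, -1) mixes values from different variables. The sizes and values are hypothetical.

```python
import numpy as np

m, K = 2, 3  # hypothetical: 2 samples, 3 time steps, 2 variables

# Assumed flat layout: K steps of variable 0, then K steps of variable 1.
flat = np.array([
    [0, 1, 2, 10, 11, 12],     # sample 0
    [20, 21, 22, 30, 31, 32],  # sample 1
])

by_variable = flat.reshape((m, -1, K))          # (m, n_vars, K)
time_major = by_variable.transpose((0, 2, 1))   # (m, K, n_vars), series kept intact
reshaped = flat.reshape((m, K, -1))             # (m, K, n_vars), values regrouped

print(time_major[0, :, 0])  # [0 1 2]: variable 0 of sample 0 over time
print(reshaped[0, :, 0])    # [ 0  2 11]: values from both variables interleaved
```

If the columns were instead grouped by time step, the direct reshape to (m, K, -1) would be the correct one, so it is worth checking the header before choosing.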

Max Cohen · Answer 12 · Sat Feb 27 2021 18:31:22 GMT+0800 (China Standard Time)

By default, LSTM in PyTorch accepts a tensor of shape (time series length, batch, num_features); see the docs.

Diego Quintana · Answer 13 · Tue Apr 20 2021 00:52:16 GMT+0800 (China Standard Time)

I managed to get a .npz file using the labels.json from https://raw.githubusercontent.com/maxjcohen/ozechallenge_benchmark/master/labels.json and the code from https://gist.github.com/diegoquintanav/050765be2ff3f4cfcf7c25da645cfcc2

However, in the notebook at https://timeseriestransformer.readthedocs.io/en/latest/notebooks/trainings/training_2020_06_27__164648.html#Load-dataset the dataset used has (I think) 25k rows, whereas the one downloaded from the ozechallenge has 7500:

$ wc -l dataset/x_train_LsAZgHU.csv 
7501 dataset/x_train_LsAZgHU.csv

If I change the splits to dataset_train, dataset_val, dataset_test = random_split(ozeDataset, (5500, 1000, 1000)), I hit an error in the cell that does the training:

[Epoch   1/30]:   0%|          | 0/5500 [00:00<?, ?it/s]

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-20-4b3396332a6c> in <module>
     12 
     13             # Propagate input
---> 14             netout = net(x.to(device))
     15 
     16             # Comupte loss

~/.pyenv/versions/anaconda3-5.3.1/envs/tfm/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    887             result = self._slow_forward(*input, **kwargs)
    888         else:
--> 889             result = self.forward(*input, **kwargs)
    890         for hook in itertools.chain(
    891                 _global_forward_hooks.values(),

~/code/notebooks/transformers/transformer/tst/transformer.py in forward(self, x)
    123 
    124         # Embeddin module
--> 125         encoding = self._embedding(x)
    126 
    127         # Add position encoding

~/.pyenv/versions/anaconda3-5.3.1/envs/tfm/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    887             result = self._slow_forward(*input, **kwargs)
    888         else:
--> 889             result = self.forward(*input, **kwargs)
    890         for hook in itertools.chain(
    891                 _global_forward_hooks.values(),

~/.pyenv/versions/anaconda3-5.3.1/envs/tfm/lib/python3.8/site-packages/torch/nn/modules/linear.py in forward(self, input)
     92 
     93     def forward(self, input: Tensor) -> Tensor:
---> 94         return F.linear(input, self.weight, self.bias)
     95 
     96     def extra_repr(self) -> str:

~/.pyenv/versions/anaconda3-5.3.1/envs/tfm/lib/python3.8/site-packages/torch/nn/functional.py in linear(input, weight, bias)
   1751     if has_torch_function_variadic(input, weight):
   1752         return handle_torch_function(linear, (input, weight), input, weight, bias=bias)
-> 1753     return torch._C._nn.linear(input, weight, bias)
   1754 
   1755 

RuntimeError: mat1 dim 1 must match mat2 dim 0

What is this 'datasets/dataset_57M.npz'? and what are X, R and Z? thanks!

Max Cohen · Answer 14 · Tue Apr 20 2021 15:57:03 GMT+0800 (China Standard Time)

Hi, the dataset from the challenge and the one I'm using in this repo are quite different, which is why the dimensions don't match. If you want to use this Transformer for the challenge, you'll have to make a few adjustments.
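
The "mat1 dim 1 must match mat2 dim 0" error is what a linear layer reports when the last dimension of its input differs from the input width it was built with. The sketch below reproduces the same kind of mismatch with a plain NumPy matrix product, then derives the width from the data instead of hard-coding it. All sizes are hypothetical and are not taken from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

d_input_expected = 38  # hypothetical width the first layer was built for
d_model = 8            # hypothetical embedding size
weight = rng.normal(size=(d_input_expected, d_model))

# Hypothetical batch: (batch, time steps, features), with only 18 features.
batch = rng.normal(size=(4, 10, 18))

try:
    batch @ weight  # 18 features cannot be multiplied by a (38, 8) matrix
except ValueError as err:
    print("shape mismatch:", err)

# Reading the feature count from the data avoids the mismatch.
d_input = batch.shape[-1]
weight = rng.normal(size=(d_input, d_model))
encoding = batch @ weight
print(encoding.shape)  # (4, 10, 8)
```

With the actual model, the equivalent adjustment is to set the input dimension from the shape of the loaded dataset before building the network.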

As for your question about X, R and Z, you can check #28 .

Diego Quintana · Answer 15 · Tue Apr 20 2021 16:48:22 GMT+0800 (China Standard Time)

Hi!, thanks for answering.

Can you tell me more about the differences? For example, what are the shapes of X, R, and Z in dataset_57M.npz? Also, I'm lost when you say that

If you want to use this Transformer for the challenge, you'll have to make a few adjustments.

Is this not what is going on in this repo? In the readme, you say that the dataset used to train this transformer is the one from the challenge, but that does not seem to be the case. Can you tell me more about what are the adjustments needed?

Max Cohen · Answer 16 · Mon May 03 2021 15:18:57 GMT+0800 (China Standard Time)

The variables X, R and Z are specific to the challenge dataset and completely independent of the Transformer model. They simply describe the dataset, which has two inputs instead of the usual one:

R contains the characteristics of the building, which don't change with time, and are concatenated with Z to serve as input. Shape should be (n_samples, n_characteristics).
Z contains the input time series. Shape should be (n_samples, time_steps, n_input_variables).
X contains the output time series. Shape should be (n_samples, time_steps, n_output_variables).
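
One common way to concatenate a static array such as R with a time series such as Z is to repeat the static values at every time step and join the two along the feature axis. The sketch below illustrates this with placeholder arrays; the sizes are hypothetical, and the repo's own code may do this differently.

```python
import numpy as np

n_samples, time_steps = 3, 5  # hypothetical sizes

# R: (n_samples, n_characteristics), Z: (n_samples, time_steps, n_input_variables)
R = np.arange(n_samples * 2, dtype=np.float32).reshape(n_samples, 2)
Z = np.zeros((n_samples, time_steps, 4), dtype=np.float32)

# Repeat each building's characteristics at every time step.
R_tiled = np.repeat(R[:, np.newaxis, :], time_steps, axis=1)  # (3, 5, 2)

# Join the time series and the tiled characteristics along the feature axis.
model_input = np.concatenate([Z, R_tiled], axis=-1)  # (3, 5, 6)
print(model_input.shape)
```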

The original dataset from the challenge has been modified: for instance, some variables were removed from R, some were added to Z, etc. But the content is roughly the same, and should be sufficient for trying out the Transformer. All changes can be found in the labels.json files.

Please keep in mind that the dataset dataset_57M.npz is not available for download.

Inkyu · Answer 17 · Wed May 12 2021 17:51:48 GMT+0800 (China Standard Time)

Thanks to the author for the great intuitions and efforts.

For those who may have issues related to the dataset, you might be able to try this that I slightly modified according to the author's suggestions.
https://github.com/afters-cool/transformer

and dataset
https://github.com/afters-cool/transformer/releases/tag/v0.0.1

You can check some plots resulting from the code above (I don't know whether they are correct or not).
https://github.com/afters-cool/transformer/tree/master/assets

Hope this helped someone.

sarraAyed · Answer 18 · Thu Oct 14 2021 00:17:46 GMT+0800 (China Standard Time)

The challenge dataset contains files named x_train and y_train. Do they complement each other, or is one of them enough?
Also, if my data are already in a csv file, can't I just divide them into train, test and validation sets directly and use them?

Max Cohen · Answer 19 · Fri Oct 15 2021 17:55:50 GMT+0800 (China Standard Time)

Hi, yes, they complement each other: x_train contains the commands (input vectors), while y_train contains the observations (output vectors). You are, of course, free to divide your data however you desire.
In the future, please keep discussions about the challenge in the challenge repo.

Yanfei · Answer 20 · Mon Apr 10 2023 20:37:56 GMT+0800 (China Standard Time)

Thank you for your work!

Zhongxian Men · Answer 21 · Wed May 17 2023 10:16:33 GMT+0800 (China Standard Time)

I am new to Transformer methods. Can the package accept csv files directly instead of .npz files?

Max Cohen · Answer 22 · Fri Jun 02 2023 15:54:47 GMT+0800 (China Standard Time)

In this repo, we define a Transformer model that takes Tensors as input; see the documentation. We present examples loading data as .npz files, but you can load data however you want.
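
As an illustration of loading a csv directly, the sketch below reads a small table with NumPy and reshapes it into (n_samples, time_steps, n_features). The csv content and its column layout (grouped by time step) are hypothetical. The resulting array could then be converted with torch.from_numpy before being passed to a model.

```python
import io

import numpy as np

# Hypothetical csv: one row per sample, columns grouped by time step.
csv_text = "t0_a,t0_b,t1_a,t1_b\n1,2,3,4\n5,6,7,8\n"

# skiprows=1 drops the header line; a file path could replace the StringIO object.
table = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1, dtype=np.float32)

n_samples = table.shape[0]
time_steps, n_features = 2, 2
series = table.reshape((n_samples, time_steps, n_features))
print(series.shape)  # (2, 2, 2)
```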

yyldtc · Answer 23 · Tue Apr 09 2024 10:22:48 GMT+0800 (China Standard Time)

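Could you give a detailed explanation of the dataset part? I have already downloaded the two files, dataset.npz and lable.json, and placed them in the directory, but I still cannot run the code.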

Max Cohen · Answer 24 · Wed Apr 10 2024 23:32:08 GMT+0800 (China Standard Time)

Hi @yyldtc, from what I was able to translate from your message, something is still not working with the dataset. Could you detail the error that you got in a new issue? I'll take a look.