MuhammadTaha / Predictive-Analysis

Input data for RNNs

nielsrolf opened this issue · comments

For regression models we can extract features for a given row and have (row_features, label) tuples that are independent of each other.
For RNNs, we need time series data. We need to implement the method Data._get_time_series(store_id), and in Data._prepare_time_series() we need to split the data into train, validation and test sets. I am not sure how the data should be split; my guess was to split by store (we can get one time series for each store). If that is the case, we have to think of a consistent way to fetch the test and validation data sets. For non-time-series data, it simply returns one huge (X, y) pair.
As the skeleton is implemented at the moment, Data.next_batch returns one complete, randomly chosen time series. Maybe this also needs to be changed.
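
A minimal sketch of the split-by-store idea, assuming each store id maps to exactly one time series. The function name split_store_ids, the fractions, and the seed are hypothetical and not part of the existing Data class; the point is only that every store lands in exactly one set and that the split is the same on every run.

```python
import random


def split_store_ids(store_ids, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Assign each store (one whole time series) to exactly one split.

    Hypothetical helper: names and default fractions are assumptions.
    """
    ids = sorted(store_ids)            # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(ids)   # fixed seed gives the same split on every run
    n_test = int(len(ids) * test_fraction)
    n_val = int(len(ids) * val_fraction)
    test_ids = ids[:n_test]
    val_ids = ids[n_test:n_test + n_val]
    train_ids = ids[n_test + n_val:]
    return train_ids, val_ids, test_ids


train_ids, val_ids, test_ids = split_store_ids(range(1, 101))
```

Because the shuffle is seeded, the validation and test stores can be fetched consistently across runs without storing the split anywhere.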

Currently, we have a class DataExtraction that reads the csv data and extracts features from it. You can ask it to extract certain rows.
For the linear regressor, I wrote a class Data. You can rename this class to FeedForwardData and write a new implementation that works with your LSTM. This class has the following functionalities:

  • Split the data into test/validation/train data
  • next_train_batch() gives the next train batch. This should now always return a list of data for only one store, and for consecutive dates. As I mentioned in #17, your function should also add the sales of the last days as a feature, so it will be next_train_batch(forecaster)
  • Validation and test data can no longer be passed as a single batch, because the network can only process one time series at a time. The solution that comes to my mind is to implement
    validation_batches(forecaster), which gives a list of time series like [(X_1, y_1), (X_2, y_2), ...]
    Please document somewhere how to process the whole validation and test data set. Maybe implement a method in the abstract forecaster that takes the output of your validation_batches method as input and returns the validation accuracy.
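
One way the evaluation method could look, as a rough sketch only. The class AbstractForecaster, its predict and score methods, and the toy subclass are all hypothetical names, and mean absolute error stands in for "validation accuracy" as an assumption; the actual metric is up to the project.

```python
class AbstractForecaster:
    """Hypothetical base class; the real abstract forecaster may differ."""

    def predict(self, X):
        raise NotImplementedError

    def score(self, batches):
        """Mean absolute error over all time steps of all series.

        batches is a list like [(X_1, y_1), (X_2, y_2), ...], one entry
        per time series, processed one series at a time.
        """
        total_error = 0.0
        total_steps = 0
        for X, y in batches:
            y_pred = self.predict(X)
            total_error += sum(abs(p - t) for p, t in zip(y_pred, y))
            total_steps += len(y)
        return total_error / total_steps


class FirstFeatureForecaster(AbstractForecaster):
    """Toy forecaster for illustration: predicts the first feature of each row."""

    def predict(self, X):
        return [row[0] for row in X]


batches = [
    ([[1.0], [2.0]], [1.0, 3.0]),  # series 1: two time steps
    ([[4.0]], [6.0]),              # series 2: one time step
]
mae = FirstFeatureForecaster().score(batches)
```

Weighting by time steps rather than by series means long series count more; averaging per series first would be an equally defensible choice.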

Basically, the current Data class should be replaced by something that always returns time series.
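
To make "always returns time series" concrete, here is a small sketch of building one store's series with the sales of the previous days appended as features. The function build_time_series, the (date, features, sales) row layout, and the n_lags parameter are assumptions for illustration and are not taken from DataExtraction; it also assumes no dates are missing within a store.

```python
def build_time_series(rows, n_lags=2):
    """Build (X, y) for ONE store, ordered by date.

    rows: list of (date, features, sales) tuples, dates as ISO strings
    (hypothetical layout). Each X row is the day's features followed by
    the sales of the previous n_lags days, most recent first.
    """
    rows = sorted(rows, key=lambda r: r[0])  # ISO date strings sort chronologically
    X, y = [], []
    # The first n_lags days have no complete history, so they are skipped.
    for i in range(n_lags, len(rows)):
        _, features, sales = rows[i]
        lagged = [rows[i - k][2] for k in range(1, n_lags + 1)]
        X.append(list(features) + lagged)
        y.append(sales)
    return X, y


rows = [
    ("2015-01-03", [1], 30.0),
    ("2015-01-01", [1], 10.0),
    ("2015-01-02", [0], 20.0),
    ("2015-01-04", [1], 40.0),
]
X, y = build_time_series(rows, n_lags=2)
```

This uses the true past sales as lag features. If the lags should instead come from the forecaster's own earlier predictions, that would explain the forecaster argument to next_train_batch, but that is a guess.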

Also, please add constants to the class that give the index of each feature, e.g. if store_is_open is at column index 8 in a batch, save store_is_open = 8. Then we can access certain features as X[:,Data.store_is_open]. This is useful for #20
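
A small sketch of the index constants. The column layout shown here is made up for the example; the real order depends on what DataExtraction produces. The example uses plain nested lists with a hypothetical column helper so it runs without dependencies; with a NumPy array the same constant would be used as X[:, Data.store_is_open].

```python
class Data:
    # Hypothetical column layout; the real indices depend on DataExtraction.
    day_of_week = 0
    promo = 1
    store_is_open = 2


def column(X, index):
    """Return one feature column from a batch stored as a list of rows."""
    return [row[index] for row in X]


X = [
    [3, 1, 1],  # day_of_week=3, promo=1, store_is_open=1
    [4, 0, 0],  # day_of_week=4, promo=0, store_is_open=0
]
open_flags = column(X, Data.store_is_open)
```

Keeping the indices as class attributes means a change in column order only has to be updated in one place.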

I have implemented the two functions, but for the second one, what do you want y to contain?

You can see the code in the lstm branch