possible look-ahead bias in lab-12-5-rnn-stock

goodcheer opened this issue 6 years ago

In the current code, the whole dataset is scaled first and then partitioned into train and test sets,
so future (test set) information is used when scaling past (train set) data.

# Open, High, Low, Volume, Close
xy = np.loadtxt('data-02-stock_daily.csv', delimiter=',')
xy = xy[::-1]  # reverse order (chronologically ordered)
xy = MinMaxScaler(xy)
x = xy
y = xy[:, [-1]]  # Close as label

# build a dataset
dataX = []
dataY = []
for i in range(0, len(y) - seq_length):
    _x = x[i:i + seq_length]
    _y = y[i + seq_length]  # Next close price
#     print(_x, "->", _y)
    dataX.append(_x)
    dataY.append(_y)

# train/test split
train_size = int(len(dataY) * 0.7)
test_size = len(dataY) - train_size
trainX, testX = np.array(dataX[0:train_size]), np.array(
    dataX[train_size:len(dataX)])
trainY, testY = np.array(dataY[0:train_size]), np.array(
    dataY[train_size:len(dataY)])

However, I think it makes more sense to scale the train and test sets separately, since you would not have future data (the test set) available when scaling current and past data (the train set).

In real-life scenarios, we know nothing about future data at the point of making a prediction. If we include data from the testing set to compute the sample mean, we would inadvertently introduce future information into historical data, which would render the prediction useless. comment from
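To make the leak concrete, here is a minimal sketch in plain Python. The prices are made-up numbers, and `min_max_scale` is a simplified stand-in for the lab's scaler, not the repository's actual function. It shows that the scaled training values change depending on whether the maximum is taken from the whole series or from the training period only.

```python
def min_max_scale(values, lo, hi):
    """Scale values to [0, 1] using the given minimum (lo) and maximum (hi)."""
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical closing prices in chronological order (illustrative only).
prices = [10.0, 12.0, 11.0, 14.0, 20.0, 25.0]
train, test = prices[:4], prices[4:]

# Whole-series statistics: the maximum (25.0) comes from the test period.
leaky_train = min_max_scale(train, min(prices), max(prices))

# Training-period statistics only: no future information is used.
clean_train = min_max_scale(train, min(train), max(train))

print("scaled with whole-series min/max:", leaky_train)
print("scaled with train-only min/max:  ", clean_train)
```

With whole-series statistics the training values never reach 1.0, because the scaler already "knows" that higher prices appear later.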

Therefore, I think the whole process should look more like this:

train/test partition -> scale on each set -> build dataset
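The proposed order could be sketched roughly as follows. This is an illustration under assumptions, not the code that was eventually merged: `min_max_scale` and `build_dataset` are hypothetical helper names, the data is synthetic random numbers standing in for the CSV, and `seq_length = 7` is an arbitrary choice.

```python
import numpy as np

def min_max_scale(data):
    """Scale each column to [0, 1] using only this array's own min and max."""
    lo = data.min(axis=0)
    hi = data.max(axis=0)
    return (data - lo) / (hi - lo + 1e-7)  # small constant avoids division by zero

def build_dataset(series, seq_length):
    """Build input windows and next-step targets (last column) from a (time, features) array."""
    data_x, data_y = [], []
    for i in range(len(series) - seq_length):
        data_x.append(series[i:i + seq_length])
        data_y.append(series[i + seq_length, [-1]])  # next value of the last column
    return np.array(data_x), np.array(data_y)

seq_length = 7  # assumed window length
rng = np.random.default_rng(0)
xy = rng.random((100, 5)) * 100  # synthetic stand-in for Open, High, Low, Volume, Close

# 1) train/test partition on the raw, chronologically ordered rows
train_size = int(len(xy) * 0.7)
train_raw, test_raw = xy[:train_size], xy[train_size:]

# 2) scale each set with its own statistics
train_scaled = min_max_scale(train_raw)
test_scaled = min_max_scale(test_raw)

# 3) build the windowed datasets
trainX, trainY = build_dataset(train_scaled, seq_length)
testX, testY = build_dataset(test_scaled, seq_length)

print(trainX.shape, trainY.shape, testX.shape, testY.shape)
```

Because the split happens before scaling, nothing computed from the test rows can influence the scaled training rows. Splitting the raw rows this way also means the first `seq_length` rows of the test set serve only as inputs, never as targets.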

Comparing the results of the two preprocessing approaches:

[Figures: plots for the Original and the Proposed preprocessing]

More details at
lab-12-5-rnn_stock_issue_campare.ipynb

Mo Kweon · Answer 1 · Wed Sep 19 2018 01:36:09 GMT+0800 (China Standard Time)

You're absolutely right.
Just didn't have time to work on that lately.

If you can send a PR, it'd be greatly appreciated.

Kyu Chul Kim · Answer 2 · Wed Sep 19 2018 01:55:30 GMT+0800 (China Standard Time)

Thank you. It is my honor to submit a PR to ZeroToAll!

Kyu Chul Kim · Answer 3 · Wed Sep 19 2018 07:26:27 GMT+0800 (China Standard Time)

I made a mistake when drawing the plots. The original and proposed plots should look almost the same. Still, the RMSE differs.

Mo Kweon · Answer 4 · Sat Sep 29 2018 01:34:45 GMT+0800 (China Standard Time)

Closing as #215 was merged