hunkim / DeepLearningZeroToAll

TensorFlow Basic Tutorial Labs

Home Page: https://www.youtube.com/user/hunkims


possible look-ahead bias in lab-12-5-rnn-stock

goodcheer opened this issue · comments

In the current code, the whole dataset is scaled first and then partitioned into train and test sets,
so future (test set) information is used when scaling past (train set) data.

# Open, High, Low, Volume, Close
xy = np.loadtxt('data-02-stock_daily.csv', delimiter=',')
xy = xy[::-1]  # reverse order (chronologically ordered)
xy = MinMaxScaler(xy)
x = xy
y = xy[:, [-1]]  # Close as label

# build a dataset
dataX = []
dataY = []
for i in range(0, len(y) - seq_length):
    _x = x[i:i + seq_length]
    _y = y[i + seq_length]  # Next close price
#     print(_x, "->", _y)
    dataX.append(_x)
    dataY.append(_y)

# train/test split
train_size = int(len(dataY) * 0.7)
test_size = len(dataY) - train_size
trainX, testX = np.array(dataX[0:train_size]), np.array(
    dataX[train_size:len(dataX)])
trainY, testY = np.array(dataY[0:train_size]), np.array(
    dataY[train_size:len(dataY)])

However, I think it makes more sense to scale the train and test sets separately, since future data (the test set) would not be available when scaling current and past data (the train set).

In real-life scenarios, we know nothing about future data at the point of making a prediction. If we include data in the testing set to compute the sample mean, we would inadvertently introduce future information into historical data, which would render the prediction useless. comment from
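To make the leak described above concrete, here is a minimal sketch with made-up prices. The `min_max_scale` helper and the toy `prices` array are assumptions for illustration only; they are not taken from the lab code.

```python
import numpy as np

def min_max_scale(data, lo, hi):
    """Scale data to [0, 1] given a minimum and maximum (hypothetical helper)."""
    return (data - lo) / (hi - lo + 1e-7)

# Toy closing prices in chronological order; the last three play the test set.
prices = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 13.0, 20.0, 22.0, 25.0])
train, test = prices[:7], prices[7:]

# Scaling before the split: min/max come from the whole series, test included.
leaky_train = min_max_scale(train, prices.min(), prices.max())

# Scaling after the split: min/max come from the training portion only.
clean_train = min_max_scale(train, train.min(), train.max())

print(leaky_train.max())  # well below 1: the future maximum compresses the past
print(clean_train.max())  # close to 1: only past data set the range
```

The scaled training values differ between the two cases only because the first one has seen the test set's maximum, which is exactly the information that would be unavailable at prediction time.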

Therefore, I think the whole process should rather be:

  • train/test partition -> scale on each set -> build dataset
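The proposed order above could look roughly like the following sketch. It assumes a column-wise `min_max_scale` as a stand-in for the lab's `MinMaxScaler`, a hypothetical `build_dataset` helper, and synthetic data in place of `data-02-stock_daily.csv`; it is not the code from the merged PR.

```python
import numpy as np

def min_max_scale(data):
    """Column-wise min-max scaling to [0, 1] (stand-in for the lab's MinMaxScaler)."""
    lo = np.min(data, axis=0)
    hi = np.max(data, axis=0)
    return (data - lo) / (hi - lo + 1e-7)

def build_dataset(series, seq_length):
    """Slide a window over the series; the label is the next row's last column."""
    xs, ys = [], []
    for i in range(len(series) - seq_length):
        xs.append(series[i:i + seq_length])
        ys.append(series[i + seq_length, [-1]])
    return np.array(xs), np.array(ys)

seq_length = 7
rng = np.random.default_rng(0)
xy = rng.random((100, 5)).cumsum(axis=0)  # synthetic placeholder for the CSV data

# 1. train/test partition on the raw, chronologically ordered rows
train_size = int(len(xy) * 0.7)
train_raw, test_raw = xy[:train_size], xy[train_size:]

# 2. scale each set using only its own statistics
train_scaled = min_max_scale(train_raw)
test_scaled = min_max_scale(test_raw)

# 3. build the windowed datasets from each scaled set
trainX, trainY = build_dataset(train_scaled, seq_length)
testX, testY = build_dataset(test_scaled, seq_length)

print(trainX.shape, trainY.shape)
print(testX.shape, testY.shape)
```

Because the split happens before windowing, no input window straddles the train/test boundary. A common variant, not shown here, reuses the training set's minimum and maximum to scale the test set instead of computing them from the test set.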

Comparing the results of the two preprocessing approaches

  1. Original

[Figure: original]

  2. Proposed

[Figure: proposed]

More details at
lab-12-5-rnn_stock_issue_campare.ipynb

You're absolutely right.
Just didn't have time to work on that lately.

If you can send a PR, it'd be greatly appreciated.

Thank you. It is my honor to send a PR to ZeroToAll!

I made a mistake when drawing the plots. The original and proposed plots should look almost the same. Still, the RMSE differs.

Closing as #215 was merged