Machine Learning Nanodegree

Capstone Project

Project: Stock Price Prediction

Disclaimer: all historical stock price data were downloaded from Yahoo Finance.

Disclaimer: lstm.py was provided as part of the project files.

Definition

Problem Statement

As stated in the “Problem Statement” of the Capstone project description, the task is to build a predictor that uses historical data from online sources to try to predict future prices. The only input to the ML model at prediction time is the date range, and nothing else. The predicted prices are then compared against the available prices for the same date range in the testing period.

Metrics

The metric used for this project is the R^2 score between the actual prices in the testing period and the prices predicted by the model for the same period.

Another, more indicative, metric that could be used is the percent difference between real and predicted prices, averaged in absolute value. However, for machine learning purposes (training and testing), R^2 scores are the more reliable measure.
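To make the R^2 metric concrete, here is a minimal sketch on made-up numbers. The helper name r2_score_manual and the sample values are assumptions for illustration only. It shows why a model can score below zero on the testing period: R^2 becomes negative as soon as the predictions are worse than simply predicting the mean of the actual prices.

```python
import numpy as np

def r2_score_manual(actual, predicted):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)      # error of the model
    ss_tot = np.sum((actual - actual.mean()) ** 2)  # error of "always predict the mean"
    return 1.0 - ss_res / ss_tot

# Made-up values, for illustration only
actual = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r2_score_manual(actual, actual)                       # exact predictions
mean_only = r2_score_manual(actual, np.full(4, actual.mean()))  # constant mean prediction
biased = r2_score_manual(actual, actual + 3.0)                  # right shape, wrong level

print(perfect, mean_only, biased)
```

A prediction that follows the shape of the prices but sits at the wrong level (the "biased" case) is penalized heavily, which is the typical situation when a trend line is extrapolated into the testing period.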

Analysis

Data Exploration

First, let's explore the data by downloading stock prices for Google.

For that purpose, I have built a special class called StockRegressor, that has the ability to download and store the data in a Pandas DataFrame.

The first step is to import the class, along with the other libraries used in this notebook.

%matplotlib inline

import numpy as np
np.random.seed(0)

import time
import datetime
from calendar import monthrange
import pandas as pd
from IPython.display import display 
from IPython.display import clear_output

from statsmodels.tsa.arima_model import ARIMA
from sklearn.metrics import mean_squared_error

import warnings
warnings.filterwarnings('ignore')

from StockRegressor import StockRegressor
from StockRegressor import StockGridSearch

import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (15,8)

# the numpy seed is initialized above so that we get reproducible results, especially with Keras

The First StockRegressor Object

Getting our first historical price data batch ...

After downloading the prices from the Yahoo Finance web services, the StockRegressor instance below saves the historical prices into the pricing_info DataFrame. As a first processing step, the index of the DataFrame is changed from 'dates' to 'timeline', which is an integer index.

The reason is that an integer index is easier to process: the dates correspond to trading dates and are not sequential, since they do not include weekends or holidays. This can be seen below in the gap between 02 Sep 2016 and 06 Sep 2016, which corresponds to the Labor Day long weekend (Monday, 05 Sep 2016).

Note: There might be a bug in the Pandas library that causes an intermittent error with the Yahoo Finance web call. The bug can be traced to the file /anaconda/envs/your_environment/lib/python3.5/site-packages/pandas/core/indexes/datetimes.py, at line 1050, where this line causes the error: "if this.freq is None:". Another if condition should be inserted before it to test for the "freq" attribute, such as: "if hasattr(this, 'freq'):"

Note: The fixed datetimes.py file is included with the submission

stock = StockRegressor('GOOG', dates= ['2014-10-01', '2016-04-30'])
display(stock.pricing_info[484:488])

Getting pricing information for GOOG for the period 2014-10-01 to 2016-09-27
Found a pricing file with wide range of dates, reading ... Stock-GOOG-1995-12-27-2017-09-05.csv

	Open	High	Low	Close	Adj Close	Volume	dates	timeline
timeline
484	769.250000	771.020020	764.299988	768.780029	768.780029	925100	2016-09-01	484
485	773.010010	773.919983	768.409973	771.460022	771.460022	1072700	2016-09-02	485
486	773.450012	782.000000	771.000000	780.080017	780.080017	1442800	2016-09-06	486
487	780.000000	782.729980	776.200012	780.349976	780.349976	893700	2016-09-07	487

stock.adj_close_price['dates'].iloc[stock.testing_end_date]

Timestamp('2016-07-13 00:00:00')

The Impact of the 'Volume' Feature

The next step would be to eliminate all the columns that are not needed. The columns 'Open', 'High', 'Low', 'Close' will all be discarded, because we will be working with the 'Adj Close' prices only.

For 'Volume', let's explore the relevance below.

From the table and graph below, we conclude that Volume has very little correlation with prices (a correlation coefficient of about -0.065), so we will drop it from the discussion from now on.

There might be evidence of some correlation between spikes in Volume and abrupt changes in prices. That would be logical, since higher trading volumes might lead to larger price fluctuations. However, these spikes in volume happen on the same day as the changes in prices, and so have little predictive power. This might be a topic for future exploration.

from sklearn.preprocessing import MinMaxScaler

scaler_volume = MinMaxScaler(copy=True, feature_range=(0, 1))
scaler_price = MinMaxScaler(copy=True, feature_range=(0, 1))

prices = stock.pricing_info.copy()

prices = prices.drop(labels=['Open', 'High', 'Low', 'Close', 'dates', 'timeline'], axis=1)

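# Series.reshape is not available in recent pandas versions, so reshape the underlying NumPy arrays
volume = prices['Volume'].values.reshape(-1, 1)
adj_close = prices['Adj Close'].values.reshape(-1, 1)

scaler_volume.fit(volume)
scaler_price.fit(adj_close)

# scale both columns to the [0, 1] range and flatten back to one dimension
prices['Volume'] = scaler_volume.transform(volume).ravel()
prices['Adj Close'] = scaler_price.transform(adj_close).ravel()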

print("\nCorrelation between Volume and Prices:")
display(prices.corr())
prices.plot(kind='scatter', x='Adj Close', y='Volume')

Correlation between Volume and Prices:

	Adj Close	Volume
Adj Close	1.00000	-0.06493
Volume	-0.06493	1.00000


Exploratory Visualization

Now let's explore the historical pricing. For that purpose, we have built two special-purpose plotting functions into the StockRegressor class.

The first plotting function will show the "learning_df" DataFrame. This is the dataframe that will be used to store all "workspace" data, i.e. dates, indexes, prices, predictions of multiple algorithms.

The second plotting function which will be less frequently used is a function that plots prices with the Bollinger bands. This is for pricing exploration only.

Below, we call those two functions. As we haven't trained the StockRegressor yet, the plot_learning_data_frame() function will show the learning_df dataframe with only the pricing, plus a vertical red line that marks the end of the training period at a cutoff date. The testing period, in which the various algorithms make their predictions, starts after that date.

This "cutoff" date corresponds to the end date supplied in the StockRegressor constructor. The StockRegressor instance will also make sure to get a few days ahead of the end of the testing phase, just for plotting purposes, to see the trend of the prices even beyond the testing period.
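For readers unfamiliar with Bollinger bands, the following is a minimal sketch of how such bands are commonly computed with pandas. It is not the code of plot_bollinger_bands(): the helper name bollinger_bands, the band width of two standard deviations, and the synthetic price series are all assumptions. The 60-day window is chosen to match the 'Rolling Mean-60' column that appears in learning_df.

```python
import numpy as np
import pandas as pd

def bollinger_bands(prices, window=60, n_std=2.0):
    """Rolling mean with an upper and lower band at +/- n_std rolling standard deviations."""
    rolling = prices.rolling(window=window)
    rolling_mean = rolling.mean()
    rolling_std = rolling.std()
    return pd.DataFrame({
        'Adj Close': prices,
        'Rolling Mean': rolling_mean,
        'Upper Band': rolling_mean + n_std * rolling_std,
        'Lower Band': rolling_mean - n_std * rolling_std,
    })

# Synthetic, steadily rising prices (illustration only)
prices = pd.Series(100.0 + np.arange(120, dtype=float))
bands = bollinger_bands(prices, window=60)
print(bands.tail())
```

The first window - 1 rows of the rolling columns are NaN, because a full window of prices is needed before the first mean can be computed.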

stock.plot_learning_data_frame()
stock.plot_bollinger_bands()

Features to be Used for the Prediction

Since a stock price is really a time series, there are not many features that could be used for predictions and for training the ML models. In fact, all there is to feed the ML model for prediction is the date.

Now for advanced techniques, some ML algos can take advantage of the stock news (ML sentiment analysis), to learn if the news is positive or negative, and how it will affect pricing, maybe using Reinforcement Learning.

Another area of exploration for ML models, is the "Fundamental Analysis" valuation techniques of the company stocks. With the fundamental analysis valuation, an analyst can deduce how the stock should be priced in an ideal world - given past cash flow of dividends, or capital gains / price appreciation in the case of no dividends. Therefore, the analyst can deduce if a stock is currently under-priced or over-priced given its valuation. This categorization or "fundamental price" can then be used as an input feature, in addition to the date input feature for future prediction.

In this project, I will not be exploring news or fundamental analysis. All that will be done is to predict the future prices based solely on the date. That means that this ML model will have only one feature which is the date (or an integer timeline index).

Algorithms and Techniques

In this section, we will be discussing the algorithms and techniques used for forecasting the stock pricing.

Given that we have only one input feature for prediction, which is the date, the following algorithms and techniques have been implemented in StockRegressor:

Linear Regression of polynomial order: regressions of multiple polynomial orders (1, 2, 3, etc.) can be trained in this model. The polynomial order and the number of regressions are parametrized, so multiple regressions of differing polynomial orders can be run at the same time.
Recurrent Neural Network: a simple RNN has been implemented, to see how it compares to the linear regressions. The concept behind the RNN is that, if the stock displays a certain "pattern" in the time series, then the RNN will learn it, remember it, and use it for prediction. In this project however, and since the ML Nanodegree does not cover RNNs, I tried to keep this at a basic introductory level, to be explored in future work. I have used the basic publicly listed model created by Siraj Raval, with the reference included below.
Market Momentum: stock prices seem to be a function of the overall trend of the stock (i.e. a regression over a long period, like the last year or two), as well as the market momentum for that stock, which spans only the last few days (10 - 30 days range). Therefore a regression over only the last few days before the first prediction/testing date will indicate where the market is moving, and in which direction the momentum is. A combination of the two regressions (long-term regression and momentum) should therefore give a better prediction.
Fast Fourier Transform: FFT is implemented here as an exploratory technique, to see if the stock prices display some harmonics that can be used to "lock in" on the price fluctuation. The FFT is applied to the de-trended data (real prices minus the regression), and the prediction is re-constructed in a similar manner (using the FFT harmonics to construct the predicted fluctuation, and then adding the underlying trend, or regression, back to give the total predicted price).
Grid Search: grid search is used to optimize the hyper-parameters of the StockRegressor. There are lots of parameters that could be fine-tuned, such as how many regressions of different orders to use, or how many days to use for the momentum regression.

Methodology

Data Preprocessing

What we do for preprocessing is minimal, since we only have one input feature which is the date. For that we create an integer "index" out of the dates, that is really just an incrementing counter, and we call it "timeline".

The next step would be to eliminate all the columns that are not needed. The columns 'Open', 'High', 'Low', 'Close' will all be discarded, because we will be working with the 'Adj Close' prices only, and store the result into a DataFrame called 'adj_close_price'. The 'adj_close_price', and the 'learning_df' are the two main dataframes which will be used all throughout the StockRegressor class.
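The preprocessing described above can be sketched with pandas as follows. This is only an illustration, not the StockRegressor code: the small raw DataFrame is built by hand from four of the GOOG rows displayed earlier, and the variable names raw and gaps are assumptions.

```python
import pandas as pd

# Four GOOG rows from the pricing table shown earlier (other columns omitted)
raw = pd.DataFrame({
    'dates': pd.to_datetime(['2016-09-01', '2016-09-02', '2016-09-06', '2016-09-07']),
    'Close': [768.780029, 771.460022, 780.080017, 780.349976],
    'Adj Close': [768.780029, 771.460022, 780.080017, 780.349976],
})

# Keep only the adjusted close and the dates, ordered chronologically
adj_close_price = raw[['Adj Close', 'dates']].sort_values('dates').reset_index(drop=True)

# The "timeline" is just an incrementing counter over trading days
adj_close_price['timeline'] = range(len(adj_close_price))
adj_close_price = adj_close_price.set_index('timeline', drop=False)

# Calendar gaps between consecutive trading days (weekends and holidays show up as > 1)
gaps = adj_close_price['dates'].diff().dt.days
print(adj_close_price)
print(gaps)
```

The integer timeline advances by one per trading day even when the calendar gap is four days, which is what makes it convenient as a regression input.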

Below is a glimpse of 'adj_close_price' ...

display(stock.adj_close_price.head())

	Adj Close	dates	timeline
timeline
0	566.714111	2014-10-01	0
1	568.519104	2014-10-02	1
2	573.704895	2014-10-03	2
3	575.769226	2014-10-06	3
4	562.196472	2014-10-07	4

Let's see how 'learning_df' looks before training ...

Note: In the 'learning_df' dataframe, you can see that there's a column called 'Rolling Mean-60'. This has been generated and used by the Bollinger graph plotting function, and is not really used in the ML models.

display(stock.learning_df.head())

	Adj Close	dates	timeline	Rolling Mean-60
timeline
0	566.714111	2014-10-01	0	537.17319
1	568.519104	2014-10-02	1	537.17319
2	573.704895	2014-10-03	2	537.17319
3	575.769226	2014-10-06	3	537.17319
4	562.196472	2014-10-07	4	537.17319

Implementation

Exploring Recurrent Neural Networks

Let's first explore the Recurrent Neural Network ML algorithm. This model has been created by Siraj Raval, and has been replicated below to forecast stock prices. Please note that the exploration of RNNs here is very superficial, since I'm also enrolled in the Deep Learning Nanodegree, and haven't yet covered RNNs.

The model consists of two layers of LSTM nodes: the first has 50 units with return_sequences set to True, and the second has 100 units. A third layer with a single Dense node outputs the prediction. The model summary below also shows a Dropout layer after each LSTM layer. The RNN takes in frames of 50 values from the time series; the next input is the same frame shifted left, with the most recent price value inserted at the end of the frame. This way the RNN will try to "remember" sequences of values in the time series.
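As a sanity check on the architecture, the parameter counts reported in the Keras model summary in this section (10,400, 60,400 and 101) can be reproduced by hand. The sketch below assumes a standard LSTM layer (four gates, each with input weights, recurrent weights and a bias) and one input feature per time step; the helper names are hypothetical.

```python
def lstm_param_count(input_dim, units):
    # 4 gates, each with: input weights, recurrent weights, and one bias per unit
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(input_dim, units):
    # one weight per input/output pair, plus one bias per output
    return input_dim * units + units

first_lstm = lstm_param_count(input_dim=1, units=50)     # one price value per time step
second_lstm = lstm_param_count(input_dim=50, units=100)  # fed by the 50 outputs of the first layer
dense = dense_param_count(input_dim=100, units=1)
total = first_lstm + second_lstm + dense

print(first_lstm, second_lstm, dense, total)
```

Dropout and Activation layers add no trainable parameters, which is why they show 0 in the summary.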

To yield RNN predictions, one 50-sample frame is taken and used to generate one forecast value. This forecast value is then inserted at the end of the 50-sample frame (after left shifting), and the frame is fed into the RNN again. This is repeated until all values in the 50-sample frame are forecasted values, and not real prices. This 50-sample frame is then plotted below.
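The recursive forecasting loop described above can be sketched independently of Keras. In the sketch below, recursive_forecast and linear_extrapolation are hypothetical names, and the stand-in predictor simply continues the last step of the frame; in the project, the trained RNN plays that role.

```python
import numpy as np

def recursive_forecast(predict_next, seed_frame, n_steps):
    """Feed each forecast back into the frame to produce the next one."""
    frame = np.asarray(seed_frame, dtype=float).copy()
    forecasts = []
    for _ in range(n_steps):
        next_value = float(predict_next(frame))
        forecasts.append(next_value)
        # shift the frame left by one and append the new forecast at the end
        frame = np.append(frame[1:], next_value)
    return np.array(forecasts)

def linear_extrapolation(frame):
    # Stand-in for the trained model (assumption): continue the last observed step
    return 2.0 * frame[-1] - frame[-2]

seed_frame = np.arange(50, dtype=float)  # 50 "real" values
forecast = recursive_forecast(linear_extrapolation, seed_frame, n_steps=50)
print(forecast[:5], forecast[-1])
```

After 50 steps the frame contains only forecasted values, so any error made early on is fed back into every later prediction. This is one reason such multi-step forecasts tend to drift.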

Below will be displayed the prediction graph of the RNN, in addition to the R^2 scores.

Note: here's the references:

Note: from the graph plotted below, the RNN doesn't seem to yield a great forecast (this probably depends on the limited training data given below). Given my limited understanding of RNNs, neural networks will not be investigated further in this project. Therefore, any data generated by the RNN processing or prediction will not be stored in the 'learning_df' dataframe.

stock_rnn = StockRegressor('GLD', dates=  ['2014-01-01', '2016-08-30'], n_days_to_read_ahead = 350)
stock_rnn.trainRNN()
scores = stock_rnn.score_RNN(verbose = True)
y = stock_rnn.predictRNN(start_date = stock_rnn.training_end_index, plot_prediction = True)

Getting pricing information for GLD for the period 2014-01-01 to 2017-08-15
Found a pricing file with wide range of dates, reading ... Stock-GLD-1999-01-01-2018-01-31.csv 
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm_5 (LSTM)                (None, None, 50)          10400     
_________________________________________________________________
dropout_5 (Dropout)          (None, None, 50)          0         
_________________________________________________________________
lstm_6 (LSTM)                (None, 100)               60400     
_________________________________________________________________
dropout_6 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 101       
_________________________________________________________________
activation_3 (Activation)    (None, 1)                 0         
=================================================================
Total params: 70,901
Trainable params: 70,901
Non-trainable params: 0
_________________________________________________________________
Train on 558 samples, validate on 62 samples
Epoch 1/5
558/558 [==============================] - 2s - loss: 0.0035 - val_loss: 3.9191e-04
Epoch 2/5
558/558 [==============================] - 1s - loss: 0.0019 - val_loss: 5.4848e-04
Epoch 3/5
558/558 [==============================] - 1s - loss: 5.0586e-04 - val_loss: 0.0011
Epoch 4/5
558/558 [==============================] - 1s - loss: 6.5526e-04 - val_loss: 0.0013
Epoch 5/5
558/558 [==============================] - 1s - loss: 7.3238e-04 - val_loss: 8.0965e-04

--------------------------------------------------------------------
R^2 Score of RNN Training: 0.6024791521647193
R^2 Score of RNN Testing: -1.3898354407758151

For a different time range, however, the RNN seems to be doing a little bit better ...

stock_rnn = StockRegressor('GLD', dates= ['2010-01-01', '2014-01-01'], n_days_to_read_ahead = 350)
stock_rnn.trainRNN()
scores = stock_rnn.score_RNN(verbose = True)
y = stock_rnn.predictRNN(start_date = stock_rnn.training_end_index, plot_prediction = True)

Getting pricing information for GLD for the period 2010-01-01 to 2014-12-17
Found a pricing file with wide range of dates, reading ... Stock-GLD-1999-01-01-2018-01-31.csv 
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm_7 (LSTM)                (None, None, 50)          10400     
_________________________________________________________________
dropout_7 (Dropout)          (None, None, 50)          0         
_________________________________________________________________
lstm_8 (LSTM)                (None, 100)               60400     
_________________________________________________________________
dropout_8 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 101       
_________________________________________________________________
activation_4 (Activation)    (None, 1)                 0         
=================================================================
Total params: 70,901
Trainable params: 70,901
Non-trainable params: 0
_________________________________________________________________
Train on 860 samples, validate on 96 samples
Epoch 1/5
860/860 [==============================] - 3s - loss: 0.0056 - val_loss: 0.0014
Epoch 2/5
860/860 [==============================] - 1s - loss: 0.0023 - val_loss: 6.7456e-04
Epoch 3/5
860/860 [==============================] - 1s - loss: 0.0014 - val_loss: 6.6414e-04
Epoch 4/5
860/860 [==============================] - 1s - loss: 0.0011 - val_loss: 6.6470e-04
Epoch 5/5
860/860 [==============================] - 1s - loss: 6.5312e-04 - val_loss: 5.3317e-04

--------------------------------------------------------------------
R^2 Score of RNN Training: 0.7889911266403573
R^2 Score of RNN Testing: -1.366288211406522

Exploring Regressions

Now let's run our first regression, and see what it looks like ...

stock.trainRegression(poly_degree = 1, verbose=True)

Regression Model Coefficients of Poly degree 1: [ 0.          0.68139685]
Regression Model Intercept of Poly degree 1: 480.88633159844323
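To illustrate what a regression on the timeline index does, here is a minimal sketch using np.polyfit on synthetic, noise-free data. It is not the trainRegression() implementation: fit_and_extrapolate is a hypothetical helper, and the slope and intercept of the synthetic series are merely chosen to resemble the values printed above.

```python
import numpy as np

def fit_and_extrapolate(timeline, prices, poly_degree, n_days_ahead=50):
    """Fit a polynomial of the given degree and evaluate it past the last training day."""
    coefficients = np.polyfit(timeline, prices, poly_degree)  # highest power first
    future_timeline = np.arange(timeline[-1] + 1, timeline[-1] + 1 + n_days_ahead)
    return coefficients, np.polyval(coefficients, future_timeline)

# Synthetic training data: 398 trading days on a straight line (illustration only)
timeline = np.arange(398)
prices = 480.89 + 0.68 * timeline

coefficients, forecast = fit_and_extrapolate(timeline, prices, poly_degree=1)
slope, intercept = coefficients
print(slope, intercept, forecast[:3])
```

The same helper accepts poly_degree = 2 or 3. Higher orders follow the training data more closely, but they can also curve away sharply once evaluated beyond the training range.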

Another look at 'learning_df' ... and we can see that there's a new column called 'Linear Regression Order 1'.

Let's also plot learning_df to see how the linear regression of order 1 is doing relative to the real prices. You will notice that the regression is also predicting the first 50 "trading days" after the cutoff date delimited by the vertical dashed line, which also corresponds to the end of the training period.

A default value of 50 trading days for prediction is used throughout this project, although this can be changed at any time, since it is parametrized.

Note: At first glance, the linear regression of order 1 is not doing a great job of predicting the testing period prices.

display(stock.learning_df.head())
stock.plot_learning_data_frame()

	Adj Close	dates	timeline	Rolling Mean-60	Linear Regression Order 1
timeline
0	566.714111	2014-10-01	0	537.17319	480.886332
1	568.519104	2014-10-02	1	537.17319	481.567728
2	573.704895	2014-10-03	2	537.17319	482.249125
3	575.769226	2014-10-06	3	537.17319	482.930522
4	562.196472	2014-10-07	4	537.17319	483.611919

Let's add another regression of order 2 ... and see what it looks like.

Note: The linear regression of order 2 is not doing great forecasting prices either ...

stock.trainRegression(poly_degree = 2, verbose=True)
display(stock.learning_df.head())
stock.plot_learning_data_frame()

Regression Model Coefficients of Poly degree 2: [ 0.          0.15435083  0.00132757]
Regression Model Intercept of Poly degree 2: 515.6713690993323

	Adj Close	dates	timeline	Rolling Mean-60	Linear Regression Order 1	Linear Regression Order 2
timeline
0	566.714111	2014-10-01	0	537.17319	480.886332	515.671369
1	568.519104	2014-10-02	1	537.17319	481.567728	515.827048
2	573.704895	2014-10-03	2	537.17319	482.249125	515.985381
3	575.769226	2014-10-06	3	537.17319	482.930522	516.146370
4	562.196472	2014-10-07	4	537.17319	483.611919	516.310014

Adding linear regression of order 3 ...

Note: The linear regression of order 3 seems to be doing better when it comes to predicting future prices.

stock.trainRegression(poly_degree = 3, verbose=True)
display(stock.learning_df.head())
stock.plot_learning_data_frame()

Regression Model Coefficients of Poly degree 3: [  0.00000000e+00  -1.67574063e+00   1.28665856e-02  -1.93770172e-05]
Regression Model Intercept of Poly degree 3: 575.8357868560126

	Adj Close	dates	timeline	Rolling Mean-60	Linear Regression Order 1	Linear Regression Order 2	Linear Regression Order 3
timeline
0	566.714111	2014-10-01	0	537.17319	480.886332	515.671369	575.835787
1	568.519104	2014-10-02	1	537.17319	481.567728	515.827048	574.172893
2	573.704895	2014-10-03	2	537.17319	482.249125	515.985381	572.535617
3	575.769226	2014-10-06	3	537.17319	482.930522	516.146370	570.923841
4	562.196472	2014-10-07	4	537.17319	483.611919	516.310014	569.337450

Exploring Market Momentum

We will explore now the market momentum and how it can be added to our model.

Let's first see how different dates will affect the momentum regression, which is a regression based on the few days earlier than the forecast date (first forecast date shown below as the red vertical dashed line).

However, below we show that for the same forecast date/testing period, the market momentum regression takes wildly different values, depending on the number of days used for the regression. The number of days looks like a good candidate for a hyper-parameter.

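def plot_momentum(stock, n_days_for_regression = 20):

    # fit a straight line (degree 1) to the last n_days_for_regression training samples
    momentum_polyfit = np.polyfit(stock.X_train[-n_days_for_regression:], 
                                  stock.y_train[-n_days_for_regression:], 1)

    # x values covering the regression window plus the whole testing period
    momentum_trend_x = np.concatenate((stock.X_train[stock.training_end_index - n_days_for_regression:],
                                       stock.X_test), axis=0)

    # np.polyfit returns the highest power first: [slope, intercept]
    momentum_linear_reg_o1 = momentum_polyfit[1] + momentum_polyfit[0] * momentum_trend_x

    # show 60 extra days before the regression window and 20 days after the testing period
    prices = pd.DataFrame(stock.adj_close_price['Adj Close'][stock.training_end_index - n_days_for_regression - 60:
                                                              stock.testing_end_index + 20])
    prices['Momentum'] = np.nan
    # positional assignment avoids pandas chained-assignment problems
    prices.iloc[60:-20, prices.columns.get_loc('Momentum')] = momentum_linear_reg_o1

    prices.plot()
    plt.axvline(stock.training_end_index, color='r', linestyle='dashed', linewidth=2)
    plt.show()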
    
plot_momentum(stock, n_days_for_regression = 20)
plot_momentum(stock, n_days_for_regression = 60)

Adding the market momentum of the last 30 trading days to the regressions looks like the following.

stock.trainMomentum(days_for_regression = 30, verbose = True )
display(stock.learning_df[stock.training_end_index -2:stock.training_end_index + 2])
stock.plot_learning_data_frame()

	Adj Close	dates	timeline	Rolling Mean-60	Linear Regression Order 1	Linear Regression Order 2	Linear Regression Order 3	Momentum
timeline
396	691.020020	2016-04-28	396	721.881335	750.719485	784.978804	726.632958	726.248396
397	693.010010	2016-04-29	397	721.315669	751.400882	786.185919	726.021502	725.375215
398	698.210022	2016-05-02	398	NaN	752.082279	787.395690	725.389622	724.502034
399	692.359985	2016-05-03	399	NaN	752.763676	788.608115	724.737204	723.628853

An Observation ...

Running the above regressions and the momentum regression countless times on multiple stocks and multiple dates yields an interesting observation. In most cases (but not all), the real stock prices in the forecasting / testing period seem to be "bounded" by the regressions of order 1, 2, 3 and the market momentum.

The below will illustrate this for a different stock (or commodity in this case - Gold).

Note: This is of course true given that the training period is big enough - something around 1.5 - 2 years or between 400 and 500 samples of trading days. If the training is too short, then the regressions of higher order (2 and 3) will curve sharply upwards or downwards in the testing period.

stock2 = StockRegressor('GLD', dates=['2014-10-01', '2016-06-01'])
stock2.trainRegression(poly_degree = 1, verbose=False)
stock2.trainRegression(poly_degree = 2, verbose=False)
stock2.trainRegression(poly_degree = 3, verbose=False)
stock2.trainMomentum(days_for_regression = 30, verbose=False)
display(stock2.learning_df[stock2.training_end_index -2:stock2.training_end_index + 2])
stock2.plot_learning_data_frame()

Getting pricing information for GLD for the period 2014-10-01 to 2016-10-29
Found a pricing file with wide range of dates, reading ... Stock-GLD-1999-01-01-2018-01-31.csv

	Adj Close	dates	timeline	Rolling Mean-60	Linear Regression Order 1	Linear Regression Order 2	Linear Regression Order 3	Momentum
timeline
417	115.620003	2016-05-27	417	119.389167	111.351181	118.626410	125.831826	118.947761
418	116.059998	2016-05-31	418	119.314501	111.343483	118.724663	126.143046	118.863249
419	115.940002	2016-06-01	419	NaN	111.335786	118.823423	126.457341	118.778738
420	115.669998	2016-06-02	420	NaN	111.328089	118.922692	126.774722	118.694226

Refinement

A New Approach ...

Therefore, to act on the above observation, it stands to reason to combine the regressions and market momentum regression into a new approximation, that will yield a better forecast.

After some trial and error, it looks like averaging the regressions of orders 1, 2, and 3, and then splitting this average with the market momentum regression, yields somewhat good results. The equation of the new model becomes:

New Reg/Momentum = momentum_split x market momentum + (1 - momentum_split) x average regressions(orders 1, 2, and 3)

The below will do just that. From now onwards, this approach will be labeled as the "Regression/Momentum" model.
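As a worked example of the equation above, the sketch below recomputes the "Prediction Reg/Momentum" value for timeline index 398 from the regression and momentum values displayed in this section, using momentum_split = 0.4. The function name blend_regression_momentum is hypothetical; only the arithmetic is taken from the equation.

```python
def blend_regression_momentum(regression_preds, momentum_pred, momentum_split):
    """momentum_split * momentum + (1 - momentum_split) * average of the regressions."""
    average_regression = sum(regression_preds) / len(regression_preds)
    return momentum_split * momentum_pred + (1.0 - momentum_split) * average_regression

# Values for timeline index 398 (2016-05-02), copied from the learning_df output
regressions = [752.082279, 787.395690, 725.389622]  # orders 1, 2 and 3
momentum = 724.502034

prediction = blend_regression_momentum(regressions, momentum, momentum_split=0.4)
print(round(prediction, 6))
```

The result agrees with the 742.774332 shown in the "Prediction Reg/Momentum" column for that row.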

stock.trainAverageMomentum(momentum_split = 0.4, verbose = False)
display(stock.learning_df[stock.training_end_index -2:stock.training_end_index + 2])
stock.plot_learning_data_frame()

	Adj Close	dates	timeline	Rolling Mean-60	Linear Regression Order 1	Linear Regression Order 2	Linear Regression Order 3	Momentum	Prediction Reg/Momentum
timeline
396	691.020020	2016-04-28	396	721.881335	750.719485	784.978804	726.632958	726.248396	742.965608
397	693.010010	2016-04-29	397	721.315669	751.400882	786.185919	726.021502	725.375215	742.871747
398	698.210022	2016-05-02	398	NaN	752.082279	787.395690	725.389622	724.502034	742.774332
399	692.359985	2016-05-03	399	NaN	752.763676	788.608115	724.737204	723.628853	742.673340

Exploring Fast Fourier Transform

The whole point of exploring the FFT transform was to investigate whether the price time series has some reliable fluctuation "pattern", that could be used to predict fluctuation. This is done by applying FFT on the de-trended data, and then getting the N strongest FFT frequencies. Then, similar to applying the inverse FFT, we calculate the phase and amplitude of these N frequencies, and use them on the testing period as forecast. We can then add the underlying price, to create the overall prediction.

During the training period, the underlying trend is provided by one of the regressions trained above. Usually, after many trials, the regression of order 3 performs well with the FFT.

During the testing period, we add this to the "Regression/Momentum" model trained above, which is the split between the market momentum and the average values of all regressions.

Note: Since the FFT needs to operate on de-trended data (data oscillating around 0), the other regressions need to be trained before using the trainFFT() function. We also need to train the momentum, since the FFT forecast is added to the "Regression/Momentum" prediction during the testing period.
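The idea of keeping the N strongest harmonics and extending them into the testing period can be sketched with NumPy as follows. This is not the trainFFT() implementation: fft_forecast is a hypothetical helper, and the residual is a synthetic signal made of two sinusoids standing in for the de-trended prices.

```python
import numpy as np

def fft_forecast(residual, num_harmonics, n_future):
    """Rebuild the residual from its strongest harmonics and extend it n_future samples."""
    residual = np.asarray(residual, dtype=float)
    n = residual.size
    spectrum = np.fft.rfft(residual)
    freqs = np.fft.rfftfreq(n)  # cycles per sample

    # Rank the harmonics by magnitude, skipping the DC term (and Nyquist when n is even)
    candidates = np.arange(1, (n - 1) // 2 + 1)
    order = np.argsort(np.abs(spectrum[candidates]))[::-1]
    strongest = candidates[order[:num_harmonics]]

    t = np.arange(n + n_future)
    wave = np.zeros(t.size)
    for k in strongest:
        amplitude = 2.0 * np.abs(spectrum[k]) / n  # one-sided spectrum, hence the factor 2
        phase = np.angle(spectrum[k])
        wave += amplitude * np.cos(2.0 * np.pi * freqs[k] * t + phase)
    return wave

# Synthetic "de-trended" data: two sinusoids with whole numbers of cycles in 200 samples
n_train, n_future = 200, 50
t_all = np.arange(n_train + n_future)
signal = (3.0 * np.sin(2.0 * np.pi * 5 * t_all / n_train)
          + 1.5 * np.cos(2.0 * np.pi * 12 * t_all / n_train + 0.4))

forecast = fft_forecast(signal[:n_train], num_harmonics=2, n_future=n_future)
print(np.max(np.abs(forecast - signal)))
```

On this idealized signal the extension is exact, because the harmonics fit the training window perfectly. Real price residuals are not periodic in this way, so the extrapolated fluctuation should be read as exploratory rather than as a reliable forecast.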

stock.trainFFT(num_harmonics = 6, underlying_trend_poly = 3, verbose = True)
display(stock.learning_df.head())
stock.plot_learning_data_frame()

	Adj Close	dates	timeline	Rolling Mean-60	Linear Regression Order 1	Linear Regression Order 2	Linear Regression Order 3	Momentum	Prediction Reg/Momentum	Prediction w/FFT
timeline
0	566.714111	2014-10-01	0	537.17319	480.886332	515.671369	575.835787	NaN	524.131163	501.446744
1	568.519104	2014-10-02	1	537.17319	481.567728	515.827048	574.172893	NaN	523.855890	500.498907
2	573.704895	2014-10-03	2	537.17319	482.249125	515.985381	572.535617	NaN	523.590041	500.336438
3	575.769226	2014-10-06	3	537.17319	482.930522	516.146370	570.923841	NaN	523.333578	500.786457
4	562.196472	2014-10-07	4	537.17319	483.611919	516.310014	569.337450	NaN	523.086461	501.622055

Results

Model Evaluation and Validation

Predicting and Scoring

The two functions below will score and predict all the approaches, and will also give the percent variation of each.

The predict() function yields predictions for the first day after the training period, the 8th day, the 15th day, the 22nd day, and so on until the 50th day. These predictions are all in the testing period, and the model takes only one feature as input, which is the date (or timeline index).
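The percent variation figures can be reproduced by hand. The sketch below uses the 'Adj Close' and 'Reg/Mom Pred' columns from the prediction table in this section; the helper name percent_variation is hypothetical, and the formula (predicted minus real, divided by real) is inferred from the table values rather than taken from the StockRegressor source.

```python
def percent_variation(predicted, real):
    """Signed percent difference of each prediction relative to the real price."""
    return [(p - r) / r * 100.0 for p, r in zip(predicted, real)]

# Columns copied from the prediction table (days 1, 8, 15, ..., 50)
actual = [698.21, 712.90, 716.49, 704.24, 735.72, 716.55, 718.36, 693.71]
reg_mom_pred = [742.77, 742.23, 741.60, 740.87, 740.04, 739.31, 738.30, 737.18]

pct = percent_variation(reg_mom_pred, actual)
mean_abs_pct = sum(abs(v) for v in pct) / len(pct)

print([round(v, 2) for v in pct])
print(round(mean_abs_pct, 2))
```

The mean is taken over absolute values, which is why it is reported as "+/-" in the summary lines.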

stock.score(verbose = True)
stock.predict()

--------------------------------------------------------------
R^2 Score of Linear Regression of Poly order 1 Training: 0.79
R^2 Score of Linear Regression of Poly order 1 Testing: -16.30
R^2 Score of Linear Regression of Poly order 2 Training: 0.82
R^2 Score of Linear Regression of Poly order 2 Testing: -53.52
R^2 Score of Linear Regression of Poly order 3 Training: 0.89
R^2 Score of Linear Regression of Poly order 3 Testing: -0.75
R^2 Score of Reg/Momentum Training: 0.85
R^2 Score of Reg/Momentum Testing: -3.81
R^2 Score of FFT Training: 0.93
R^2 Score of FFT Testing: -3.52
--------------------------------------------------------------


Training End date: 2016-04-30
First Day of Prediction: 2016-05-02

	Day	Index	Date	Adj Close	Reg/Mom Pred	FFT Prediction	Reg1 Pred	Reg2 Pred	Reg3 Pred	Reg/Mom Pct Var %	FFT Pct Var %	Reg1 Pct Var %	Reg2 Pct Var %	Reg3 Pct Var %
0	1	398	2016-05-02	698.21	742.77	720.09	752.08	787.40	725.39	6.38	3.13	7.72	12.77	3.89
1	8	403	2016-05-09	712.90	742.23	721.97	755.49	793.48	721.92	4.11	1.27	5.97	11.30	1.27
2	15	408	2016-05-16	716.49	741.60	722.01	758.90	799.64	717.92	3.50	0.77	5.92	11.61	0.20
3	22	413	2016-05-23	704.24	740.87	712.35	762.30	805.86	713.38	5.20	1.15	8.24	14.43	1.30
4	29	418	2016-05-31	735.72	740.04	714.54	765.71	812.15	708.28	0.59	-2.88	4.08	10.39	-3.73
5	36	422	2016-06-06	716.55	739.31	728.18	768.44	817.23	703.80	3.18	1.62	7.24	14.05	-1.78
6	43	427	2016-06-13	718.36	738.30	740.31	771.84	823.63	697.66	2.78	3.06	7.45	14.65	-2.88
7	50	432	2016-06-20	693.71	737.18	739.17	775.25	830.11	690.92	6.27	6.55	11.75	19.66	-0.40

Mean Regression/Momentum Prediction Percent Variation: +/- 4.00%
Mean FFT Prediction Percent Variation: +/- 2.56%
Mean Regression Order 1 Prediction Percent Variation: +/- 7.30%
Mean Regression Order 2 Prediction Percent Variation: +/- 13.61%
Mean Regression Order 3 Prediction Percent Variation: +/- 1.93%

Grid Search

Given that we have quite a few parameters to feed the model, it was worthwhile to implement a grid-search function that fine-tunes these "hyper-parameters" to get the highest score possible.

Reliably predicting stock prices is an impossible task, but the intuition here is that every stock has a different volatility relative to the market (stock beta), a different set of news, different dividend payout cycles, and so on. Therefore, each stock would have its own "hyper-parameters", tuned for that stock only and nothing else.

The hyper-parameters that were implemented are:

Number of harmonics: the number of highest-power harmonics to be used in the FFT model
Momentum Split: the split between the market momentum regression, and the average of all regressions
Days for Regression: the number of days to use to calculate the momentum regression, immediately preceding the forecast date
Underlying Trend Poly: this is the polynomial order of the regression used for the underlying trend used by the FFT model to "de-trend" the data and apply the FFT on the resulting residual
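
To make the roles of "Number of harmonics" and "Underlying Trend Poly" concrete, here is a minimal sketch of de-trending followed by an FFT extrapolation. It is an assumption about how such a model can be built with NumPy, not the actual StockRegressor code; the function name `fft_extrapolate` and its arguments are hypothetical.

```python
import numpy as np

def fft_extrapolate(prices, n_predict, num_harmonics=4, trend_poly=3):
    """De-trend with a polynomial, keep the strongest harmonics, extrapolate."""
    prices = np.asarray(prices, dtype=float)
    n = prices.size
    t = np.arange(n)

    # 1. Fit the underlying trend and subtract it to get the residual
    trend_coeffs = np.polyfit(t, prices, trend_poly)
    residual = prices - np.polyval(trend_coeffs, t)

    # 2. FFT of the residual, then rank the frequency bins by power
    spectrum = np.fft.rfft(residual)
    freqs = np.fft.rfftfreq(n)
    power = np.abs(spectrum) ** 2
    power[0] = 0.0  # skip the constant term, the trend already carries the level
    strongest = np.argsort(power)[::-1][:num_harmonics]

    # 3. Rebuild the residual from the selected harmonics over the extended range
    t_ext = np.arange(n + n_predict)
    rebuilt = np.zeros(t_ext.size)
    for k in strongest:
        # Interior bins of a real FFT carry half the amplitude; Nyquist does not
        scale = 1.0 if (n % 2 == 0 and k == n // 2) else 2.0
        amplitude = scale * np.abs(spectrum[k]) / n
        phase = np.angle(spectrum[k])
        rebuilt += amplitude * np.cos(2.0 * np.pi * freqs[k] * t_ext + phase)

    # 4. Add the extrapolated trend back
    return np.polyval(trend_coeffs, t_ext) + rebuilt

# Synthetic demo (not real stock data): linear trend plus one cosine component
t = np.arange(200)
synthetic = 100.0 + 0.1 * t + 5.0 * np.cos(2.0 * np.pi * 4 * t / 200)
forecast = fft_extrapolate(synthetic, n_predict=50, num_harmonics=1, trend_poly=1)
print(forecast.shape)
```

Because the harmonics are periodic, the extrapolated part simply repeats the cycles found in the training window, which is why this approach only helps when the stock actually shows stable cyclicality.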

Default values of the hyper-parameters are:

Regressions: regressions of polynomial order 1, 2, and 3 are implemented
Number of harmonics: default is set to 4
Underlying Trend Poly of order 3 is used for the FFT de-trending
Days for Regression: the default is set to 15
Momentum Split: the default is set to 0.25
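
One plausible reading of "Momentum Split" and "Days for Regression" is a weighted blend between a short-window linear fit and the average of the polynomial regressions. The sketch below assumes exactly that; the weighting formula, the function name `blend_with_momentum`, and its arguments are illustrative assumptions rather than the StockRegressor implementation.

```python
import numpy as np

def blend_with_momentum(prices, reg_average, days_for_regression=15,
                        momentum_split=0.25):
    """Blend a short-window linear 'momentum' fit with averaged regressions.

    prices      : training prices (1-D sequence)
    reg_average : averaged regression forecasts for the forecast horizon
    """
    prices = np.asarray(prices, dtype=float)
    reg_average = np.asarray(reg_average, dtype=float)
    n = prices.size

    # Linear fit over the last `days_for_regression` training days only
    t_recent = np.arange(n - days_for_regression, n)
    slope, intercept = np.polyfit(t_recent, prices[-days_for_regression:], 1)

    # Extend that line over the forecast horizon
    t_future = np.arange(n, n + reg_average.size)
    momentum = slope * t_future + intercept

    # Assumed blend: momentum_split weights the momentum line
    return momentum_split * momentum + (1.0 - momentum_split) * reg_average

# Toy data: prices rise by 1 per day, the regressions forecast a flat 200
toy_prices = 100.0 + np.arange(30)
toy_reg_average = np.full(5, 200.0)
blended = blend_with_momentum(toy_prices, toy_reg_average,
                              days_for_regression=15, momentum_split=0.25)
print(blended)
```

With a split of 0.25 the forecast stays close to the long-run regressions and only leans toward the recent trend; a split near 1.0 would follow the last few weeks almost exclusively.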

One more hyper-parameter not listed above is "Training Delta Months": for each combination of the above hyper-parameters (harmonics, momentum split, days for regression, and polynomial order of the underlying trend), the predictions are repeated over a sliding window that moves back in time one month at a time from the end of the training period. The R^2 scores are then averaged for each combination over the total number of months.

To illustrate all the hyper-parameters that can be set, the stock below is created, trained, and predicted.
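
The sliding window can be sketched as follows, assuming each window ends on the month end preceding the previous one. The helper names and the placeholder scores are illustrative only and do not come from StockGridSearch.

```python
import calendar
import datetime

def previous_month_ends(training_end, training_delta_months):
    """Month-end dates stepping back one month at a time before training_end."""
    ends = []
    year, month = training_end.year, training_end.month
    for _ in range(training_delta_months):
        month -= 1
        if month == 0:
            year, month = year - 1, 12
        last_day = calendar.monthrange(year, month)[1]
        ends.append(datetime.date(year, month, last_day))
    return ends

def mean_r2_over_windows(score_fn, window_ends):
    """Average the score returned by score_fn for every window end date."""
    scores = [score_fn(end) for end in window_ends]
    return sum(scores) / len(scores)

window_ends = previous_month_ends(datetime.date(2016, 4, 30), 3)
print(window_ends)

# Placeholder R^2 scores per window, for illustration only
toy_scores = {window_ends[0]: -1.0, window_ends[1]: 0.5, window_ends[2]: -2.5}
mean_score = mean_r2_over_windows(toy_scores.get, window_ends)
print(mean_score)
```

Averaging over several windows reduces the chance that a combination of hyper-parameters is selected only because it happened to fit one particular month.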

s = StockRegressor('GLD',  dates = ['2014-10-01', '2016-04-30'], n_days_to_read_ahead=200)
s.train(num_harmonics = 4, momentum_split = 0.4, days_for_regression = 20, underlying_trend_poly=3, 
                                       poly_degree = [1, 2, 3])
s.score(verbose=True)
s.predict()
s.plot_learning_data_frame()

Getting pricing information for GLD for the period 2014-10-01 to 2016-11-16
Found a pricing file with wide range of dates, reading ... Stock-GLD-1999-01-01-2018-01-31.csv 
Training end date is 2016-04-30, corresponding to the 398th sample
The data has 398 training samples and 140 testing samples with a total of 538 samples
Training set has 398 samples.
Testing set has 50 samples.
Regression Model Coefficients of Poly degree 1: [ 0.        -0.0150657]
Regression Model Intercept of Poly degree 1: 115.55606908425585
Regression Model Coefficients of Poly degree 2: [ 0.         -0.10540178  0.00022755]
Regression Model Intercept of Poly degree 2: 121.51825035399823
Regression Model Coefficients of Poly degree 3: [  0.00000000e+00   1.40551596e-01  -1.32322782e-03   2.60415553e-06]
Regression Model Intercept of Poly degree 3: 113.4325114891368

--------------------------------------------------------------
R^2 Score of Linear Regression of Poly order 1 Training: 0.10
R^2 Score of Linear Regression of Poly order 1 Testing: -11.29
R^2 Score of Linear Regression of Poly order 2 Training: 0.32
R^2 Score of Linear Regression of Poly order 2 Testing: -1.10
R^2 Score of Linear Regression of Poly order 3 Training: 0.63
R^2 Score of Linear Regression of Poly order 3 Testing: -8.59
R^2 Score of Reg/Momentum Training: 0.48
R^2 Score of Reg/Momentum Testing: 0.38
R^2 Score of FFT Training: 0.75
R^2 Score of FFT Testing: 0.25
--------------------------------------------------------------


Training End date: 2016-04-30
First Day of Prediction: 2016-05-02

	Day	Index	Date	Adj Close	Reg/Mom Pred	FFT Prediction	Reg1 Pred	Reg2 Pred	Reg3 Pred	Reg/Mom Pct Var %	FFT Pct Var %	Reg1 Pct Var %	Reg2 Pct Var %	Reg3 Pct Var %
0	1	398	2016-05-02	123.24	118.04	118.46	109.56	115.61	123.95	-4.22	-3.88	-11.10	-6.19	0.57
1	8	403	2016-05-09	120.65	118.76	119.44	109.48	116.00	125.61	-1.56	-1.00	-9.25	-3.86	4.12
2	15	408	2016-05-16	121.80	119.51	119.98	109.41	116.39	127.38	-1.88	-1.50	-10.17	-4.44	4.58
3	22	413	2016-05-23	119.37	120.28	120.14	109.33	116.80	129.23	0.76	0.65	-8.41	-2.15	8.26
4	29	418	2016-05-31	116.06	121.07	120.16	109.26	117.22	131.18	4.31	3.53	-5.86	1.00	13.03
5	36	422	2016-06-06	118.92	121.71	120.22	109.20	117.56	132.81	2.35	1.10	-8.17	-1.14	11.68
6	43	427	2016-06-13	122.64	122.54	120.52	109.12	118.00	134.93	-0.08	-1.73	-11.02	-3.78	10.02
7	50	432	2016-06-20	123.21	123.39	121.12	109.05	118.45	137.16	0.14	-1.70	-11.49	-3.86	11.32

Mean Regression/Momentum Prediction Percent Variation: +/- 1.91%
Mean FFT Prediction Percent Variation: +/- 1.88%
Mean Regression Order 1 Prediction Percent Variation: +/- 9.44%
Mean Regression Order 2 Prediction Percent Variation: +/- 3.30%
Mean Regression Order 3 Prediction Percent Variation: +/- 7.95%

Running the Grid Search

The cell below runs the grid search over the Google stock for the same period, to see whether it can beat the results obtained with the default hyper-parameters: a mean percent variation of 4.00% for the Reg/Momentum model and 2.56% for the FFT model.

The R^2 scores of the Regression/Momentum model and of the FFT model are saved for each iteration and then averaged. The combination of hyper-parameters with the maximum mean R^2 score is selected as the best one.
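
The selection step can be sketched with `itertools.product`. This is a generic exhaustive search under stated assumptions, not the StockGridSearch code: `grid_search` and `toy_score` are hypothetical names, and `toy_score` stands in for "train, predict, and return the mean R^2".

```python
import itertools

def grid_search(param_grid, score_fn):
    """Return (best_params, best_score) maximizing score_fn over the grid."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(num_harmonics, days_for_regression, underlying_trend_poly,
              momentum_split):
    """Stand-in score that is always negative, like most R^2 scores here."""
    return (-1.0
            - abs(num_harmonics - 4)
            - abs(days_for_regression - 25) / 10
            - abs(underlying_trend_poly - 3)
            - abs(momentum_split - 0.4))

# Same value lists as the gs.train call in this section
param_grid = {
    "num_harmonics": [2, 4, 6, 12],
    "days_for_regression": [15, 25, 35, 45],
    "underlying_trend_poly": [2, 3],
    "momentum_split": [0.2, 0.4, 0.6, 0.8],
}

best_params, best_score = grid_search(param_grid, toy_score)
print(best_params, best_score)
```

Starting from negative infinity means the search still returns the least bad combination when every score is negative.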

gs = StockGridSearch(ticker = 'GOOG', dates= ['2014-10-01', '2016-04-30'], training_delta_months = 3)
gs.train(n_days_to_predict = 50, 
         num_harmonics = [2, 4, 6, 12],
         days_for_regression = [15, 25, 35, 45],
         poly_degree = [1, 2, 3],
         underlying_trend_poly = [2, 3], 
         momentum_split = [0.2, 0.4, 0.6, 0.8])

Total Sample Size: 502 samples
Training Sample Size: 348 samples
Training Window: 347 samples
Validation Window: 50 samples
Testing Sample Size: 50 samples
Training End Date is 2016-03-31 corresponding to the 347th sample
Validation End Index 397 over range 0 - 347 with validation window 50
There are 384 combinations with 1 iterations each: total iterations is 384

Progress:
Iteration Progress: 384 / 384

Hyper-Parameters:
Harmonics Hyperparamter: 12
Days Used for Momentum Regression Hyperparamter: 45
Regression Order for Underlying Trend for FFT Hyperparamter: 3
Momentum Split for Underlying Trend for FFT Hyperparamter: 0.8

Mean R^2 Scores:
Regression of Order 1: Training 0.60 | Validation 0.13
Regression of Order 2: Training 0.88 | Validation -39.62
Regression of Order 3: Training 0.88 | Validation -27.97
Regression of Order 3 with Momentum: Training 0.87 | Validation -1.45
FFT with Underlying Trend of Regression of Order 3: Training 0.94 | Validation -1.31


Model took 217.94 seconds to train.

All mean R^2 score results are:
Regression with Momentum: -7.61
FFT: -8.92

--------------------------------------------------------------------
Best Method of Estimation is a combination of Regression of multiple orders and momentum regression with 15 days before forecast period, and with 0.4 split with momentum

--------------------------------------------------------------------
Now training new StockRegressor instance with optimal hyper-parameters.
Getting pricing information for GOOG for the period 2014-10-01 to 2016-09-27
Found a pricing file with wide range of dates, reading ... Stock-GOOG-1999-01-01-2017-09-03.csv 
Training end date is 2016-04-30, corresponding to the 398th sample
The data has 398 training samples and 104 testing samples with a total of 502 samples
Training set has 398 samples.
Testing set has 50 samples.
Regression Model Coefficients of Poly degree 1: [ 0.          0.68139685]
Regression Model Intercept of Poly degree 1: 480.88633159844323
Regression Model Coefficients of Poly degree 2: [ 0.          0.15435083  0.00132757]
Regression Model Intercept of Poly degree 2: 515.6713690993323
Regression Model Coefficients of Poly degree 3: [  0.00000000e+00  -1.67574063e+00   1.28665856e-02  -1.93770172e-05]
Regression Model Intercept of Poly degree 3: 575.8357868560126

--------------------------------------------------------------
R^2 Score of Linear Regression of Poly order 1 Training: 0.79
R^2 Score of Linear Regression of Poly order 1 Testing: -16.30
R^2 Score of Linear Regression of Poly order 2 Training: 0.82
R^2 Score of Linear Regression of Poly order 2 Testing: -53.52
R^2 Score of Linear Regression of Poly order 3 Training: 0.89
R^2 Score of Linear Regression of Poly order 3 Testing: -0.75
R^2 Score of Reg/Momentum Training: 0.85
R^2 Score of Reg/Momentum Testing: -2.17
R^2 Score of FFT Training: 0.97
R^2 Score of FFT Testing: -1.55
--------------------------------------------------------------


Training End date: 2016-04-30
First Day of Prediction: 2016-05-02

	Day	Index	Date	Adj Close	Reg/Mom Pred	FFT Prediction	Reg1 Pred	Reg2 Pred	Reg3 Pred	Reg/Mom Pct Var %	FFT Pct Var %	Reg1 Pct Var %	Reg2 Pct Var %	Reg3 Pct Var %
0	1	398	2016-05-02	698.21	732.75	725.51	752.08	787.40	725.39	4.95	3.91	7.72	12.77	3.89
1	8	403	2016-05-09	712.90	725.23	750.14	755.49	793.48	721.92	1.73	5.22	5.97	11.30	1.27
2	15	408	2016-05-16	716.49	717.62	735.73	758.90	799.64	717.92	0.16	2.69	5.92	11.61	0.20
3	22	413	2016-05-23	704.24	709.91	724.13	762.30	805.86	713.38	0.81	2.82	8.24	14.43	1.30
4	29	418	2016-05-31	735.72	702.10	730.35	765.71	812.15	708.28	-4.57	-0.73	4.08	10.39	-3.73
5	36	422	2016-06-06	716.55	695.79	726.02	768.44	817.23	703.80	-2.90	1.32	7.24	14.05	-1.78
6	43	427	2016-06-13	718.36	687.79	705.85	771.84	823.63	697.66	-4.25	-1.74	7.45	14.65	-2.88
7	50	432	2016-06-20	693.71	679.70	689.62	775.25	830.11	690.92	-2.02	-0.59	11.75	19.66	-0.40

Mean Regression/Momentum Prediction Percent Variation: +/- 2.67%
Mean FFT Prediction Percent Variation: +/- 2.38%
Mean Regression Order 1 Prediction Percent Variation: +/- 7.30%
Mean Regression Order 2 Prediction Percent Variation: +/- 13.61%
Mean Regression Order 3 Prediction Percent Variation: +/- 1.93%

Benchmarking

Regressions of polynomial orders 1 and 3, as well as ARIMA, will be used to benchmark the Regression/Momentum model.

The Mean Squared Error (MSE) is calculated over a period of 7 years, with training windows that start 30 days apart (roughly every month, for 84 iterations). For each window, the MSE compares the 50-day forecast following the end of the training period with the actual prices over those 50 days.

def pr(ticker='GOOG', dates = ['2013-01-01', '2014-09-30'], verbose = True, r = 15, ms = 0.25):
    stock = StockRegressor(ticker,  dates = dates, n_days_to_read_ahead=250, 
                               n_days_to_predict = 50, verbose = False)

    series = stock.adj_close_price.copy()
    series.index = series['dates']
    series = series.drop(labels=['dates', 'timeline'], axis=1)

    res_pd = pd.DataFrame(series)

    if verbose == True:
        print("\nTraining End Index {}".format(stock.training_end_index))
        print("Length of Learning DF {}".format(len(stock.learning_df)))

    history = list(series['Adj Close'][:stock.training_end_index])

    orig_training_end_index = stock.training_end_index

    res_pd['ARIMA'] = np.nan
    res_pd['Reg/Mom'] = np.nan
    res_pd['Reg 1'] = np.nan
    res_pd['Reg 3'] = np.nan
    predictions = list()

    itrts = 50

    for t in range(1):#itrts):
        if verbose == True:
            print("\r t {}\r ".format(t))
            clear_output(wait=True) 

        model = ARIMA(history, order=(5,1,0))
        stock.training_end_index = orig_training_end_index + t
        stock.train(no_FFT = True, keep_indexes = True, verbose=False,  days_for_regression=r, 
                                    momentum_split=ms)

        model_fit = model.fit(disp=0)
        output = model_fit.forecast(steps=itrts)
        yhat = output[0]

        mom = stock.learning_df['Momentum'][orig_training_end_index:orig_training_end_index+itrts]
        regmom = stock.learning_df['Prediction Reg/Momentum']\
                                [orig_training_end_index:orig_training_end_index+itrts]
        
        avreg = stock.reg_average[orig_training_end_index:orig_training_end_index+itrts]

        res_pd['ARIMA'][orig_training_end_index:orig_training_end_index+itrts] = yhat
        res_pd['Reg/Mom'][orig_training_end_index:orig_training_end_index+itrts] = regmom
        
    error = mean_squared_error(series['Adj Close'][orig_training_end_index:orig_training_end_index+itrts], 
                                       res_pd['ARIMA'][orig_training_end_index:orig_training_end_index+itrts])
    error2 = mean_squared_error(series['Adj Close'][orig_training_end_index:orig_training_end_index+itrts], 
                                       res_pd['Reg/Mom'][orig_training_end_index:orig_training_end_index+itrts])
    
    res_pd['Reg 1'][orig_training_end_index:orig_training_end_index+itrts] = stock.reg_pred[1]\
                                [orig_training_end_index:orig_training_end_index+itrts]
    res_pd['Reg 3'][orig_training_end_index:orig_training_end_index+itrts] = stock.reg_pred[3]\
                                [orig_training_end_index:orig_training_end_index+itrts]

    error3 = mean_squared_error(series['Adj Close'][orig_training_end_index:orig_training_end_index+itrts], 
                                   res_pd['Reg 1'][orig_training_end_index:orig_training_end_index+itrts])
    error4 = mean_squared_error(series['Adj Close'][orig_training_end_index:orig_training_end_index+itrts], 
                                   res_pd['Reg 3'][orig_training_end_index:orig_training_end_index+itrts])
        
    print('Test MSE Reg/Mom: {} - {}: {:.2f}'.format(dates[0], dates[1], error2))
    print('Test MSE ARIMA: {} - {}: {:.2f}'.format(dates[0], dates[1], error))
    print('Test MSE Reg 1: {} - {}: {:.2f}'.format(dates[0], dates[1], error3))
    print('Test MSE Reg 3 {} - {}: {:.2f}'.format(dates[0], dates[1], error4))
    
    if verbose == True:
        plt.rcParams["figure.figsize"] = (15,8)  # matplotlib.pyplot is imported as plt
        res_pd[orig_training_end_index-5:orig_training_end_index+itrts*2].plot()

    clear_output(wait=True)
    
    return error, error2, error3, error4

arima_err_lst = []
momreg_err_lst = []
reg1_err_lst = []
reg3_err_lst = []

months = 84
s_date ='2007-01-01'
date_range = []

for i in range(months):
    
    s = datetime.datetime.strptime(s_date, "%Y-%m-%d") + datetime.timedelta(days=30 * i)
    e = datetime.datetime.strptime(s_date, "%Y-%m-%d") + datetime.timedelta(days=30 * i + 365 * 2)

    arima_err, momreg_err, reg1_err, reg3_err = pr(ticker='BMY', dates = [s.strftime("%Y-%m-%d"), 
                                                                           e.strftime("%Y-%m-%d")], 
                                                   r = 15, 
                                                   ms = 0.25, 
                                                   verbose = False)
    
    date_range = [s_date, e.strftime("%Y-%m-%d")] 
    arima_err_lst.append(arima_err)
    momreg_err_lst.append(momreg_err)
    reg1_err_lst.append(reg1_err)
    reg3_err_lst.append(reg3_err)

    
print("\n--------------------------------------------------------------")  
print("Means calculated monthly over the period of {}-{} with a training window of two years"\
                                                          .format(date_range[0], date_range[1]))
print("Mean Arima MSE: {:.2f}".format(np.mean(arima_err_lst)))
print("Mean Mom/Reg MSE: {:.2f}".format(np.mean(momreg_err_lst)))
print("Mean Reg O1 MSE: {:.2f}".format(np.mean(reg1_err_lst)))
print("Mean Reg O3 MSE: {:.2f}".format(np.mean(reg3_err_lst)))
print("--------------------------------------------------------------\n")

--------------------------------------------------------------
Means calculated monthly over the period of 2007-01-01-2015-10-26 with a training window of two years
Mean Arima MSE: 4.94
Mean Mom/Reg MSE: 7.14
Mean Reg O1 MSE: 11.25
Mean Reg O3 MSE: 21.95
--------------------------------------------------------------

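As shown above, the Regression/Momentum model (mean MSE of 7.14) performs much better than the Regression-only models (11.25 for order 1 and 21.95 for order 3), and comes reasonably close to the ARIMA model (4.94), which still has the lowest error.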

Conclusion

Reflection

After implementing the above models, there are multiple points worth mentioning:

It is impossible to reliably predict stock market prices, but ML can provide useful approximations and forecasting models (otherwise we would all be millionaires, or equally rich, however that would play out).
The above models (RNN, regressions of different orders, Regression/Momentum, and FFT) provide a reasonable approximation, but whether they are suitable for trading depends on how much risk is acceptable. If predicting stock prices within plus/minus 5% is considered good enough, then the above models could be good enough.
On average, after many runs, Regression/Momentum seems to be the model with the best results (R^2 scores). FFT adds too much variation, as the stock price might not have a reliable and "predictable" set of harmonics, or a pattern of cyclicality.
Grid search doesn't always yield the best results. Hyper-parameters tuned on earlier historical data in the training period are not guaranteed to be optimal later: they might be optimal for the training period, but not for the testing period.
One very frequent problem with predicting stock prices is that pricing can differ significantly from one day to the next, sometimes with a complete trend reversal. This might be in response to good or bad news, a terrorist attack, or a quarterly financial report that differed from analysts' expectations. No model trained only on past prices can anticipate such events, so its forecast will fail in those cases.
On the positive side, one observation is that next day forecasting - the day immediately after the end of the training period - seems to be more reliably predicted than the rest of the days. This stands to reason, since the market trend and momentum are very clear before that day (although this might not be too useful in the context of stock trading).
R^2 scores are mostly negative, which means the model predicts worse than a constant equal to the mean of the actual prices. Even so, the scores are still sorted and the maximum is selected.
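
To show why a negative R^2 can still be used for ranking, here is a minimal sketch that computes R^2 as 1 - SS_res / SS_tot on toy numbers. The data and the helper name `r2_manual` are illustrative only; they are not results from this project.

```python
def r2_manual(actual, predicted):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

actual = [10.0, 11.0, 12.0, 11.0, 10.0]

candidates = {
    "close": [10.5, 11.0, 11.5, 11.0, 10.5],   # follows the shape of the data
    "mean": [10.8] * 5,                        # constant at the mean of actual
    "biased": [13.0, 14.0, 15.0, 14.0, 13.0],  # right shape, wrong level
}

scores = {name: r2_manual(actual, pred) for name, pred in candidates.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print("{}: {:.2f}".format(name, scores[name]))
```

The "biased" forecast has the right shape but a constant offset, which is enough to push its R^2 far below zero. Sorting by R^2 still puts the candidates in a sensible order.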

Improvement

Future work, and areas of improvement:

The above models could be combined with an ML model for news (sentiment analysis)
The above models could be combined with the "Fundamental Analysis" approach to yield more reliable results
The above models could be combined with the "buy or sell signals" that are commonplace among traders. For example, the price moving below the lower Bollinger band is read as a buy signal by traders, which can itself push prices up. This could be yet another input feature to the model.
RNN models could be investigated in-depth. It could be that RNN can uncover predictable "patterns" of sequences of pricing that FFT cannot uncover.

Quality

References

The below are references used in this project:

samerelhousseini / StockPricePrediction