This project aims to demonstrate how various distributed data and parallel processing services could be used for financial analysis.
An example application is provided, which uses the daily close prices of a stock to generate a linear regression model. This model is then used to predict the price of a stock at a given future date.
Docker is used to easily deploy containers. Cross-platform installation instructions are available at docs.docker.com/get-docker.
Docker Compose is used to define and run multiple containers together, managing our stack of services and applications. You will likely have it already installed if you installed Docker on Windows or Mac (check with docker-compose --version). Installation instructions are available at docs.docker.com/compose/install.
We recommend that Linux users add themselves to the docker group to avoid prefixing every command with sudo. Instructions are available at docs.docker.com/engine/install/linux-postinstall.
Python is a programming language we use to build part of our stack. Cross-platform installation instructions are available at realpython.com/installing-python/.
Note we use version 3 of Python. Make sure you have Python 3 by checking the version of your installation like so:
$ python --version
Python 3.X.X
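The same check can be made from inside Python, for example at the top of a script. This is only a minimal sketch; the helper name is_python3 is our own illustration and does not exist anywhere in the stack.

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True when the given version tuple belongs to Python 3."""
    return version_info[0] == 3


if not is_python3():
    # Stop early with a readable message instead of failing later on.
    sys.exit("Python 3 is required, found " + sys.version.split()[0])
```

On some systems the Python 3 interpreter is installed as python3 rather than python.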
Git is used for version control. Cross-platform installation instructions are available at git-scm.com/downloads.
To clone this repository in a bash shell (e.g. Git Bash), use the following command:
git clone https://github.com/Joshgallagher/financial-analysis-stack.git && cd financial-analysis-stack
See instructions for cloning repositories at help.github.com.
Simply run the following command to get the stack up and running:
docker-compose up
This will tail all the services' logs in your shell. To avoid this, add the -d option. To check that the stack is running when using -d, run docker-compose ps; every service should show a status of Up (except for app, which runs periodically as jobs).
The images for each service will automatically download on the first run.
We provide an app service that creates a linear regression model from a stock's full history and estimates the stock's price at a user-provided date. The mechanisms for this container are provided in the app/ folder.
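To give a feel for what such a model involves, here is a minimal plain-Python sketch of fitting a line to (day, close) pairs and reporting RMSE and r2. It is not the code in app/; the function names and the five closing prices are made up for illustration.

```python
from math import sqrt


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def evaluate(xs, ys, slope, intercept):
    """Return (RMSE, r2) of the fitted line on the given points."""
    n = len(xs)
    mean_y = sum(ys) / n
    residual = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    total = sum((y - mean_y) ** 2 for y in ys)
    return sqrt(residual / n), 1 - residual / total


# Made-up closing prices for five consecutive days (not real market data).
days = [0, 1, 2, 3, 4]
closes = [10.0, 12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(days, closes)
rmse, r2 = evaluate(days, closes, slope, intercept)

# Predicting a future price is just evaluating the line at a later day.
predicted = slope * 10 + intercept
```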
To run an app job, i.e. estimate a price at a given date, you can use the estimate_stock_price.py script we provide, located in the repository's root folder.
usage: estimate_stock_price.py [-h] symbol date
positional arguments:
  symbol      Stock listing to model for
  date        Predict stock's price at <yyyy-mm-dd>
optional arguments:
  -h, --help  show this help message and exit
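A command-line interface with this shape can be written with Python's argparse module. The sketch below is an assumption about how such a parser could look, not a copy of estimate_stock_price.py; in particular, the parse_date helper and its date validation are our own addition.

```python
import argparse
from datetime import datetime


def parse_date(text):
    """Convert a yyyy-mm-dd string into a date, or report a usage error."""
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        raise argparse.ArgumentTypeError("expected yyyy-mm-dd, got %r" % text)


def build_parser():
    """Build a parser with the two positional arguments from the usage text."""
    parser = argparse.ArgumentParser(prog="estimate_stock_price.py")
    parser.add_argument("symbol", help="Stock listing to model for")
    parser.add_argument("date", type=parse_date,
                        help="Predict stock's price at <yyyy-mm-dd>")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["GOOGL", "2021-04-10"])
```

Validating the date while parsing means a malformed date is rejected with a usage message before any job is started.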
For example, to estimate the value of GOOGL (Google) on April 10th, 2021, run the following in your shell environment:
./estimate_stock_price.py GOOGL 2021-04-10
The output should look something like this:
Running docker-compose up -d --no-deps app
Recreating spark_app_1 ... done
Running docker-compose logs -f --no-color app
Connecting to Redis
Connecting to Hive
Connecting to Spark
SELECT name FROM symbol_descriptions WHERE symbol="GOOGL"
Finding stock history of Alphabet Cl A (GOOGL)
SELECT date_, close FROM stocks WHERE symbol="GOOGL"
Transforming data for modeling
Creating model
Successfully built linear regression model
Coefficient: 0.184534
Intercept: -135076.886742
RMSE: 125.296515
r2: 0.838370
Inputting Apr 10 2021 into generated model
Estimated price of GOOGL at Apr 10 2021: $1,088.60
If the input is valid, an estimated price is returned. In the example above, our arguments gave an estimated price of $1,088.60.
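The estimate is the fitted line evaluated at the requested date. The sketch below plugs the coefficient and intercept from the log output above into that line. It assumes the model's x value is the date's ordinal day number (date.toordinal()); that is a guess on our part, and the actual transformation is defined in app/.

```python
from datetime import date

# Values copied from the example log output above (rounded to six decimals).
COEFFICIENT = 0.184534
INTERCEPT = -135076.886742


def estimate_price(day, coefficient=COEFFICIENT, intercept=INTERCEPT):
    """Evaluate the fitted line at a date.

    Assumption: x is the proleptic Gregorian ordinal of the date.
    The real feature used by the app may differ.
    """
    return coefficient * day.toordinal() + intercept


estimate = estimate_price(date(2021, 4, 10))
print("Estimated price: ${:,.2f}".format(estimate))
```

Under that assumption the result lands within a dollar of the logged $1,088.60 rather than matching it exactly, because the log prints rounded coefficients.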
We include a web application in our stack which lets you see charts of all the stocks that have been run through our stock prediction application. The homepage lists all symbols that have been queried, and each symbol links to a page plotting the respective stock's history.
The web application can be accessed by going to localhost:3000 in your browser.
Our project automatically loads data on start-up. This data has to be supplied by the developer beforehand.
A folder called "data" is located in the repository. It includes two files, stock_histories.csv.gz and symbol_descriptions.txt, which correspond to the stocks and symbol_descriptions tables, respectively, in our Hive database.
.
└── data
    ├── stock_histories.csv.gz
    └── symbol_descriptions.txt
We preprocessed the raw data before loading it with Hive. See /hive/hive_setup.sql for the format the data must have in order to load.
We use market data collected using Redtide by @qks1lver. You can view and download a nearly-up-to-date version at kaggle.com/qks1lver/amex-nyse-nasdaq-stock-histories.
We concatenated the AMEX.txt, NASDAQ.txt and NYSE.txt files to make symbol_descriptions.txt, removing repeats of the first row.
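This concatenation step can be sketched in a few lines of Python. The function name concatenate_listings is hypothetical, and the sketch assumes that a "repeat of the first row" is any later line identical to the very first line written.

```python
def concatenate_listings(sources, destination):
    """Join listing files into one file, keeping the first row only once."""
    first_row = None
    with open(destination, "w", encoding="utf-8") as out:
        for source in sources:
            with open(source, encoding="utf-8") as handle:
                for line in handle:
                    # Make sure a file without a trailing newline does not
                    # run into the first line of the next file.
                    if not line.endswith("\n"):
                        line += "\n"
                    if first_row is None:
                        first_row = line  # remember the very first row
                    elif line == first_row:
                        continue  # skip any later repeat of that row
                    out.write(line)
```

It could then be called as concatenate_listings(["AMEX.txt", "NASDAQ.txt", "NYSE.txt"], "symbol_descriptions.txt").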
For the stock histories, we first merged all the full_history/<symbol>.csv files together, with the <symbol> in the filename used as a new column value. This was achieved with a Python script, available at /scripts/merge_stock_histories.py, which produces stock_histories.csv. Having one file dramatically reduced the time it took Hive to load our data. Note that this was only the case because of how we stored our data.
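The merge can be illustrated as follows. This is a simplified sketch and not the contents of /scripts/merge_stock_histories.py. It assumes every per-symbol file starts with the same header row, and the name and position of the added symbol column are our own choice.

```python
import csv
from pathlib import Path


def merge_histories(history_dir, destination):
    """Merge <symbol>.csv files into one CSV with a leading symbol column.

    The destination should live outside history_dir so that it is not
    picked up as an input file.
    """
    header_written = False
    with open(destination, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for path in sorted(Path(history_dir).glob("*.csv")):
            symbol = path.stem  # "GOOGL.csv" -> "GOOGL"
            with open(path, newline="", encoding="utf-8") as handle:
                reader = csv.reader(handle)
                header = next(reader, None)
                if header is None:
                    continue  # skip empty files
                if not header_written:
                    # Write the shared header once, with the new column.
                    writer.writerow(["symbol"] + header)
                    header_written = True
                for row in reader:
                    writer.writerow([symbol] + row)
```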
We then compressed the generated file using gzip to make stock_histories.csv.gz. We load from a compressed file because we want to distribute the sample dataset in a smaller file.
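If the gzip command-line tool is not at hand, the same compression can be done with Python's standard gzip module. A minimal sketch; the function name gzip_file is ours and is not part of the repository.

```python
import gzip
import shutil


def gzip_file(source, destination):
    """Write a gzip-compressed copy of source to destination."""
    with open(source, "rb") as raw, gzip.open(destination, "wb") as packed:
        # Stream in chunks so large files never have to fit in memory.
        shutil.copyfileobj(raw, packed)
```

For example, gzip_file("stock_histories.csv", "stock_histories.csv.gz") would produce the compressed file described above.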
We have licensed everything that we wrote for this stack as MIT :)