This repository explores whether stock returns can be accurately predicted based on CNN image classification.
All code is located in the /stock_return_cnn folder.
The following packages are required to run all the code from this repository:
- pandas
- os
- PIL
- matplotlib
- random
- datetime
- gc
- pyts (to construct the GADF images)
- numpy
- optuna (to perform the hpo)
- torch (the models are trained using the Pytorch framework)
- torchvision
- pickle (to save intermediare results)
- sklearn (for the benchmark model)
The data used for this project is made available under CC0 and can be found at: https://www.kaggle.com/datasets/borismarjanovic/price-volume-data-for-all-us-stocks-etfs
To ensure that the project runs correctly, download the data from the above link and populate the stock_return_cnn/Data
folder so that it contains the ETFs
and Stocks
folders from the downloaded data.
The notebooks are numbered in the order that they should be executed. Notebooks 0 and 1 have to do with turning the raw data into image files and performing some basic EDA. As this is a step that takes a lot of time (depending on the machine a few hours to a full day), a folder with already processed images is provided.
- The folder "ModelData", contains the .csv files that outline where the train, test, & validation data can be found as well as the labels.
- The folder "stock_return_cnn/SampleData.zip", contains all the area chart and GADF image files. If you decide to skip over running notebooks 0 & 1 than you simply have to UNZIP AND THEN RENAME /SampleData to /Charts and then you should be able to run notebooks 2 and beyond without issue.
Notebooks 2_a and 2_b contain the code for the hyperparameter tuning. Again, this code takes a long time to run (On my machine it took around 12 hours). Therefore, these notebooks can be skipped as the result of the hyperparameter tuning is already incorporated into notebooks 3_a and 3_b.
Notebooks 3_a and 3_b contain the code to train the final area chart and GADF image models and perform some analysis on the trianing and results.
Lastly, notebook 4 contains the code to train the benchmark model.