Prof. Gabriele Tolomei
MSc in Computer Science
La Sapienza, University of Rome
Author: Corsi Danilo (1742375) - corsi.1742375@studenti.uniroma1.it
- In this project I decided to build a Bitcoin price forecasting model in order to see whether it is possible to predict the price of Bitcoin using machine learning methods
- I will first introduce what Bitcoin is and what the aim of this project is
- Next we will see what data will be used and how to achieve the goal
- Followed by a description of the main stages of the project
- And finally draw the final conclusions
- Bitcoin is a decentralized cryptocurrency, created in 2009 by an anonymous inventor under the pseudonym of Satoshi Nakamoto
- It does not have a central bank behind it but relies on a network of nodes that manage it in a distributed, peer-to-peer mode
- It uses strong cryptography to validate and secure transactions
- These can be made through the Internet to anyone with a bitcoin address
- And are contained in a public ledger which is constantly updated and validated by nodes in the network
- Its value is determined by the market and the number of people using it
- This cryptocurrency has attracted the attention of many people in recent years; however, its price fluctuations can be extremely unpredictable
- In this context, predicting Bitcoin prices can be a competitive advantage for investors and traders, as it could allow them to make informed decisions on the right time to enter or exit the market
Analyze some machine learning techniques to understand, through the processing of historical data, how accurately the price of Bitcoin can be predicted and whether this can provide added value to cryptocurrency investors and traders
⚠️ Note: Because of the large size of the notebooks whose outputs contain the plots, I could not upload them to the E-Learning / GitHub platforms. Below are links to the notebooks with the outputs, viewable using Colab
- Block splitting:
  3.1. Linear Regression
- Walk forward splitting:
- Single splitting:
- I collected Bitcoin blockchain data using the API of the Blockchain.com website and price information from two popular exchanges, Binance and Kraken
- I decided to retrieve the most relevant data from the last four years up to the present day, a period that included moments of high volatility but also some sideways price movement
- The features taken into consideration were divided into several categories, from those that describe the price characteristics to those that go into more detail about Bitcoin's blockchain:
- Currency Statistics
  - ohlcv: stands for “Open, High, Low, Close and Volume” and is a list of the five types of data that are most common in financial analysis regarding price.
  - market-price: the average USD market price across major bitcoin exchanges.
  - trade-volume-usd: the total USD value of trading volume on major bitcoin exchanges.
  - total-bitcoins: the total number of mined bitcoin that are currently circulating on the network.
  - market-cap: the total USD value of bitcoin in circulation.
- Block Details
  - blocks-size: the total size of the blockchain minus database indexes in megabytes.
  - avg-block-size: the average block size over the past 24 hours in megabytes.
  - n-transactions-total: the total number of transactions on the blockchain.
  - n-transactions-per-block: the average number of transactions per block over the past 24 hours.
- Mining Information
  - hash-rate: the estimated number of terahashes per second the bitcoin network is performing in the last 24 hours.
  - difficulty: a relative measure of how difficult it is to mine a new block for the blockchain.
  - miners-revenue: total value of coinbase block rewards and transaction fees paid to miners.
  - transaction-fees-usd: the total USD value of all transaction fees paid to miners. This does not include coinbase block rewards.
- Network Activity
  - n-unique-addresses: the total number of unique addresses used on the blockchain.
  - n-transactions: the total number of confirmed transactions per day.
  - estimated-transaction-volume-usd: the total estimated value in USD of transactions on the blockchain.
- The project is structured as follows:
- First, I retrieved all the data and processed them in order to decide how to use the features
- Then different models are trained using different methods of splitting the dataset, which we will see later
- And then the final results are collected and conclusions are drawn
- The project was carried out with Apache Spark but during some phases I converted the Spark dataframe to a Pandas one to make some plots
Features
- After obtaining all the data, other features were added such as:
  - next-market-price: represents the price of Bitcoin for the next day, on which predictions will be made
  - simple-moving-averages: indicators that calculate the average price over a specified number of days
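The two derived features above can be illustrated with a small library-free sketch. The project itself works on Spark dataframes; the function names, the window length of 3 and the toy prices below are assumptions made only for illustration.

```python
from typing import List, Optional


def next_market_price(prices: List[float]) -> List[Optional[float]]:
    """Shift the series by one step so each row holds the following day's price.

    The last row has no following day, so its target is None.
    """
    return prices[1:] + [None]


def simple_moving_average(prices: List[float], window: int) -> List[Optional[float]]:
    """Average of the most recent `window` prices; None until enough history exists."""
    averages: List[Optional[float]] = []
    for i in range(len(prices)):
        if i + 1 < window:
            averages.append(None)
        else:
            averages.append(sum(prices[i + 1 - window : i + 1]) / window)
    return averages


# Toy daily prices (hypothetical values, not taken from the dataset)
market_price = [100.0, 102.0, 101.0, 105.0, 107.0]
target = next_market_price(market_price)
sma_3 = simple_moving_average(market_price, window=3)
```

Rows whose target or moving average is `None` would typically be dropped before training.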
- Then all the features have been divided into three distinct final groups:
  - Base features: contains all the price features
  - Base + most / least correlated features: contains the previous ones plus the additional blockchain features, divided based on their correlation value with the price
- If this value is greater than or equal to 0.6 they are considered most correlated, least correlated otherwise
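A minimal sketch of the correlation-based grouping, assuming the Pearson coefficient is the correlation measure (the text does not name one) and that the raw value, not its absolute value, is compared with the 0.6 threshold. The feature values are toy numbers.

```python
import math
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def split_by_correlation(
    features: Dict[str, List[float]], price: List[float], threshold: float = 0.6
) -> Tuple[List[str], List[str]]:
    """Return (most_correlated, least_correlated) feature names."""
    most, least = [], []
    for name, values in features.items():
        if pearson(values, price) >= threshold:
            most.append(name)
        else:
            least.append(name)
    return most, least


# Toy data (hypothetical values, for illustration only)
price = [10.0, 12.0, 11.0, 15.0, 18.0]
features = {
    "hash-rate": [1.0, 1.3, 1.2, 1.6, 2.0],
    "avg-block-size": [1.2, 0.9, 1.3, 1.0, 1.1],
}
most_corr, least_corr = split_by_correlation(features, price)
```

Each resulting list, combined with the base price features, would correspond to one of the feature groups described above.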
Splitting
- Then the whole dataset is split into two sets:
  - Train / Validation set: used to train the models and validate their performance
  - Test set: used to perform price prediction on never-before-seen data; in this case the last 3 months of the original dataset are used
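The chronological hold-out can be sketched as follows. This assumes "3 months" means the final 90 days and that rows are `(timestamp, price)` pairs sorted by time; both are illustrative assumptions, as is the function name.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Row = Tuple[datetime, float]


def split_train_test(rows: List[Row], test_days: int = 90) -> Tuple[List[Row], List[Row]]:
    """Keep the most recent `test_days` days as the test set, the rest for train / validation."""
    cutoff = rows[-1][0] - timedelta(days=test_days)
    train_valid = [row for row in rows if row[0] <= cutoff]
    test = [row for row in rows if row[0] > cutoff]
    return train_valid, test


# Toy series: one row per day for a year (hypothetical prices)
start = datetime(2023, 1, 1)
rows = [(start + timedelta(days=d), 20000.0 + d) for d in range(365)]
train_valid, test = split_train_test(rows)
```

No shuffling is applied, so every test row is later in time than every train / validation row.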
Splitting methods
- Three different splitting methods were used to train and validate the models in order to figure out which one works best for this problem
- In the last case (single split) I consider only 2 years of data instead of the 4 used by the other methods, so as to best benefit from the short-term trend
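The three methods are only named in the text, so the following is a sketch of one common interpretation and not necessarily how the project builds its splits. It assumes block splitting cuts the data into consecutive non-overlapping blocks, each with its own train / validation part; walk forward splitting uses an expanding training window followed by a fixed-size validation window; and single splitting is one chronological cut. All function names and parameters are hypothetical.

```python
from typing import List, Tuple

Split = Tuple[range, range]  # (train row indices, validation row indices)


def block_splits(n_rows: int, n_blocks: int, train_fraction: float = 0.8) -> List[Split]:
    """Consecutive non-overlapping blocks, each split chronologically."""
    block_size = n_rows // n_blocks
    splits = []
    for b in range(n_blocks):
        start = b * block_size
        end = start + block_size
        cut = start + int(block_size * train_fraction)
        splits.append((range(start, cut), range(cut, end)))
    return splits


def walk_forward_splits(n_rows: int, n_splits: int, valid_size: int) -> List[Split]:
    """Expanding training window, always validated on the rows that follow it."""
    first_train_end = n_rows - n_splits * valid_size
    splits = []
    for s in range(n_splits):
        train_end = first_train_end + s * valid_size
        splits.append((range(0, train_end), range(train_end, train_end + valid_size)))
    return splits


def single_split(n_rows: int, train_fraction: float = 0.8) -> List[Split]:
    """One chronological cut: earlier rows train, later rows validate."""
    cut = int(n_rows * train_fraction)
    return [(range(0, cut), range(cut, n_rows))]


blocks = block_splits(100, n_blocks=4)
walks = walk_forward_splits(100, n_splits=3, valid_size=10)
single = single_split(100)
```

In all three cases the validation rows come after the training rows they are paired with, which avoids leaking future prices into training.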
Models and metrics
- Several types of regression algorithms, both linear and tree-based, will be tested to see their differences:
  - Linear Regression
  - Generalized Linear Regression
  - Random Forest Regressor
  - Gradient Boosting Tree Regressor
- Different types of metrics will be used to get a complete picture of the performance of the various models, including:
  - RMSE (Root Mean Squared Error)
  - MSE (Mean Squared Error)
  - MAE (Mean Absolute Error)
  - MAPE (Mean Absolute Percentage Error)
  - R2 (R-squared)
  - Adjusted R2
  - Accuracy
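For reference, the regression metrics listed above can be computed as in this sketch, which uses their standard textbook definitions. The function name and the toy values are assumptions; `n_features` is the number of predictors needed by adjusted R2.

```python
import math
from typing import Dict, List


def regression_metrics(actual: List[float], predicted: List[float], n_features: int) -> Dict[str, float]:
    """Standard regression metrics for one set of predictions."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    # MAPE is expressed as a percentage and assumes no actual value is zero
    mape = sum(abs(e / a) for e, a in zip(errors, actual)) / n * 100
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"rmse": math.sqrt(mse), "mse": mse, "mae": mae, "mape": mape, "r2": r2, "adj_r2": adj_r2}


# Toy values (hypothetical, for illustration only)
actual = [100.0, 110.0, 120.0, 130.0]
predicted = [102.0, 108.0, 123.0, 129.0]
metrics = regression_metrics(actual, predicted, n_features=1)
```

R2 can become negative when a model performs worse than always predicting the mean.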
- Since predicting the price accurately is very difficult, I tried to compute how good the models are at predicting whether the price will go up or down like this:
- For each prediction, I am going to consider it correct if the actual price goes up or down and the predicted price follows that trend, wrong if vice versa
- After that I count the number of correct predictions among all of them
- And finally I compute the overall percentage of accuracy
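The steps above can be sketched as follows. The sketch assumes that "up" and "down" are measured against the current day's price, and that an unchanged price counts as "not up"; the text does not specify either point, and the function name is hypothetical.

```python
from typing import List


def direction_accuracy(current: List[float], actual_next: List[float], predicted_next: List[float]) -> float:
    """Percentage of rows where the predicted direction matches the actual one."""
    correct = 0
    for cur, act, pred in zip(current, actual_next, predicted_next):
        actual_up = act > cur       # did the price really rise?
        predicted_up = pred > cur   # did the model predict a rise?
        if actual_up == predicted_up:
            correct += 1
    return correct / len(current) * 100


# Toy values (hypothetical): the second prediction points the wrong way
current = [100.0, 105.0, 103.0, 108.0]
actual_next = [105.0, 103.0, 108.0, 107.0]
predicted_next = [102.0, 106.0, 104.0, 105.0]
accuracy = direction_accuracy(current, actual_next, predicted_next)
```

With these toy values three of the four predicted directions are correct, so `accuracy` is 75.0.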
Pipeline
- Concerning the train / validation pipeline, it is structured like this:
  - First of all, I saw how the default models behave with the three feature groups, with and without normalisation applied to them
  - Then, for each model, the features that gave the most satisfactory results are chosen and hyperparameter tuning is performed to find the best parameters to use
  - Since the Block split or Walk forward split method is used during this stage, I compute a score for each set of parameters chosen by each split, assigning weights based on their frequency of occurrence, split belonging and RMSE value
  - Then, the overall score is calculated by putting together these weights for each set of parameters, and the one with the best score is chosen
  - After that, the performance of each model is validated by performing cross validation
  - And if the final results are satisfactory, the models are trained on the whole train / validation set and saved in order to make predictions on the test set
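The parameter-scoring step is described only in words, so the following is a purely illustrative sketch and not the project's actual formula. It assumes that each occurrence adds an equal frequency share, that later splits weigh more than earlier ones, and that the RMSE weight is the ratio between the best RMSE and the split's RMSE. The function name, the weights and the toy results are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Params = Tuple[Tuple[str, int], ...]  # hashable parameter set, e.g. (("maxDepth", 5),)


def choose_parameters(split_results: List[Tuple[Params, float]]) -> Tuple[Params, Dict[Params, float]]:
    """Score the best parameter set of every split and return the overall winner.

    `split_results` holds, in chronological split order, the parameter set
    that won on that split together with its validation RMSE.
    """
    n_splits = len(split_results)
    best_rmse = min(rmse for _, rmse in split_results)
    scores: Dict[Params, float] = defaultdict(float)
    for split_index, (params, rmse) in enumerate(split_results):
        frequency_weight = 1 / n_splits              # adds up for sets chosen by several splits
        split_weight = (split_index + 1) / n_splits  # assumption: later splits count more
        rmse_weight = best_rmse / rmse               # 1.0 for the lowest RMSE, lower otherwise
        scores[params] += frequency_weight + split_weight + rmse_weight
    best = max(scores, key=scores.get)
    return best, dict(scores)


# Toy tuning results (hypothetical values, for illustration only)
depth_5 = (("maxDepth", 5), ("numTrees", 20))
depth_10 = (("maxDepth", 10), ("numTrees", 20))
split_results = [(depth_5, 310.0), (depth_10, 250.0), (depth_5, 280.0)]
best_params, scores = choose_parameters(split_results)
```

Here the parameter set chosen by two of the three splits wins even though another set reached the single lowest RMSE, which is the kind of trade-off a combined score is meant to capture.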
- In this last phase, all results obtained up to that point are compared and final predictions on the test set are made
- The test set has been divided into further mini-sets (one week, fifteen days, one month and three months) to see how the models' performance degrades as time increases
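A minimal sketch of evaluating the test set over increasing horizons. It assumes one row per day and that each mini-set is a prefix of the test set (the first 7, 15, 30 and 90 days); the horizon names follow the plot file names in the results folder, while the function names and toy data are hypothetical.

```python
import math
from typing import Dict, List


def rmse(actual: List[float], predicted: List[float]) -> float:
    """Root mean squared error of one set of predictions."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def rmse_by_horizon(actual: List[float], predicted: List[float], horizons: Dict[str, int]) -> Dict[str, float]:
    """RMSE computed on the first `days` rows of the test set for each horizon."""
    return {name: rmse(actual[:days], predicted[:days]) for name, days in horizons.items()}


horizons = {"one_week": 7, "fifteen_days": 15, "one_month": 30, "three_months": 90}

# Toy data (hypothetical): the prediction error grows by 1 USD every day
actual = [30000.0] * 90
predicted = [30000.0 + day for day in range(90)]
horizon_scores = rmse_by_horizon(actual, predicted, horizons)
```

With an error that grows over time, the RMSE increases with the horizon, which is the degradation pattern the mini-sets are meant to expose.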
.
|-- README.md
|-- datasets
|   |-- output
|   |   |-- bitcoin_blockchain_data_15min_test.parquet
|   |   `-- bitcoin_blockchain_data_15min_train_valid.parquet
|   |-- raw
|   |   `-- bitcoin_blockchain_data_15min.parquet
|   `-- temp
|-- features
|   |-- base_and_least_corr_features.json
|   |-- base_and_most_corr_features.json
|   `-- base_features.json
|-- models
|   |-- GeneralizedLinearRegression
|   |-- GradientBoostingTreeRegressor
|   |-- LinearRegression
|   `-- RandomForestRegressor
|-- notebooks
|   |-- 1-data-crawling.ipynb
|   |-- 2-feature-engineering.ipynb
|   |-- 3-block-split_GeneralizedLinearRegression.ipynb
|   |-- 3-block-split_GradientBoostingTreeRegressor.ipynb
|   |-- 3-block-split_LinearRegression.ipynb
|   |-- 3-block-split_RandomForestRegressor.ipynb
|   |-- 4-walk-forward-split_GeneralizedLinearRegression.ipynb
|   |-- 4-walk-forward-split_GradientBoostingTreeRegressor.ipynb
|   |-- 4-walk-forward-split_LinearRegression.ipynb
|   |-- 4-walk-forward-split_RandomForestRegressor.ipynb
|   |-- 5-single-split_GeneralizedLinearRegression.ipynb
|   |-- 5-single-split_GradientBoostingTreeRegressor.ipynb
|   |-- 5-single-split_LinearRegression.ipynb
|   |-- 5-single-split_RandomForestRegressor.ipynb
|   |-- 6-final-scores.ipynb
|   `-- images
|       |-- Drawings.excalidraw
|       |-- block-splits.png
|       |-- single-split.png
|       `-- walk-forward-splits.png
|-- presentation
|   |-- presentation.pptx
|-- requirements.txt
|-- results
|   |-- block_splits
|   |   |-- GeneralizedLinearRegression_accuracy.csv
|   |   |-- GeneralizedLinearRegression_all.csv
|   |   |-- GeneralizedLinearRegression_rel.csv
|   |   |-- GradientBoostingTreeRegressor_accuracy.csv
|   |   |-- GradientBoostingTreeRegressor_all.csv
|   |   |-- GradientBoostingTreeRegressor_rel.csv
|   |   |-- LinearRegression_accuracy.csv
|   |   |-- LinearRegression_all.csv
|   |   |-- LinearRegression_rel.csv
|   |   |-- RandomForestRegressor_accuracy.csv
|   |   |-- RandomForestRegressor_all.csv
|   |   `-- RandomForestRegressor_rel.csv
|   |-- final
|   |   |-- final.csv
|   |   `-- plots
|   |       |-- default_train_val_r2.png
|   |       |-- default_train_val_r2_non_negative.png
|   |       |-- default_train_val_rmse.png
|   |       |-- final_test_accuracy.png
|   |       |-- final_test_fifteen_days_prediction.png
|   |       |-- final_test_one_month_prediction.png
|   |       |-- final_test_one_week_prediction.png
|   |       |-- final_test_r2.png
|   |       |-- final_test_r2_non_negative.png
|   |       |-- final_test_rmse.png
|   |       |-- final_test_three_months_prediction.png
|   |       |-- final_train_val_accuracy.png
|   |       |-- final_train_val_r2.png
|   |       |-- final_train_val_r2_non_negative.png
|   |       `-- final_train_val_rmse.png
|   |-- single_split
|   |   |-- GeneralizedLinearRegression_accuracy.csv
|   |   |-- GeneralizedLinearRegression_all.csv
|   |   |-- GeneralizedLinearRegression_rel.csv
|   |   |-- GradientBoostingTreeRegressor_accuracy.csv
|   |   |-- GradientBoostingTreeRegressor_all.csv
|   |   |-- GradientBoostingTreeRegressor_rel.csv
|   |   |-- LinearRegression_accuracy.csv
|   |   |-- LinearRegression_all.csv
|   |   |-- LinearRegression_rel.csv
|   |   |-- RandomForestRegressor_accuracy.csv
|   |   |-- RandomForestRegressor_all.csv
|   |   `-- RandomForestRegressor_rel.csv
|   `-- walk_forward_splits
|       |-- GeneralizedLinearRegression_accuracy.csv
|       |-- GeneralizedLinearRegression_all.csv
|       |-- GeneralizedLinearRegression_rel.csv
|       |-- GradientBoostingTreeRegressor_accuracy.csv
|       |-- GradientBoostingTreeRegressor_all.csv
|       |-- GradientBoostingTreeRegressor_rel.csv
|       |-- LinearRegression_accuracy.csv
|       |-- LinearRegression_all.csv
|       |-- LinearRegression_rel.csv
|       |-- RandomForestRegressor_accuracy.csv
|       |-- RandomForestRegressor_all.csv
|       `-- RandomForestRegressor_rel.csv
`-- utilities
    |-- config.py
    |-- feature_engineering_utilities.py
    |-- final_scores_utilities.py
    |-- imports.py
    |-- train_validation_utilities.py
- bitcoin_blockchain_data_15min_test.parquet: dataset used in the final phase of the project to perform price prediction on never-before-seen data
- bitcoin_blockchain_data_15min_train_validation.parquet: dataset used to train and validate the models
- bitcoin_blockchain_data_15min.parquet: original dataset obtained by making calls to the APIs
- base_and_least_corr_features.json: contains the name of the currency features plus the least relevant features with respect to the price of Bitcoin
- base_and_most_corr_features.json: contains the name of the currency features plus the most relevant features with respect to the price of Bitcoin
- base_features.json: contains the name of the currency features of Bitcoin
- Each folder (GeneralizedLinearRegression, GradientBoostingTreeRegressor, LinearRegression and RandomForestRegressor) contains the trained model with the best parameters, ready to be used to perform price prediction on never-before-seen data
- 1-data-crawling.ipynb: crawls data on Bitcoin's price and blockchain by querying APIs
- 2-feature-engineering.ipynb: adds useful features regarding the price of Bitcoin, visualizes the data and performs feature selection
- 3-5-<splitting-method>_<model>.ipynb: performs training / validation of the models according to the chosen split method (block split, walk forward split or single split)
- 6-final-scores.ipynb: displays the final scores and makes predictions on the test set with the models trained on the whole train / validation set
- Based on the splitting method, results regarding metrics and accuracy are collected (including the final ones).
- config.py: contains global variables that can be used throughout the project
- feature_engineering_utilities.py: contains the methods used in the feature engineering notebook
- final_scores_utilities.py: contains the methods used in the notebook of final scores
- imports.py: contains imports of external libraries
- train_validation_utilities.py: contains the methods used in the notebooks where models are trained and validated