Deploying a dockerized API to predict the daily average sentiment of financial news articles

1. Objectives

1.1. Build and host a predictive model on AWS with Python

1.2. Using a dataset of news articles, train a model to predict the average sentiment of the next day.

1.3. Host the model within a free-tier instance on AWS.

1.4. Build an endpoint that accepts parameters from a user and returns a time series of average sentiment values with the final value as a prediction from the model.

2. Solution

2.1. PostgreSQL stores the articles themselves, their summaries, their sentiment scores and their categorization by several topics.

2.2. The endpoint accepts parameters from the user in a request like the following. Those parameters are the inputs to an ARIMA model.

{"hold out samples": 20, "lag observations": 3, "degree of differencing": 0, "moving average window": 0}

The user can change any of these parameters, but be aware that some combinations are computationally expensive, and the API runs on just a free-tier EC2 instance.
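As a rough sketch of how those four parameters could be turned into model inputs, the snippet below parses the example request into a hold-out size and an ARIMA order (p, d, q). The function name `parse_request` and the validation rules are my own illustration and are not taken from api.py; the mapping of "lag observations", "degree of differencing" and "moving average window" to p, d and q is assumed from the usual ARIMA terminology.

```python
import json
from typing import Tuple


def parse_request(body: str) -> Tuple[int, Tuple[int, int, int]]:
    """Turn the JSON request body into (hold_out, (p, d, q)).

    Hypothetical helper: the key names come from the example request,
    the mapping to (p, d, q) is an assumption.
    """
    params = json.loads(body)
    hold_out = int(params["hold out samples"])
    order = (
        int(params["lag observations"]),        # p: autoregressive lags
        int(params["degree of differencing"]),  # d: differencing steps
        int(params["moving average window"]),   # q: moving average terms
    )
    if hold_out < 1 or min(order) < 0:
        raise ValueError("hold out samples must be >= 1 and orders >= 0")
    return hold_out, order


example = (
    '{"hold out samples": 20, "lag observations": 3, '
    '"degree of differencing": 0, "moving average window": 0}'
)
hold_out, order = parse_request(example)
print(hold_out, order)
```

Larger values of p, d or q, or a larger hold-out size, mean more and slower model fits, which is where the cost on a small instance comes from.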

2.3. After accepting the inputs, the API script (api.py) performs a rolling forecast: it re-creates the ARIMA model after each new observation is received. Therefore, the model is able to adapt to new data easily.

2.4. This walk-forward validation is performed over the hold-out samples, and the model finally predicts the average sentiment of the articles for the next day.
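The rolling forecast described in 2.3 and 2.4 can be sketched roughly as follows. This is not the code in api.py: the names `walk_forward`, `persistence` and `arima_forecaster` are hypothetical, and the sample series is made up for illustration. The ARIMA wrapper assumes statsmodels is installed and is imported lazily, so the demo at the bottom runs with a simple persistence forecaster instead.

```python
import math
from typing import Callable, List, Sequence, Tuple

Forecaster = Callable[[Sequence[float]], float]


def persistence(history: Sequence[float]) -> float:
    """Baseline: tomorrow's value equals the last observed value."""
    return history[-1]


def arima_forecaster(order: Tuple[int, int, int]) -> Forecaster:
    """Refit an ARIMA model on the full history at every step.

    Assumes statsmodels is available; it is only imported when called.
    """
    def forecast(history: Sequence[float]) -> float:
        from statsmodels.tsa.arima.model import ARIMA
        fitted = ARIMA(list(history), order=order).fit()
        return float(fitted.forecast(steps=1)[0])
    return forecast


def walk_forward(
    series: Sequence[float], hold_out: int, forecaster: Forecaster
) -> Tuple[List[float], float, float]:
    """Return (hold-out predictions, RMSE, next-day prediction)."""
    if not 0 < hold_out < len(series):
        raise ValueError("hold_out must be between 1 and len(series) - 1")
    history = list(series[:-hold_out])
    test = list(series[-hold_out:])
    predictions = []
    for actual in test:
        predictions.append(forecaster(history))  # predict one step ahead
        history.append(actual)                   # then reveal the true value
    rmse = math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predictions, test)) / len(test)
    )
    next_day = forecaster(history)  # forecast beyond the last observation
    return predictions, rmse, next_day


# Made-up daily average sentiment values, for illustration only.
series = [0.10, 0.12, 0.08, 0.11, 0.15, 0.13]
preds, rmse, next_day = walk_forward(series, hold_out=2, forecaster=persistence)
print(preds, round(rmse, 4), next_day)
```

Passing `arima_forecaster((3, 0, 0))` instead of `persistence` would refit an ARIMA model once per hold-out sample, which is why large hold-out sizes are slow.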

3. Results and Pipeline

You can test the endpoint with Postman at:

http://ec2-54-79-143-239.ap-southeast-2.compute.amazonaws.com/API/PREDICT_AVG_SENTIMENT

The endpoint accepts parameters from the user in a request like the following.

{"hold out samples": 20, "lag observations": 3, "degree of differencing": 0, "moving average window": 0}
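If you prefer a script to Postman, a call could look roughly like the sketch below. I am assuming here that the endpoint takes the parameters as a JSON body in a POST request; the HTTP method, the Content-Type header and the helper names `build_request` and `predict` are illustrative assumptions, so adjust them to whatever works in Postman.

```python
import json
import urllib.request

URL = (
    "http://ec2-54-79-143-239.ap-southeast-2.compute.amazonaws.com"
    "/API/PREDICT_AVG_SENTIMENT"
)


def build_request(params: dict, url: str = URL) -> urllib.request.Request:
    """Build the HTTP request (assumed: POST with a JSON body)."""
    body = json.dumps(params).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(params: dict, timeout: float = 60.0) -> str:
    """Send the request and return the raw response text."""
    with urllib.request.urlopen(build_request(params), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


params = {
    "hold out samples": 20,
    "lag observations": 3,
    "degree of differencing": 0,
    "moving average window": 0,
}
request = build_request(params)  # nothing is sent until predict() is called
print(request.get_method(), request.full_url)
```

Calling `predict(params)` sends the request; a generous timeout is sensible because some parameter combinations take a while on the free-tier instance.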

3.1. Why ARIMA?

All models were tested on a hold-out set (33% of the dataset).

Even though regularized regression such as Ridge performed slightly better than the ARIMA models, I picked ARIMA because it can be adapted easily to new data by incorporating each new observation into the model. Among the ARIMA variants, purely autoregressive orders (p,0,0) with p > 1 worked better.

Model	RMSE
Persistence(Baseline)	0.124
Autoregressive (X,0,0)	0.098
ARIMA(X,X,0)	0.11
Linear Regression	8*10>
Lasso Regression	0.094
Ridge Regression	0.087
Decision Tree Regression	0.11
XGB Regressor	0.11
Univariate LSTM	0.092

Please note that the cells related to the LSTM approach will not run in the container environment because TensorFlow is not installed there. I developed the LSTM approach in my local environment to save time.

3.2. Pipeline

The Trial1 notebook has all the details about the database connection, EDA, basic feature engineering, and the experiments and performance of these models.

Some dataframes were inspected with the pandas profiling library. There are two HTML outputs for this purpose. The bigger one couldn't be uploaded here, but you can pull the images from DockerHub to access it:

https://hub.docker.com/repository/docker/robeespi/roblast27

Some EDA activities and basic feature engineering techniques explored:

Pandas profiling (the reports are in the Docker container as output2.html and output3.html; output2 is the EDA of the SQL query result, and output3 covers the dataframe obtained by grouping the timestamp by day and adding category and sector as dummy variables)
Lag plots
Autocorrelation plots
Plots of the response variable's distribution against the other variables in the dataset
Correlations
Category and sector as dummy variables to run regressions
Feature importance, although the results were not conclusive
There are three timestamps in the data; I picked the one with the most distinct observations and the longest time span.
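The grouping of the timestamp by day, which produces the daily average sentiment series the models are trained on, can be illustrated with the small sketch below. The notebook does this on the real dataframe; here I use plain Python and made-up rows, and the function name `daily_average` and the (timestamp, sentiment) row layout are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date, datetime
from typing import Dict, Iterable, List, Tuple


def daily_average(rows: Iterable[Tuple[str, float]]) -> List[Tuple[date, float]]:
    """Group (ISO timestamp, sentiment) rows by calendar day and average them."""
    buckets: Dict[date, List[float]] = defaultdict(list)
    for timestamp, sentiment in rows:
        day = datetime.fromisoformat(timestamp).date()
        buckets[day].append(sentiment)
    # Sort by day so the result is a proper time series.
    return [(day, sum(v) / len(v)) for day, v in sorted(buckets.items())]


# Made-up article rows, for illustration only.
rows = [
    ("2020-01-02T08:15:00", 1.0),
    ("2020-01-01T09:30:00", 0.5),
    ("2020-01-01T17:45:00", 0.25),
]
series = daily_average(rows)
for day, avg in series:
    print(day.isoformat(), avg)
```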

4. Future Work

The univariate LSTM approach shows good performance but also signs of overfitting, so there is still room to find suitable hyperparameters. AutoML/DL and/or a multivariate approach using an attention mechanism will be explored.
