Clustering Financial Time Series

This reporsitory contains the scripts and datasets used in my Master's Thesis on "Clustering Financial Time Series". The numerical experiments have been conducted using Python and R languages. One can find the necessary commands written below to reproduce the results in the thesis.

Python part

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


import python_scripts.utils as tu
from python_scripts.hier_clust import *

Here I illustrate the analysis on the Synthetic Dataset 1 and the same steps have been performed on Synthetic Dataset 2.

Synthetic Dataset 1

df, true_cluster = tu.get_synthetic_data('datasets/synthetic_data1.csv')
res_df, true_cluster = tu.get_synthetic_data('results/synthetic_data1_res.csv')

plt.rcParams['figure.figsize'] = (8.0, 4.0)
tu.plot_series(df, nr_series_per_class=5)

tu.plot_one_per_class(df, nr_series_per_class=5)

measures = ['euclidean', 'dtw', 'corr1', 'corr2',
            'cross_corr1', 'cross_corr2', 'cross_corr3',
            'acf', 'pacf', 'rccf1', 'rccf2', 'rccf3']

get_sim_index(measures, df, res_df, true_cluster)

	single	complete	average
euclidean	0.363636	0.380952	0.380952
dtw	0.380952	0.439935	0.380952
corr1	0.334127	0.243590	0.294444
corr2	0.334127	0.243590	0.243590
cross_corr1	1.000000	1.000000	1.000000
cross_corr2	1.000000	1.000000	1.000000
cross_corr3	1.000000	1.000000	1.000000
acf	0.321429	0.321429	0.321429
pacf	0.363636	0.363636	0.363636
rccf1	1.000000	1.000000	1.000000
rccf2	1.000000	1.000000	1.000000
rccf3	0.722222	0.949495	0.949495

chosen_measures = ['cross_corr2', 'cross_corr3',
                   'rccf2', 'rccf3']
cluster_numbers = list(range(2,11))

get_sil_index(chosen_measures, cluster_numbers, df, res_df)

	cross_corr2	cross_corr3	rccf2	rccf3
2	0.530485	0.739282	0.530762	0.692445
3	0.329060	0.426181	0.401696	0.477057
4	0.323237	0.420283	0.288671	0.288671
5	0.316615	0.413631	0.270898	0.270898
6	0.298892	0.404116	0.228582	0.228582
7	0.279041	0.384262	0.229174	0.229174
8	0.279625	0.384827	0.225471	0.225471
9	0.282080	0.387268	0.184505	0.184505
10	0.095913	0.095913	0.186782	0.186782

clustering = HClust(df, true_cluster, 'cross_corr3')
clustering.plot_dendrogram()

clustering1 = HClust(df, true_cluster, 'cross_corr2')
clustering2 = HClust(df, true_cluster, 'cross_corr3')

fig = plt.figure(figsize=(20, 8))
fig.subplots_adjust(hspace=0.4, wspace=0.02)

fig.add_subplot(1, 2, 1)
clustering1.plot_heatmap(xlab='$d_{CCF_2}$')

fig.add_subplot(1, 2, 2)
clustering2.plot_heatmap(xlab='$d_{CCF_3}$')

Clustering Stock Prices

stock_data = pd.read_csv('datasets/nyse_data.csv, index_col=0)
clustering = HClust(data=stock_data, ground_truth=None,
                     dist_func='cross_corr3', verbose=True)
clustering.dist_mat.to_csv('results/diss_mat_ccf3.csv', index=None, header=None)

Dissimilarity computation: 100% [-------------------------------] Time: 0:32:53

diss_mat_ccf2 = pd.read_csv('results/diss_mat_ccf2.csv', header=None, index_col=None)
diss_mat_ccf3 = pd.read_csv('results/diss_mat_ccf3.csv', header=None, index_col=None)

diss_ccf2 = np.triu(diss_mat_ccf2.values,1).flatten()
diss_ccf3 = np.triu(diss_mat_ccf3.values,1).flatten()
plt.hist(diss_ccf2[diss_ccf2!=0], bins=50, alpha=0.5, label='${CCF_2}$')
plt.hist(diss_ccf3[diss_ccf3!=0], bins=50, alpha=0.5, label='${CCF_3}$')
plt.xlabel('Dissimilarity')
plt.ylabel('Frequency')
plt.legend(fontsize=14);

sectors = pd.read_csv('datasets/sectors.csv', header=None)

Converting sector categories into numeric values.

from sklearn.preprocessing import LabelEncoder
sectors_numeric = LabelEncoder().fit_transform(sectors)

get_similarities(diss_mat_ccf2, ground_truth=sectors_numeric)

	single	complete	average
	0.087304	0.325193	0.145864

permutations = pd.read_csv('results/null_distribution1.csv')
permutations.plot(kind='hist', bins=50, legend=False)
s0 = 309
plt.plot(s0, 0.5, 'ro')
plt.arrow(s0, 20, 0, -16, length_includes_head=True,
          head_width=5, head_length=4)
plt.xlabel('Number of pure edges');

R part

source('R_scripts/clustering.R')
source('R_scripts/mst.R')

df <- read.csv('datasets/synthetic_data1.csv')
nr.series <- 5
true_cluster <- c(rep(1, nr.series), rep(2,nr.series),
                  rep(1, nr.series), rep(2, nr.series))

This snippet will return the similarity index of the clustering using Piccollo and Maharaj distances with single linkage method. Optionally one can plot the resulting dendrograms by setting plot=TRUE. Here I have used the TSclust package.

cluster_eval(df, dist.method='AR.PIC',
             linkage.method = 'single',
             true_cluster, plot=F)
             
cluster_eval(df, dist.method='AR.MAH',
             linkage.method = 'single',
             true_cluster, plot=F)

This snippet fits an AR or ARIMA(p,1,0) type of model on each time series in the datasets and returns the model residuals for RCCF dissimilarity measure.

res.df <- get.residuals(df)

write.csv(res.df, file="results/synthetic_data1_res.csv",
          row.names = F)

Plotting the minimum spanning tree with threshold 1 on the edge weights.

diss_data <- read.csv('results/diss_mat_ccf3.csv', header=F)
sectors <- read.csv('datasets/sectors.csv', header=F)
sectors <- sectors$V1
sectors.numeric <- as.numeric(sectors)

gr <- graph.adjacency(as.matrix(diss_data),
                      mode='undirected',
                      weighted = T)
mstree <- igraph::mst(gr)

final_mstree <- plot.mst(mstree, names=sectors.numeric,
                         pallete='Paired',
                         threshold=1,
                         save.fig = F, 
                         fig.size=c(6,4))

Running the permutation test will result in the p-value of the test and the resulting permutations will be saved as a csv file.

permutation.test(final_mstree, 10^4)

Plotting the histogram of clusters in order to see the alignment between the obtained clusters and the provided categories (sectors).

df <- read.csv('results/clusters_complete_3.csv')
df$clusters <- df$clusters + 1
df$gt_num <- as.character(as.numeric(df$gt))

plot.hist(df, save.fig=T, fig.size=c(6,4))

NshanPotikyan / ClusteringFTS

Clustering Financial Time Series

Python part

Synthetic Dataset 1

Clustering Stock Prices

R part

About

Languages