By Ivan Sekulic and Rob Romijnders
Get the code from remote:

```bash
git clone https://github.com/RobRomijnders/ts_clust.git
cd ts_clust
```

Set up the Python virtual environment:

```bash
pipenv install
pipenv shell
```
Run the following scripts:

- `main.py` for training the auto encoder and saving the model in `log_tb/`. During training, the script prints three losses; if all goes well, each of them should drop below 1.0.
- `extract_representations.py` for extracting the representations for a given data set. This loads a saved model from the `log_tb/` directory and saves the representations in `output_rep/`.
- `plot_representations.py`, a small script to plot the representations in `output_rep/`. This script also shows how to use the representations for other purposes.
Download the data from the UCR time series database and put it in the `ts_clust/data/UCR_datasets` directory.
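As a quick check that the data is in place, a data set can be loaded with numpy. This is only a minimal sketch: the file name `ECG5000/ECG5000_TRAIN`, the comma delimiter, and the label-in-first-column layout follow the older UCR archive convention and are assumptions here; adjust them to the files you actually downloaded.

```python
# Minimal sketch for loading one UCR data set; the file name and the
# comma-separated, label-in-first-column layout are assumptions based on
# the older UCR archive format.
import numpy as np

data = np.loadtxt("data/UCR_datasets/ECG5000/ECG5000_TRAIN", delimiter=",")
labels = data[:, 0].astype(int)   # class label sits in the first column
series = data[:, 1:]              # remaining columns are the time series values
print(series.shape)               # (num_series, series_length)
```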
In this project, we set out to visualize and cluster time series. We have many time series, but they carry no labels and we don't know how they relate to each other. An auto encoder can compress a complex time series into a small latent code. The latent code captures the variability in each time series.
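The idea can be illustrated with a small sequence auto encoder. The snippet below is a sketch only, not the model in this repo: the Keras/LSTM layers, the sequence length, and the latent dimension are all assumptions made for the example.

```python
# Illustrative sequence auto encoder (not the repo's model); sizes are arbitrary.
import tensorflow as tf

seq_len, latent_dim = 128, 8

# Encoder: compress a univariate series of length seq_len into a small latent code
encoder = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(seq_len, 1)),
    tf.keras.layers.Dense(latent_dim),
])

# Decoder: reconstruct the series from the latent code
decoder = tf.keras.Sequential([
    tf.keras.layers.RepeatVector(seq_len, input_shape=(latent_dim,)),
    tf.keras.layers.LSTM(32, return_sequences=True),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1)),
])

series_in = tf.keras.Input(shape=(seq_len, 1))
auto_encoder = tf.keras.Model(series_in, decoder(encoder(series_in)))
auto_encoder.compile(optimizer="adam", loss="mse")  # reconstruction loss
```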
Another problem is clustering the data. In our big pile of time series, some naturally cluster together: they are similar in some semantic sense. Therefore, we run a clustering algorithm on the latent codes. Since the latent codes capture the variability in the time series, a cluster of latent codes represents a semantically similar set of time series.
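As a sketch of this step, the saved latent codes can be clustered with scikit-learn's k-means. The file name `codes.npy` and the number of clusters are hypothetical choices; use whatever `extract_representations.py` actually writes to `output_rep/`.

```python
# Sketch: k-means on saved latent codes. The file name "codes.npy" and
# n_clusters=5 are hypothetical, not taken from the repo.
import numpy as np
from sklearn.cluster import KMeans

codes = np.load("output_rep/codes.npy")      # shape: (num_series, latent_dim)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(codes)      # one cluster id per time series
```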
In this project, we make the following contributions:

- We present an auto encoder structure to compress time series. We use a compression loss as defined in the Wasserstein Auto-Encoder (a generic sketch of this loss follows after this list).
- We visualize some of the UCR time series and show how the latent codes of the auto encoder capture the semantics of the time series.
- We employ a clustering algorithm on the latent codes and show the accuracy of these clusters against ground-truth labels. (Note that our entire model is unsupervised.)
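For intuition on the compression loss: the Wasserstein Auto-Encoder penalizes the mismatch between the distribution of latent codes and a prior, and one common instantiation of that penalty is a maximum mean discrepancy (MMD) with an RBF kernel. The snippet below is a generic numpy sketch of that idea, not the exact loss used in this repo, and the kernel bandwidth is an arbitrary choice.

```python
# Generic MMD penalty between encoded codes and prior samples (illustration only).
import numpy as np

def rbf_kernel(x, y, bandwidth=1.0):
    # Pairwise RBF kernel between the rows of x and y
    sq_dists = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd_penalty(codes, prior_samples, bandwidth=1.0):
    # Biased estimate of the maximum mean discrepancy between the two samples
    k_xx = rbf_kernel(codes, codes, bandwidth).mean()
    k_yy = rbf_kernel(prior_samples, prior_samples, bandwidth).mean()
    k_xy = rbf_kernel(codes, prior_samples, bandwidth).mean()
    return k_xx + k_yy - 2.0 * k_xy

# Example: the penalty stays small when the codes already resemble the prior
codes = np.random.randn(64, 8)
prior = np.random.randn(64, 8)
print(mmd_penalty(codes, prior))
```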
TODO: mention somewhere that we use a few small data sets. Maybe mention the sample size for each data set that we use?