markolalovic / tda-digits

Topological features applied to the digits data set

Home Page: https://markolalovic.github.io/tda-digits/

Topological features applied to the MNIST data set

Extraction of topological features that can be used as an input to standard algorithms to obtain qualitative geometric information

Intro figure

Introduction

This repository contains the source code for a tutorial on the application of computational topology in machine learning. To illustrate the use of persistent homology in machine learning, we apply it to the MNIST data set of handwritten digits.

You can find the blog post here or check the interactive example here.

Description

The main problem we are trying to solve is how to extract topological features that can be used as input to standard machine learning algorithms. We use an approach similar to the one described in [1].

From each image, we first construct a graph: the pixels of the image correspond to the vertices of the graph, and edges join adjacent pixels; see Figure A and Figure B. We then extract the 0- and 1-dimensional topological features, which are counted by the Betti numbers. For example, a torus has one connected component, so its zeroth Betti number is 1, and two independent cycles (loops), so its first Betti number is 2; see Figure C.
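The graph construction above can be sketched in a few lines. This is a minimal illustration and not the code from src/tda_digits.py: the function name `betti_numbers`, the `'#'` encoding of foreground pixels, and the choice of 4-adjacency (horizontal and vertical neighbours only) are all assumptions. It also assumes one-pixel-wide strokes, since a solid 2x2 block of pixels would otherwise be counted as a loop.

```python
def betti_numbers(image):
    """Return (b0, b1) of the pixel graph of a small binary image.

    image: list of equal-length strings; '#' marks a foreground pixel.
    Assumption: edges join horizontally or vertically adjacent pixels.
    """
    pixels = [(r, c) for r, row in enumerate(image)
              for c, ch in enumerate(row) if ch == "#"]
    parent = {p: p for p in pixels}  # union-find forest over the vertices

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    n_edges = 0
    for (r, c) in pixels:
        # Looking only down and right visits every edge exactly once.
        for q in ((r + 1, c), (r, c + 1)):
            if q in parent:
                n_edges += 1
                parent[find((r, c))] = find(q)

    b0 = len({find(p) for p in pixels})   # connected components
    b1 = n_edges - len(pixels) + b0       # independent cycles of a graph
    return b0, b1


zero = ["###",
        "#.#",
        "###"]

eight = ["###",
         "#.#",
         "###",
         "#.#",
         "###"]

print(betti_numbers(zero))   # (1, 1): one component, one loop
print(betti_numbers(eight))  # (1, 2): one component, two loops
```

For a graph, the number of independent cycles equals edges minus vertices plus components, which is why no separate cycle search is needed here.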

A purely topological classification cannot distinguish between all of the digits, because some digits are topologically too similar. For example, the digits 6 and 9 are topologically the same when written in this style. Persistent homology, however, gives us more information.
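To see how persistence can separate a 6 from a 9, consider a rough sketch that sweeps a pixel graph from top to bottom and records the row at which each loop is completed. The sweep direction, the function name `cycle_births`, the `'#'` pixel encoding, and the 4-adjacency are assumptions made for illustration; the repository itself computes persistent homology with Dionysus 2.

```python
def cycle_births(image):
    """Rows at which a top-to-bottom sweep closes a loop in the pixel graph.

    image: list of equal-length strings; '#' marks a foreground pixel.
    Each edge enters the filtration at the larger row index of its two
    endpoints. An edge whose endpoints are already connected gives birth
    to a 1-dimensional class (a loop) at that value.
    """
    pixels = {(r, c) for r, row in enumerate(image)
              for c, ch in enumerate(row) if ch == "#"}

    edges = []
    for (r, c) in pixels:
        for q in ((r + 1, c), (r, c + 1)):  # 4-adjacency, each edge once
            if q in pixels:
                edges.append((max(r, q[0]), (r, c), q))
    edges.sort()  # process edges in order of filtration value

    parent = {p: p for p in pixels}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    births = []
    for value, u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            births.append(value)   # edge closes a loop
        else:
            parent[ru] = rv        # edge merges two components
    return births


nine = ["###",
        "#.#",
        "###",
        "..#",
        "..#"]

six = ["#..",
       "#..",
       "###",
       "#.#",
       "###"]

print(cycle_births(nine))  # [2]: the loop is complete near the top
print(cycle_births(six))   # [4]: the loop is complete only at the bottom
```

Both toy digits have one component and one loop, but the loop is born at a different point of the sweep, which is the kind of extra information a filtration provides.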

Persistent homology was computed using the computational topology package Dionysus 2; for more information, see the package documentation [2].

How-to

Dependencies:

  • Python (2 or 3);
  • Dionysus 2 for computing persistent homology;
  • Boost version 1.55 or higher for Dionysus 2;
  • NumPy for loading data and computing;
  • Scikit-learn for machine learning algorithms;
  • Scikit-image for image pre-processing;
  • Matplotlib for plotting;
  • NetworkX for plotting graphs.
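Before running the scripts, it can help to confirm that the Python packages above are importable. The following is a small optional sketch, not part of the repository; the helper name `check_dependencies` is hypothetical, and the import names are the usual module names of these packages. Boost is a C++ library, so this check does not cover it.

```python
from importlib.util import find_spec

# Package name as listed above -> module name used in "import ..."
REQUIRED = {
    "Dionysus 2": "dionysus",
    "NumPy": "numpy",
    "scikit-learn": "sklearn",
    "scikit-image": "skimage",
    "Matplotlib": "matplotlib",
    "NetworkX": "networkx",
}


def check_dependencies(required=REQUIRED):
    """Map each package name to True if its module can be located."""
    return {name: find_spec(module) is not None
            for name, module in required.items()}


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```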

To get the data, run scripts/prepare_data.py:

$ cd scripts
$ python3 prepare_data.py

This script downloads 10000 images of digits and saves them as the NumPy arrays X_10000.npy and y_10000.npy in the data directory.

To extract the features, run src/tda_digits.py:

$ cd src
$ python3 tda_digits.py

This generates the figures for the digit 8, which you can find in the example directory.

For details on how to use the functions and classes, see the Jupyter notebooks Example.ipynb and Classification.ipynb in the scripts directory.

References

[1] Aaron Adcock, Erik Carlsson, Gunnar Carlsson, "The Ring of Algebraic Functions on Persistence Bar Codes", Apr 2013. https://arxiv.org/abs/1304.0530

[2] Dmitriy Morozov, "Dionysus 2 documentation". https://mrzv.org/software/dionysus2/

Languages

Python 63.9%, JavaScript 28.7%, HTML 2.6%, Asymptote 2.1%, Shell 1.3%, CSS 1.3%