An implementation of selected machine learning algorithms for basic natural language processing in Go. The initial focus for this project is Latent Semantic Analysis, enabling retrieval/search, clustering and classification of text documents based upon their semantic content.
Built upon gonum/matrix with some inspiration taken from Python's scikit-learn.
Check out the companion blog post or the go documentation page for full usage and examples.
- Sparse matrix implementations for more effective memory usage
- Convert plain text strings into numerical feature vectors for analysis
- Stop word removal to filter out frequently occurring English words e.g. "the", "and"
- Feature hashing implementation ('the hashing trick', using MurmurHash3) for reduced memory requirements and reduced reliance on training data
- TF-IDF weighting to down-weight frequently occurring words
- LSA (Latent Semantic Analysis, also known as Latent Semantic Indexing (LSI)) implementation using truncated SVD (Singular Value Decomposition) for dimensionality reduction
- Cosine similarity implementation to calculate the similarity between feature vectors, measured as the cosine of the angle between them
- Pipelining of transformations to simplify usage e.g. vectorisation -> tf-idf weighting -> truncated SVD
- Ability to persist trained models
- LDA (Latent Dirichlet Allocation) implementation for topic extraction
- Stemming to treat words with a common root as the same e.g. "go" and "going"
- Querying based on multiple query strings (using their centroid) rather than just a single query string.
- Partitioning support for the Latent Semantic Index (LSI)
- Clustering algorithms e.g. Hierarchical, K-means, etc.
- Classification algorithms e.g. SVM, random forest, etc.
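To illustrate the feature hashing idea from the list above: each token is hashed directly to a column index, so no vocabulary has to be learned from training data. This is a minimal standalone sketch, not the library's API; it uses FNV-1a from the standard library in place of MurmurHash3, and `hashVectorise` and its parameters are illustrative names.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

// hashVectorise maps each token in doc to one of dims buckets using a
// hash function, producing a fixed-length numerical feature vector
// without needing a pre-built vocabulary. FNV-1a stands in here for
// the MurmurHash3 used by the library.
func hashVectorise(doc string, dims uint32) []float64 {
	vec := make([]float64, dims)
	for _, token := range strings.Fields(strings.ToLower(doc)) {
		h := fnv.New32a()
		h.Write([]byte(token))
		vec[h.Sum32()%dims]++ // count collisions into the same bucket
	}
	return vec
}

func main() {
	v := hashVectorise("the cat sat on the mat", 8)
	fmt.Println(v)
}
```

Note that with a small number of dimensions, distinct tokens may collide into the same bucket; in practice the dimensionality is chosen large enough to make collisions rare.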
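The TF-IDF weighting mentioned above can be sketched as follows. This is a simplified standalone example, not the library's implementation: it uses one common smoothed IDF form, `tf * (log(n/df) + 1)`, and the exact variant used by the library may differ.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// tfidf weights raw term counts so that terms appearing in many
// documents (like "the") contribute less than rarer, more
// discriminative terms. Returns one term->weight map per document.
func tfidf(corpus []string) []map[string]float64 {
	df := map[string]int{} // document frequency per term
	counts := make([]map[string]float64, len(corpus))
	for i, doc := range corpus {
		counts[i] = map[string]float64{}
		for _, t := range strings.Fields(strings.ToLower(doc)) {
			counts[i][t]++
		}
		for t := range counts[i] {
			df[t]++
		}
	}
	n := float64(len(corpus))
	for i := range counts {
		for t, tf := range counts[i] {
			// smoothed inverse document frequency
			counts[i][t] = tf * (math.Log(n/float64(df[t])) + 1)
		}
	}
	return counts
}

func main() {
	w := tfidf([]string{"the cat sat", "the dog sat", "the cat ran"})
	// "the" occurs in every document, so it gets the minimum weight;
	// "cat" occurs in only two, so it is weighted higher.
	fmt.Println(w[0]["the"], w[0]["cat"])
}
```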
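The cosine similarity measure from the feature list compares the directions of two feature vectors rather than their magnitudes, which makes it robust to differences in document length. A minimal sketch (the function name and signature here are illustrative, not the library's API):

```go
package main

import (
	"fmt"
	"math"
)

// cosineSimilarity returns the cosine of the angle between vectors a
// and b: 1 means they point the same way (identical term mix), 0 means
// they are orthogonal (no terms in common). Assumes len(a) == len(b).
func cosineSimilarity(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0 // a zero vector has no direction to compare
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	fmt.Println(cosineSimilarity([]float64{1, 0}, []float64{1, 0})) // 1
	fmt.Println(cosineSimilarity([]float64{1, 0}, []float64{0, 1})) // 0
}
```

For retrieval, the query string is transformed into the same vector space as the indexed documents and the documents are ranked by their cosine similarity to the query vector.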