An implementation of selected machine learning algorithms for basic natural language processing in Go. The initial focus for this project is Latent Semantic Analysis, enabling retrieval/search, clustering and classification of text documents based upon their semantic content.
Built upon gonum/matrix with some inspiration taken from Python's scikit-learn.
Check out the companion blog post or the go documentation page for full usage and examples.
- Sparse matrix implementations for more effective memory usage
- Convert plain text strings into numerical feature vectors for analysis
- Stop word removal to filter out frequently occurring English words e.g. "the", "and"
- Feature hashing implementation ('the hashing trick', using MurmurHash3) for reduced memory requirements and reduced reliance on training data
- TF-IDF weighting to down-weight frequently occurring words
- LSA (Latent Semantic Analysis, also known as Latent Semantic Indexing (LSI)) implementation using truncated SVD (Singular Value Decomposition) for dimensionality reduction
- Cosine similarity implementation to calculate the similarity between feature vectors, measured as the cosine of the angle between them
- Pipelining of transformations to simplify usage e.g. vectorisation -> tf-idf weighting -> truncated SVD
- Ability to persist trained models
- LDA (Latent Dirichlet Allocation) implementation for topic extraction
- Stemming to treat words with a common root as the same e.g. "go" and "going"
- Querying based on multiple query strings (using their centroid) rather than just a single query string.
- Partitioning support for the Latent Semantic Index (LSI)
- Clustering algorithms e.g. Hierarchical, K-means, etc.
- Classification algorithms e.g. SVM, random forest, etc.
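To illustrate the feature hashing idea from the list above: each token is hashed directly to a column index, so no vocabulary has to be learned from training data. This is a minimal standalone sketch, not the library's API; it uses FNV-1a from the standard library in place of MurmurHash3, and `hashVectorise` and its parameters are illustrative names.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

// hashVectorise maps each token in doc to one of dims buckets using a
// hash function, producing a fixed-length numerical feature vector
// without needing a pre-built vocabulary. FNV-1a stands in here for
// the MurmurHash3 used by the library.
func hashVectorise(doc string, dims uint32) []float64 {
	vec := make([]float64, dims)
	for _, token := range strings.Fields(strings.ToLower(doc)) {
		h := fnv.New32a()
		h.Write([]byte(token))
		vec[h.Sum32()%dims]++ // count collisions into the same bucket
	}
	return vec
}

func main() {
	v := hashVectorise("the cat sat on the mat", 8)
	fmt.Println(v)
}
```

Note that with a small number of dimensions, distinct tokens may collide into the same bucket; in practice the dimensionality is chosen large enough to make collisions rare.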
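The TF-IDF weighting mentioned above can be sketched as follows. This is a simplified standalone example, not the library's implementation: it uses one common smoothed IDF form, `tf * (log(n/df) + 1)`, and the exact variant used by the library may differ.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// tfidf weights raw term counts so that terms appearing in many
// documents (like "the") contribute less than rarer, more
// discriminative terms. Returns one term->weight map per document.
func tfidf(corpus []string) []map[string]float64 {
	df := map[string]int{} // document frequency per term
	counts := make([]map[string]float64, len(corpus))
	for i, doc := range corpus {
		counts[i] = map[string]float64{}
		for _, t := range strings.Fields(strings.ToLower(doc)) {
			counts[i][t]++
		}
		for t := range counts[i] {
			df[t]++
		}
	}
	n := float64(len(corpus))
	for i := range counts {
		for t, tf := range counts[i] {
			// smoothed inverse document frequency
			counts[i][t] = tf * (math.Log(n/float64(df[t])) + 1)
		}
	}
	return counts
}

func main() {
	w := tfidf([]string{"the cat sat", "the dog sat", "the cat ran"})
	// "the" occurs in every document, so it gets the minimum weight;
	// "cat" occurs in only two, so it is weighted higher.
	fmt.Println(w[0]["the"], w[0]["cat"])
}
```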
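The cosine similarity measure from the feature list compares the directions of two feature vectors rather than their magnitudes, which makes it robust to differences in document length. A minimal sketch (the function name and signature here are illustrative, not the library's API):

```go
package main

import (
	"fmt"
	"math"
)

// cosineSimilarity returns the cosine of the angle between vectors a
// and b: 1 means they point the same way (identical term mix), 0 means
// they are orthogonal (no terms in common). Assumes len(a) == len(b).
func cosineSimilarity(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0 // a zero vector has no direction to compare
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	fmt.Println(cosineSimilarity([]float64{1, 0}, []float64{1, 0})) // 1
	fmt.Println(cosineSimilarity([]float64{1, 0}, []float64{0, 1})) // 0
}
```

For retrieval, the query string is transformed into the same vector space as the indexed documents and the documents are ranked by their cosine similarity to the query vector.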