outkaj / nlp

Selected Machine Learning algorithms for basic natural language processing in Golang

Natural Language Processing

License: MIT

nlp

An implementation of selected machine learning algorithms for basic natural language processing in Go. The initial focus of this project is Latent Semantic Analysis (LSA), which allows retrieval/searching, clustering and classification of text documents based on their semantic content.

Built upon gonum/matrix with some inspiration taken from Python's scikit-learn.

Check out the companion blog post or the Go documentation page for full usage and examples.


Features

  • Sparse matrix implementations for more effective memory usage
  • Convert plain text strings into numerical feature vectors for analysis
  • Stop word removal to remove frequently occurring English words e.g. "the", "and"
  • Feature hashing implementation ('the hashing trick', using MurmurHash3) for reduced memory requirements and reduced reliance on training data
  • TF-IDF weighting to account for frequently occurring words
  • LSA (Latent Semantic Analysis aka Latent Semantic Indexing (LSI)) implementation using truncated SVD (Singular Value Decomposition) for dimensionality reduction.
  • Cosine similarity implementation to calculate the similarity between feature vectors (measured as the cosine of the angle between them).
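
To make the stop word removal, TF-IDF weighting and cosine similarity features above more concrete, here is a minimal sketch that uses only the Go standard library. It does not use this package's API: the names tokenise, tfidf, cosine and stopWords are hypothetical, the stop word list is a tiny illustrative one, and the smoothed IDF formula ln((1+n)/(1+df)) + 1 is just one common variant, assumed here for illustration.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// stopWords is a tiny illustrative list (an assumption, not the package's list).
var stopWords = map[string]bool{"the": true, "and": true, "a": true, "of": true}

// tokenise lower-cases the text, splits it on whitespace and drops stop words.
func tokenise(text string) []string {
	var tokens []string
	for _, w := range strings.Fields(strings.ToLower(text)) {
		if !stopWords[w] {
			tokens = append(tokens, w)
		}
	}
	return tokens
}

// tfidf returns one sparse vector (term -> weight) per document.
// Each weight is the term count multiplied by a smoothed inverse
// document frequency: ln((1+n)/(1+df)) + 1.
func tfidf(docs []string) []map[string]float64 {
	vecs := make([]map[string]float64, len(docs))
	df := map[string]float64{} // number of documents containing each term
	for i, d := range docs {
		vecs[i] = map[string]float64{}
		for _, t := range tokenise(d) {
			if vecs[i][t] == 0 {
				df[t]++ // first time this term is seen in this document
			}
			vecs[i][t]++
		}
	}
	n := float64(len(docs))
	for _, vec := range vecs {
		for t, count := range vec {
			vec[t] = count * (math.Log((1+n)/(1+df[t])) + 1)
		}
	}
	return vecs
}

// cosine returns the cosine of the angle between two sparse vectors,
// or 0 if either vector has zero length.
func cosine(a, b map[string]float64) float64 {
	var dot, normA, normB float64
	for t, x := range a {
		dot += x * b[t] // terms missing from b contribute 0
		normA += x * x
	}
	for _, y := range b {
		normB += y * y
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

func main() {
	docs := []string{
		"the quick brown fox",
		"the quick brown dog",
		"stock markets fell sharply",
	}
	vecs := tfidf(docs)
	fmt.Printf("fox vs dog:    %.3f\n", cosine(vecs[0], vecs[1]))
	fmt.Printf("fox vs stocks: %.3f\n", cosine(vecs[0], vecs[2]))
}
```

The first two documents share two of their three remaining terms, so their similarity falls between 0 and 1, while the third shares no terms with the first and scores 0. LSA goes a step further by projecting such vectors into a lower-dimensional space with truncated SVD, which this sketch does not attempt.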

Planned

  • Pipelining of transformations to simplify usage e.g. vectorisation -> tf-idf weighting -> truncated SVD
  • Ability to persist trained models
  • LDA (Latent Dirichlet Allocation) implementation for topic extraction
  • Stemming to treat words with common root as the same e.g. "go" and "going"
  • Querying based on multiple query strings (using their centroid) rather than just a single query string.
  • Support partitioning for the Latent Semantic Index (LSI)
  • Clustering algorithms e.g. Hierarchical, K-means, etc.
  • Classification algorithms e.g. SVM, random forest, etc.
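
As an illustration of the planned centroid-based querying listed above, the sketch below averages several query vectors into a single vector that could then be compared against documents in the usual way. This is only one way such a feature might work, not the project's implementation: the centroid function is hypothetical and assumes dense query vectors of equal length.

```go
package main

import "fmt"

// centroid returns the element-wise mean of the given dense vectors.
// It assumes all vectors have the same length and returns nil when
// no vectors are supplied.
func centroid(vectors [][]float64) []float64 {
	if len(vectors) == 0 {
		return nil
	}
	mean := make([]float64, len(vectors[0]))
	for _, v := range vectors {
		for i, x := range v {
			mean[i] += x
		}
	}
	for i := range mean {
		mean[i] /= float64(len(vectors))
	}
	return mean
}

func main() {
	// Two hypothetical query vectors in a three-dimensional feature space.
	queries := [][]float64{{1, 0, 2}, {3, 2, 0}}
	fmt.Println(centroid(queries)) // prints [2 1 1]
}
```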

References

  1. Wikipedia
  2. Rosario, Barbara. Latent Semantic Indexing: An overview. INFOSYS 240 Spring 2000
  3. Latent Semantic Analysis, a Scholarpedia article on LSA written by Tom Landauer, one of the creators of LSA.
  4. Thomo, Alex. Latent Semantic Analysis (Tutorial).
  5. Latent Semantic Indexing. Stanford NLP Course
