This repository contains an R package which wraps the StarSpace C++ library (https://github.com/facebookresearch/StarSpace), allowing the following:
- Text classification
- Learning word, sentence or document level embeddings
- Finding sentence or document similarity
- Ranking web documents
- Content-based or Collaborative filtering-based Recommendation, e.g. recommending music or videos.
This package is not on CRAN, you can only install it as follows: devtools::install_github("bnosac/ruimtehol", build_vignettes = TRUE)
This R package allows to Build Starspace models on your own text / Get embeddings of words/ngrams/sentences/documents / Get predictions from a model (e.g. classification / ranking) / Get nearest neighbours similarity
The following functions are made available.
Function | Functionality |
---|---|
starspace |
Low-level interface to build a Starspace model |
starspace_load_model |
Load a pre-trained model or a tab-separated file |
starspace_save_model |
Save a Starspace model to a tab-separated file |
starspace_embedding |
Get embeddings of documents/words/ngrams/labels |
starspace_knn |
Find k-nearest neighbouring information for new text |
starspace_dictonary |
Get words/labels with the model dictionary |
predict.textspace |
Get predictions along a Starspace model |
as.matrix |
Get words and labels embeddings |
embedding_similarity |
Basic cosine/dot product based similarity of embeddings |
embed_wordspace |
Build a Starspace model which calculates word/ngram embeddings |
embed_sentencespace |
Build a Starspace model which calculates sentence embeddings |
embed_articlespace |
Build a Starspace model for embedding an article and sentences-article similarities |
embed_tagspace |
Build a Starspace model for multi-label classification |
embed_docspace |
Build a Starspace model for content-based recommendation |
embed_pagespace |
Build a Starspace model for interest-based recommendation |
embed_entityrelationspace |
Build a Starspace model for entity relationship completion |
library(ruimtehol)
## Get some training data
download.file("https://s3.amazonaws.com/fair-data/starspace/wikipedia_train250k.tgz", "wikipedia_train250k.tgz")
x <- readLines("dev/wikipedia_train250k.tgz", encoding = "UTF-8")
x <- x[-c(1:9)]
x <- x[sample(x = length(x), size = 10000)]
writeLines(text = x, sep = "\n", con = "wikipedia_train10k.txt")
## Train
model <- starspace(file = "dev/wikipedia_train10k.txt", fileFormat = "labelDoc", dim = 10, trainMode = 3)
model
Object of class textspace
dimension of the embedding: 10
training arguments:
loss: hinge
margin: 0.05
similarity: cosine
epoch: 5
adagrad: TRUE
lr: 0.01
termLr: 1e-09
norm: 1
maxNegSamples: 10
negSearchLimit: 50
p: 0.5
shareEmb: TRUE
ws: 5
dropoutLHS: 0
dropoutRHS: 0
initRandSd: 0.001
embedding <- as.matrix(model)
embedding[c("school", "house"), ]
1 2 3 4 5 6 7 8 9 10
school 0.0201249 -0.00478271 -0.018693000 0.0155070 0.01113670 -0.0184385 0.00892674 0.00549661 -0.0144082 0.0056668
house 0.0123371 0.01406140 -0.000166073 0.0313477 -0.00962703 -0.0237911 0.00225086 0.03393420 0.0035634 0.0160656
## Save trained model as a binary file or as TSV so that you can inspect the embeddings e.g. with data.table::fread("wikipedia_embeddings.tsv")
starspace_save_model(model)
starspace_save_model(model, file = "wikipedia_embeddings.bin")
starspace_save_model(model, file = "wikipedia_embeddings.tsv", as_tsv = TRUE)
## Load a pre-trained model
model <- starspace_load_model("wikipedia_embeddings.bin")
## Get the document embedding
starspace_embedding(model, "The apps to predict / get nearest neighbours are still under construction.")
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] -0.4213823 0.4987145 -0.08066317 -0.6519815 -0.1743725 0.09401496 0.02670185 0.262726 0.1761705 0.04599866
starspace_knn(model, "What does this bunch of text look like", k = 10)
starspace_dictionary(model)
Below Starspace is used for classification
library(fastrtext)
library(ruimtehol)
data(train_sentences, package = "fastrtext")
filename <- tempfile()
writeLines(text = paste(paste0("__label__", train_sentences$class.text), tolower(train_sentences$text)),
con = filename)
model <- starspace(file = filename,
trainMode = 0, label = "__label__",
similarity = "dot", verbose = TRUE, initRandSd = 0.01, adagrad = FALSE,
ngrams = 1, lr = 0.01, epoch = 5, thread = 20, dim = 10, negSearchLimit = 5, maxNegSamples = 3)
predict(model, "We developed a two-level machine learning approach that in the first level considers two different
properties important for protein-protein binding derived from structural models of V3 and V3 sequences.")
- Why did you call the package ruimtehol? Because that is the translation of StarSpace in WestVlaams.
- The R wrapper is distributed under the Mozilla Public License 2.0. The package contains a copy of the StarSpace C++ code (namely all code under src/Starspace) which has a BSD license (which is available in file LICENSE.notes) and also has an accompanying PATENTS file which you can inspect here.
- The package has only been tested on Windows, Ubuntu and Debian at this stage
Need support in text mining? Contact BNOSAC: http://www.bnosac.be