projectoR


Project dense vector representations of texts onto a 2D plane to better understand neural models applied to NLP.

[Screenshot: VizProjector1]

Introduction

Since the famous word2vec, embeddings have been everywhere in NLP (and in close areas such as information retrieval).
The main idea behind embeddings is to represent texts (made of characters, words, sentences, or even larger blocks) as numeric vectors.
This works very well and provides abilities that are out of reach for the classic bag-of-words (BoW) approach.
However, embeddings (i.e. vector representations) are difficult for humans to understand, analyze and debug, because they have many more than 3 dimensions.

One well-known way to get a sense of what embeddings capture is to project them onto a 2D scatter plot and visualize the distances between texts, look for clusters, and so on.
This is the very purpose of this package!

Two algorithms can be used for the projection (a toy comparison of the two is sketched below):

  • PCA (fast)
  • t-SNE (slower, but usually gives better results)
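
To see concretely what these two projections do, independent of any language model, here is a minimal self-contained sketch (not part of projector) that reduces a toy 100-dimensional embedding matrix to 2D, using stats::prcomp for PCA and the Rtsne package for t-SNE:

# Toy sketch: compare PCA and t-SNE projections of a random embedding matrix
library(Rtsne)

set.seed(42)
# 300 fake "word" vectors of dimension 100
toy_embeddings <- matrix(rnorm(300 * 100), nrow = 300, ncol = 100)

# PCA: linear and fast, keep only the first two principal components
pca_coords <- prcomp(toy_embeddings, rank. = 2)$x

# t-SNE: non-linear and slower, but often separates clusters more clearly
tsne_coords <- Rtsne(toy_embeddings, dims = 2, perplexity = 30)$Y

plot(pca_coords, main = "PCA projection")
plot(tsne_coords, main = "t-SNE projection")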

An interactive module is also provided.

This package is partly inspired by the TensorFlow Embedding Projector.

Installation

You can install the projector package from CRAN or from GitHub as follows:

# From CRAN
install.packages("projector")

# From GitHub
# install.packages("devtools")
devtools::install_github("pommedeterresautee/projector")

Demo code

The demo below uses a model bundled with the fastrtext package for convenience.
This model is of very low quality because of CRAN package size constraints. It is highly advisable to use instead the model pretrained by Facebook on Wikipedia (several GB in size), available from the fastText project.

library(projector)
library(fastrtext)

# Load the small unsupervised fastText model shipped with fastrtext
model_test_path <- system.file("extdata",
                               "model_unsupervised_test.bin",
                               package = "fastrtext")
model <- load_model(model_test_path)
# The viz below is from the English Wikipedia fastText model
# model <- load_model("~/Downloads/wiki.en.bin")

# Embeddings of (at most) the first 500,000 words of the dictionary
word_embeddings <- get_word_vectors(model, words = head(get_dictionary(model), 5e5))

# Build an approximate nearest-neighbor (Annoy) index over the embeddings
annoy_model <- get_annoy_model(word_embeddings, 5)

# pivot_word <- "friendship" # for the Wikipedia viz
pivot_word <- "out"

# Retrieve the 500 nearest neighbors of the pivot word and project them with t-SNE
df <- retrieve_neighbors(text = pivot_word, projection_type = "tsne", annoy_model = annoy_model, n = 500)
plot_texts(coordinates = df, min_cluster_size = 3)

[Screenshot: VizProjector1 – 2D projection of the neighbors of the pivot word]
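
The projection above uses t-SNE. If t-SNE is too slow on a large neighborhood, PCA is the quicker alternative listed earlier; the sketch below assumes "pca" is the other accepted value of projection_type (check ?retrieve_neighbors to confirm):

# Assumption: "pca" is the alternative projection_type (see ?retrieve_neighbors);
# PCA is faster but usually separates clusters less sharply than t-SNE.
df_pca <- retrieve_neighbors(text = pivot_word,
                             projection_type = "pca",
                             annoy_model = annoy_model,
                             n = 500)
plot_texts(coordinates = df_pca, min_cluster_size = 3)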

Interactive exploration

Exploring the embeddings is even more powerful when you can play with them and see how they react.
For that purpose, the interactive Shiny application lets you pick a word as a pivot and discover its n closest neighbors.

interactive_embedding_exploration(annoy_model)

[Screenshot: VizProjector2 – interactive Shiny application]
