michalovadek / top2vecr

An R implementation of top2vec, a topic modelling technique relying on jointly learned document and word embeddings

top2vecr

top2vecr is an R implementation of top2vec, a topic modelling technique relying on jointly learned document and word embeddings.

The main idea is that groups of documents lying close to each other in the joint document-word vector space can be interpreted as topics. The words closest to each document cluster serve as that topic's descriptors. The document vectors produced by doc2vec are first reduced in dimensionality with UMAP, and HDBSCAN then identifies document clusters in the reduced space.
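The topic-descriptor step can be illustrated with a minimal base R sketch. It is not the package's code: the toy embeddings, the cluster labels, and the names `doc_vecs`, `word_vecs`, `clusters` and `normalise` are all assumptions made up for illustration. It also assumes each topic is represented by the mean of its cluster's document vectors and that word similarity is measured by cosine similarity.

```r
# Toy joint embedding space: one row per vector, one column per dimension.
# All values are invented for illustration only.
doc_vecs <- rbind(
  d1 = c(0.9, 0.1),
  d2 = c(0.8, 0.2),
  d3 = c(0.1, 0.9),
  d4 = c(0.2, 0.8)
)
word_vecs <- rbind(
  court  = c(1.0, 0.0),
  judge  = c(0.9, 0.3),
  market = c(0.0, 1.0),
  trade  = c(0.2, 0.9)
)

# Hypothetical cluster labels, standing in for the output of a clustering step
clusters <- c(1, 1, 2, 2)

# Topic vector = mean of the document vectors in each cluster
topic_vecs <- rowsum(doc_vecs, clusters) / as.vector(table(clusters))

# Scale every row to unit length so that dot products are cosine similarities
normalise <- function(m) m / sqrt(rowSums(m^2))
sims <- normalise(topic_vecs) %*% t(normalise(word_vecs))

# The two most similar words per topic (one column per topic)
top_words <- apply(sims, 1, function(s) names(sort(s, decreasing = TRUE))[1:2])
top_words
```

In a real run the document and word vectors would come from the doc2vec model and the labels from HDBSCAN applied to the UMAP-reduced document vectors; only the centroid-and-nearest-words logic is shown here.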

Unlike the original Python implementation, this package does not yet support pre-trained sentence encoders or transformers.

Development has been halted due to performance limitations in the R implementation of UMAP.

Installation

You can install the development version from GitHub with:

# install.packages("remotes")
remotes::install_github("michalovadek/top2vecr")

Languages

Language:R 100.0%