
AutoTM

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

Automatic hyperparameter selection for topic models (ARTM approach) using evolutionary algorithms. AutoTM provides the necessary tools to preprocess English and Russian text datasets and to tune topic models.

What is AutoTM?

Topic modeling is one of the basic methods for exploratory data analysis (EDA) of unlabelled text data. While the ARTM (additive regularization for topic models) approach provides significant flexibility and quality comparable to or better than neural approaches, such models are hard to tune due to the number of hyperparameters and their possible combinations.

To overcome these tuning problems, AutoTM provides an easy way to represent a learning strategy for training specific models on input corpora.

Learning strategy representation
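
As a rough illustration of the idea shown above (not AutoTM's internal encoding; all field names here are hypothetical), a learning strategy can be viewed as an ordered sequence of training stages, each running several collection passes with its own regularizer coefficients:

from dataclasses import dataclass
from typing import List

@dataclass
class Stage:
    iterations: int        # passes over the collection in this stage
    decorrelation: float   # coefficient of the topic decorrelation regularizer
    sparse_phi: float      # sparsity regularizer for the topic-word matrix
    sparse_theta: float    # sparsity regularizer for the document-topic matrix

# A strategy is an ordered sequence of stages; the evolutionary search
# mutates and recombines such vectors of hyperparameters.
strategy: List[Stage] = [
    Stage(iterations=20, decorrelation=0.1, sparse_phi=0.0, sparse_theta=0.0),
    Stage(iterations=30, decorrelation=0.2, sparse_phi=-0.5, sparse_theta=-0.3),
]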

The optimization procedure is performed by a genetic algorithm whose operators are specifically tuned for the task. To speed up the procedure, we also implement surrogate modeling that, on some iterations, approximates the fitness function to reduce the computational cost of training topic models.
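
The surrogate trick can be illustrated with a small, self-contained sketch (not AutoTM's actual code): an evolutionary loop where, on alternate generations, a regression model trained on previously evaluated individuals stands in for the expensive fitness evaluation. The fitness function, hyperparameter encoding, and all constants below are toy placeholders.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
DIM, POP, GENS = 8, 20, 30
SURROGATE_EVERY = 2  # use the surrogate on every 2nd generation

def true_fitness(x: np.ndarray) -> float:
    # Stand-in for the expensive step: training a topic model with the
    # hyperparameters encoded in x and scoring the resulting topics.
    return -float(np.sum((x - 0.5) ** 2))

population = rng.random((POP, DIM))
seen_x, seen_y = [], []
surrogate = RandomForestRegressor(n_estimators=50, random_state=0)

for gen in range(GENS):
    if gen % SURROGATE_EVERY == 1 and len(seen_x) >= POP:
        # Cheap approximate evaluation instead of full model training.
        fitness = surrogate.predict(population)
    else:
        fitness = np.array([true_fitness(ind) for ind in population])
        seen_x.extend(population)
        seen_y.extend(fitness)
        surrogate.fit(np.array(seen_x), np.array(seen_y))

    # Pairwise tournament selection ...
    pairs = rng.integers(0, POP, size=(POP, 2))
    winners = pairs[np.arange(POP), np.argmax(fitness[pairs], axis=1)]
    parents = population[winners]
    # ... uniform crossover with a reversed copy, then Gaussian mutation.
    mask = rng.random((POP, DIM)) < 0.5
    children = np.where(mask, parents, parents[::-1])
    children += rng.normal(0.0, 0.05, size=children.shape)
    population = np.clip(children, 0.0, 1.0)

print("best fitness:", max(true_fitness(ind) for ind in population))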

Library scheme

Installation

Note: training of topic models is supported only on Linux distributions.

Via pip:

pip install autotm

From source:

pip install -r requirements.txt

python -m spacy download en_core_web_sm

export PYTHONPATH="${PYTHONPATH}:/path/to/src"

Quickstart

A notebook with an example is available in the examples folder.
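
For orientation, here is a minimal Python sketch of a Quickstart run. It assumes an sklearn-style AutoTM class with a fit_predict method; the import path and parameter names below are assumptions, so see the example notebook for the exact API.

import pandas as pd
from autotm.base import AutoTM  # import path is an assumption

df = pd.read_csv("data/sample_corpora/sample_dataset_lenta.csv")

autotm = AutoTM(
    topic_count=10,                       # illustrative parameter names
    preprocessing_params={"lang": "ru"},  # the sample Lenta dataset is Russian
)

# fit_predict trains the model with the evolved learning strategy and
# returns per-document topic mixtures.
mixtures = autotm.fit_predict(df)
print(mixtures.head())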

Running from the command line

To fit a model: autotmctl --verbose fit --config conf/config.yaml --in data/sample_corpora/sample_dataset_lenta.csv

To predict with a fitted model: autotmctl predict --in data/sample_corpora/sample_dataset_lenta.csv --model model.artm

Backlog:

  • Add tests
  • Add new multi-stage

Citation

@article{10.1093/jigpal/jzac019,
    author  = {Khodorchenko, Maria and Butakov, Nikolay and Sokhin, Timur and Teryoshkin, Sergey},
    title   = {Surrogate-based optimization of learning strategies for additively regularized topic models},
    journal = {Logic Journal of the IGPL},
    year    = {2022},
    month   = {02},
    issn    = {1367-0751},
    doi     = {10.1093/jigpal/jzac019},
    url     = {https://doi.org/10.1093/jigpal/jzac019},
}


License: BSD 3-Clause "New" or "Revised" License

