This repository contains the code and data for the ACL 2022 paper *An Embarrassingly Simple Method to Mitigate Undesirable Properties of Pretrained Language Model Tokenizers*. The paper introduces FLOTA (Few Longest Token Approximation), a simple yet effective method to improve the tokenization of pretrained language models.
Additionally, this repository contains:
- Updated code structure
- An alternative tokenization method based on the original FLOTA idea
- Support for passing prefixes and suffixes to the tokenizer
- Simple HTTP API for tokenizing/encoding words
- Improved CLI
Using pipx:

```shell
pipx install "git+https://github.com/jnk22/flota.git#egg=flota[api,cli]"
```

Using Poetry:

```shell
git clone https://github.com/jnk22/flota
cd flota
poetry install --extras "api cli"
```
The extras `api` and `cli` are only required for the HTTP server and the CLI application, respectively. Both can be omitted if FLOTA is used only as a library.
```python
from flota import AutoFlotaTokenizer, FlotaMode

# Original mode: FLOTA, k=3
flota = AutoFlotaTokenizer.from_pretrained("bert-base-uncased", FlotaMode.FLOTA, k=3)
print(flota.tokenize("visualization"))  # ['vis', '##ua', '##lization']

# Additional mode: FLOTA-DP
flota = AutoFlotaTokenizer.from_pretrained("bert-base-uncased", FlotaMode.FLOTA_DP)
print(flota.tokenize("visualization"))  # ['visual', '##ization']
```
This requires the `cli` extra to be installed!
```shell
flota run bert-base-uncased data/arxiv_cs_1e+02
flota tokenize bert-base-uncased this is an example text to be tokenized
flota encode bert-base-uncased this is an example text to be encoded
```
The FLOTA server is a demo backend that exposes an HTTP API. It requires the `api` extra to be installed.
```shell
flota server --host 127.0.0.1 --port 8000

# In another terminal:
curl -X 'GET' 'http://127.0.0.1:8000/tokenize?word=visualization&model=bert-base-uncased&mode=flota'
```
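The same endpoint can be queried from Python. The sketch below builds the identical request URL as the curl example using only the standard library; the commented lines show how the request would be sent once the server is running (assuming the endpoint returns JSON).

```python
from urllib.parse import urlencode

# Same query parameters as the curl example above.
params = {"word": "visualization", "model": "bert-base-uncased", "mode": "flota"}
url = "http://127.0.0.1:8000/tokenize?" + urlencode(params)
print(url)

# With the server running (assumption: the endpoint returns JSON):
# import json, urllib.request
# tokens = json.load(urllib.request.urlopen(url))
```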
Open http://127.0.0.1:8000/docs or http://127.0.0.1:8000/redoc for OpenAPI visualizations.
- arXiv Dataset (English)
- Ten Thousand German News Articles Dataset (German)
All datasets are available in the `data` directory.
If you use the code or data in this repository, please cite the following paper:
```bibtex
@inproceedings{hofmann2022flota,
  title = {An Embarrassingly Simple Method to Mitigate Undesirable Properties of Pretrained Language Model Tokenizers},
  author = {Hofmann, Valentin and Sch{\"u}tze, Hinrich and Pierrehumbert, Janet},
  booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics},
  year = {2022}
}
```