This repository contains the code and data for the ACL 2022 paper *An Embarrassingly Simple Method to Mitigate Undesirable Properties of Pretrained Language Model Tokenizers*. The paper introduces FLOTA (Few Longest Token Approximation), a simple yet effective method to improve the tokenization of pretrained language models.
Additionally, this repository contains:
- Updated code structure
- An alternative tokenization method based on the original FLOTA idea
- Support for passing prefixes and suffixes to the tokenizer
- Simple HTTP API for tokenizing/encoding words
- Improved CLI
Using pipx:

```shell
pipx install "git+https://github.com/jnk22/flota.git#egg=flota[api,cli]"
```

Using Poetry:

```shell
git clone https://github.com/jnk22/flota
cd flota
poetry install --extras "api cli"
```
The extras `api` and `cli` are only required for the HTTP server and the CLI application, respectively. Both can be omitted if FLOTA is used only as a library.
```python
from flota import AutoFlotaTokenizer, FlotaMode

# Original mode: FLOTA, k=3
flota = AutoFlotaTokenizer.from_pretrained("bert-base-uncased", FlotaMode.FLOTA, k=3)
print(flota.tokenize("visualization"))  # ['vis', '##ua', '##lization']

# Additional mode: FLOTA-DP
flota = AutoFlotaTokenizer.from_pretrained("bert-base-uncased", FlotaMode.FLOTA_DP)
print(flota.tokenize("visualization"))  # ['visual', '##ization']
```
This requires the `cli` extra to be installed!
```shell
flota run bert-base-uncased data/arxiv_cs_1e+02
flota tokenize bert-base-uncased this is an example text to be tokenized
flota encode bert-base-uncased this is an example text to be encoded
```
The FLOTA server is a demo backend that exposes an HTTP API. It requires the `api` extra to be installed.
```shell
flota server --host 127.0.0.1 --port 8000

# In another terminal:
curl -X 'GET' 'http://127.0.0.1:8000/tokenize?word=visualization&model=bert-base-uncased&mode=flota'
```
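The same endpoint can be queried from Python. The sketch below builds the identical request URL as the curl example using only the standard library; the commented lines show how the request would be sent once the server is running (assuming the endpoint returns JSON).

```python
from urllib.parse import urlencode

# Same query parameters as the curl example above.
params = {"word": "visualization", "model": "bert-base-uncased", "mode": "flota"}
url = "http://127.0.0.1:8000/tokenize?" + urlencode(params)
print(url)

# With the server running (assumption: the endpoint returns JSON):
# import json, urllib.request
# tokens = json.load(urllib.request.urlopen(url))
```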
Open http://127.0.0.1:8000/docs or http://127.0.0.1:8000/redoc for OpenAPI visualizations.
- arXiv Dataset (English)
- Ten Thousand German News Articles Dataset (German)
All datasets are available in the `data` directory.
If you use the code or data in this repository, please cite the following paper:
```bibtex
@inproceedings{hofmann2022flota,
  title = {An Embarrassingly Simple Method to Mitigate Undesirable Properties of Pretrained Language Model Tokenizers},
  author = {Hofmann, Valentin and Sch{\"u}tze, Hinrich and Pierrehumbert, Janet},
  booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics},
  year = {2022}
}
```