ctrlaltf2 / phonemes

An experiment in tokenization, grounded in linguistics

/ˈfoʊ.nimz/

An experiment in tokenization

Motivation

Language is a signed-and-spoken-first construct; sight and sound are the senses through which it evolved. Writing, on the other hand, is a messy and imperfect mapping from the raw atoms of a spoken language onto a textual representation, yet we use it to train large language models with seemingly little thought given to where it comes from. Because of that messiness, there may be a more efficient way to process text streams than byte pair encoding (BPE) alone. My take on the problem is to map orthography back into its raw spoken form. This normalizes a lot of language, and should let a second pass of a standard tokenizer break text up into its morphemes more cleanly. My hypothesis is that the more tokens map directly onto single morphemes, the more efficiently a model can learn: less data, or fewer parameters, for the same level of performance. By targeting model efficiency, my hope is to make language model training and inference a bit more democratized and usable for the average person.

Introduction

  • TODO

Building

Hard Dependencies

  • CMake >= 3.27
  • A C++ compiler supporting C++20

Do the usual CMake dance of mkdir build && cd build && cmake .. && make and you should get a successful build. CMake handles all other dependencies. As of now, this project depends only on espeak-ng.

Roadmap

Tokenizer

  • Integrate espeak-ng into the C++ project
  • From C++, phonemize a string (see the sketch after this list)
  • Design an algorithm for splitting text into words, so that single words can be fed to espeak-ng
    • Adjacent words affect pronunciation in espeak-ng. While true to life, this is annoying and introduces noise into the dataset, likely about as much noise as I'm trying to remove by phonemizing in the first place.
    • The programming use case should still work: spaces shouldn't be collapsed, moved, or otherwise touched when going through this step. If two words have two spaces between them, that should stay. Phonemization should only affect the words themselves.
  • Implement a fast, parallel data preprocessor to phonemize text en masse (this repository)
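
As a rough illustration of the phonemization items above, here is a minimal sketch (not this repository's implementation) that phonemizes a line word by word with espeak-ng's C API while copying whitespace through untouched. It assumes espeak_Initialize, espeak_SetVoiceByName, and espeak_TextToPhonemes from espeak-ng's speak_lib.h; the voice name and the IPA phoneme-mode flag are assumptions that should be checked against the installed espeak-ng version.

    // Sketch only: phonemize each word of a line, leaving all whitespace exactly as-is.
    // Header path may be <espeak/speak_lib.h> on some installs.
    #include <cctype>
    #include <iostream>
    #include <string>

    #include <espeak-ng/speak_lib.h>

    // Phonemize a single word; fall back to the original word if espeak-ng returns nothing.
    std::string phonemize_word(const std::string &word) {
        const void *text = word.c_str();
        // espeakPHONEMES_IPA asks for IPA output; exact flag semantics vary by version.
        const char *phonemes = espeak_TextToPhonemes(&text, espeakCHARS_UTF8, espeakPHONEMES_IPA);
        return phonemes ? std::string{phonemes} : word;
    }

    int main() {
        // Text-to-phoneme translation only; no audio playback is needed.
        if (espeak_Initialize(AUDIO_OUTPUT_SYNCHRONOUS, 0, nullptr, 0) < 0) {
            std::cerr << "failed to initialize espeak-ng\n";
            return 1;
        }
        espeak_SetVoiceByName("en-us");  // assumes an English voice is available

        const std::string line = "Once  upon a time";  // the double space must survive
        std::string out;
        std::size_t i = 0;
        while (i < line.size()) {
            if (std::isspace(static_cast<unsigned char>(line[i]))) {
                out += line[i++];  // copy whitespace through verbatim
            } else {
                const std::size_t start = i;
                while (i < line.size() && !std::isspace(static_cast<unsigned char>(line[i])))
                    ++i;
                out += phonemize_word(line.substr(start, i - start));
            }
        }
        std::cout << out << '\n';

        espeak_Terminate();
        return 0;
    }

Phonemizing one word at a time deliberately gives up espeak-ng's cross-word pronunciation effects in exchange for a deterministic, whitespace-preserving mapping, which matches the splitting requirements above.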

Training

  • Create a derivative dataset of TinyStories (arXiv:2305.07759) that is the phonemized version of the corpus --> TinyStories-Pho
  • Run BPE over TinyStories-Pho using the paper's BPE parameters --> TinyStories-Pho.tokenizer
  • Run WordPiece over TinyStories-Pho and do a spot check comparison on how it tokenizes vs. BPE. WordPiece may synergize with the Pho dataset better than BPE.
  • Run BPE over TinyStories using the paper's BPE parameters if available --> TinyStories.tokenizer
  • Run BPE over TinyStories-Pho using a tuned set of BPE parameters --> TinyStories-Pho-Mod.tokenizer
  • Run BPE over TinyStories using the same tuned set of BPE parameters --> TinyStories-Mod.tokenizer
    • The hypothesis is that phonemization may reduce the "optimal" vocabulary size for the tokenizer because of the normalizing effect it has on language (see the toy BPE sketch after this list).
  • Set up ColossalAI and train a simple test model on my RTX 3060 12GB + 96GB machine
  • Set up training code for the TinyStories model
  • Train on TinyStories using TinyStories.tokenizer --> TinyStories.safetensors
    • Verify the output of the TinyStories model. Results should be at least broadly similar to the paper's. Due to the design of this study, they don't have to match exactly, because this training process will be repeated with only one variable changed at a time (the tokenizer, then the tokenizer parameters).
  • Train on TinyStories-Pho using TinyStories-Pho.tokenizer --> TinyStories-Pho.safetensors
  • Train on TinyStories-Pho using TinyStories-Pho-Mod.tokenizer --> TinyStories-Pho-Mod.safetensors
  • Train on TinyStories using TinyStories-Mod.tokenizer --> TinyStories-Mod.safetensors
    • Allows ruling out the possibility that the baseline TinyStories BPE parameters were simply bad to begin with
  • Compare models & summarize results
  • If results are positive, follow up: expand to other languages and redesign the tokenizer to make fewer assumptions about its data. The fewer assumptions, the better; encoding specific knowledge into the tokenizer might not be a great idea, per The Bitter Lesson
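
To make the tokenizer comparisons above concrete, here is a toy sketch of what "running BPE" over a corpus means: start from single-character symbols and repeatedly merge the most frequent adjacent pair, each merge adding one entry to the vocabulary. This is not the project's tooling, and the real runs would use a full BPE implementation with the paper's parameters; the five-word corpus and the merge count below are illustrative assumptions only.

    // Toy BPE trainer, for illustration only: learn a fixed number of merges over a
    // tiny corpus. The corpus and merge count here are arbitrary.
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    int main() {
        // Represent each word as a sequence of single-character symbols to start.
        std::vector<std::vector<std::string>> words;
        std::istringstream corpus("the theme there then these");
        for (std::string w; corpus >> w;) {
            std::vector<std::string> symbols;
            for (char c : w) symbols.emplace_back(1, c);
            words.push_back(std::move(symbols));
        }

        // Each merge adds one symbol to the vocabulary; a target vocabulary size
        // would be (number of base symbols + number of merges).
        const int num_merges = 6;
        for (int m = 0; m < num_merges; ++m) {
            // Count adjacent symbol pairs across all words.
            std::map<std::pair<std::string, std::string>, int> pair_counts;
            for (const auto &w : words)
                for (std::size_t i = 0; i + 1 < w.size(); ++i)
                    ++pair_counts[{w[i], w[i + 1]}];
            if (pair_counts.empty()) break;  // every word is already a single symbol

            // Find the most frequent pair...
            auto best = pair_counts.begin();
            for (auto it = pair_counts.begin(); it != pair_counts.end(); ++it)
                if (it->second > best->second) best = it;
            const std::string left = best->first.first, right = best->first.second;
            std::cout << "merge " << m + 1 << ": " << left << " + " << right << '\n';

            // ...and merge it everywhere it occurs.
            for (auto &w : words) {
                std::vector<std::string> merged;
                for (std::size_t i = 0; i < w.size(); ++i) {
                    if (i + 1 < w.size() && w[i] == left && w[i + 1] == right) {
                        merged.push_back(left + right);
                        ++i;
                    } else {
                        merged.push_back(w[i]);
                    }
                }
                w = std::move(merged);
            }
        }
        return 0;
    }

The link to the hypothesis above: a phonemized corpus collapses spelling variants that share a pronunciation onto the same symbol sequences, so frequent pairs concentrate and a given level of coverage may be reachable with fewer merges, i.e. a smaller vocabulary.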

License

GNU General Public License v3.0

