ctrlaltf2 / phonemes

An experiment in tokenization, grounded in linguistics

/ˈfoʊ.nimz/

An experiment in tokenization

Motivation

Language is a signed-and-spoken-first construct; sight and sound are the senses through which it evolved. Writing, on the other hand, is a messy and imperfect mapping from the raw atoms of a spoken language onto a textual representation, yet we use it to train large language models with seemingly little thought given to where it comes from. Because of that messiness, there may be a more efficient way to process text streams than byte pair encoding (BPE) alone. My take on the problem is to map orthography back into its raw spoken form. This normalizes a lot of language, and should let a second pass of a standard tokenizer break text up into its morphemes more cleanly. My hypothesis is that the more tokens map directly onto single morphemes, the more efficiently a model can learn: less data, or fewer parameters, for the same level of performance. By targeting model efficiency, my hope is to make language model training and inference a bit more democratized and usable for the average person.

Introduction

  • TODO

Building

Hard Dependencies

  • CMake >= 3.27
  • A C++ compiler supporting C++20

Do the usual CMake dance of mkdir build && cd build && cmake .. && make and you should get a successful build. CMake handles all other dependencies. As of now, this project depends only on espeak-ng.

Roadmap

Tokenizer

  • Integrate espeak-ng into the C++ project
  • From C++, phonemize a string (see the sketch after this list)
  • Design an algorithm for splitting text into words, so that single words can be fed to espeak-ng
    • Adjacent words affect pronunciation in espeak-ng. While true to life, this is annoying and introduces noise into the dataset, likely about as much noise as I'm trying to remove by phonemizing in the first place.
    • The programming use case should still work: spaces shouldn't be collapsed, moved, or otherwise touched when going through this step. If two words have two spaces between them, that should stay. Phonemization should only affect the words themselves.
  • Implement a fast, parallel data preprocessor to phonemize text en masse (this repository)
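
As a rough illustration of the phonemization items above, here is a minimal sketch (not this repository's implementation) that phonemizes a line word by word with espeak-ng's C API while copying whitespace through untouched. It assumes espeak_Initialize, espeak_SetVoiceByName, and espeak_TextToPhonemes from espeak-ng's speak_lib.h; the voice name and the IPA phoneme-mode flag are assumptions that should be checked against the installed espeak-ng version.

    // Sketch only: phonemize each word of a line, leaving all whitespace exactly as-is.
    // Header path may be <espeak/speak_lib.h> on some installs.
    #include <cctype>
    #include <iostream>
    #include <string>

    #include <espeak-ng/speak_lib.h>

    // Phonemize a single word; fall back to the original word if espeak-ng returns nothing.
    std::string phonemize_word(const std::string &word) {
        const void *text = word.c_str();
        // espeakPHONEMES_IPA asks for IPA output; exact flag semantics vary by version.
        const char *phonemes = espeak_TextToPhonemes(&text, espeakCHARS_UTF8, espeakPHONEMES_IPA);
        return phonemes ? std::string{phonemes} : word;
    }

    int main() {
        // Text-to-phoneme translation only; no audio playback is needed.
        if (espeak_Initialize(AUDIO_OUTPUT_SYNCHRONOUS, 0, nullptr, 0) < 0) {
            std::cerr << "failed to initialize espeak-ng\n";
            return 1;
        }
        espeak_SetVoiceByName("en-us");  // assumes an English voice is available

        const std::string line = "Once  upon a time";  // the double space must survive
        std::string out;
        std::size_t i = 0;
        while (i < line.size()) {
            if (std::isspace(static_cast<unsigned char>(line[i]))) {
                out += line[i++];  // copy whitespace through verbatim
            } else {
                const std::size_t start = i;
                while (i < line.size() && !std::isspace(static_cast<unsigned char>(line[i])))
                    ++i;
                out += phonemize_word(line.substr(start, i - start));
            }
        }
        std::cout << out << '\n';

        espeak_Terminate();
        return 0;
    }

Phonemizing one word at a time deliberately gives up espeak-ng's cross-word pronunciation effects in exchange for a deterministic, whitespace-preserving mapping, which matches the splitting requirements above.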

Training

  • Create a derivative dataset of TinyStories (arXiv:2305.07759) that is the phonemized version of the corpus --> TinyStories-Pho
  • Run BPE over TinyStories-Pho using the paper's BPE parameters --> TinyStories-Pho.tokenizer
  • Run WordPiece over TinyStories-Pho and do a spot check comparison on how it tokenizes vs. BPE. WordPiece may synergize with the Pho dataset better than BPE.
  • Run BPE over TinyStories using the paper's BPE parameters if available --> TinyStories.tokenizer
  • Run BPE over TinyStories-Pho using a tuned set of BPE parameters --> TinyStories-Pho-Mod.tokenizer
  • Run BPE over TinyStories using the same tuned set of BPE parameters --> TinyStories-Mod.tokenizer
    • The hypothesis is that phonemization may reduce the "optimal" vocabulary size for the tokenizer because of the normalizing effect it has on language (see the toy BPE sketch after this list).
  • Set up ColossalAI and train a simple test model on my RTX 3060 12GB + 96GB machine
  • Set up training code for the TinyStories model
  • Train on TinyStories using TinyStories.tokenizer --> TinyStories.safetensors
    • Verify the output of the TinyStories model. Results should be at least broadly similar to the paper's. Due to the design of this study, they don't have to match exactly, because this training process will be repeated with only one variable changed at a time (the tokenizer, then the tokenizer parameters).
  • Train on TinyStories-Pho using TinyStories-Pho.tokenizer --> TinyStories-Pho.safetensors
  • Train on TinyStories-Pho using TinyStories-Pho-Mod.tokenizer --> TinyStories-Pho-Mod.safetensors
  • Train on TinyStories using TinyStories-Mod.tokenizer --> TinyStories-Mod.safetensors
    • Allows ruling out the possibility that the baseline TinyStories BPE parameters were simply bad to begin with
  • Compare models & summarize results
  • If results are positive, follow up: expand to other languages and redesign the tokenizer to make fewer assumptions about its data. The fewer assumptions, the better; encoding specific knowledge into the tokenizer might not be a great idea, per The Bitter Lesson
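
To make the tokenizer comparisons above concrete, here is a toy sketch of what "running BPE" over a corpus means: start from single-character symbols and repeatedly merge the most frequent adjacent pair, each merge adding one entry to the vocabulary. This is not the project's tooling, and the real runs would use a full BPE implementation with the paper's parameters; the five-word corpus and the merge count below are illustrative assumptions only.

    // Toy BPE trainer, for illustration only: learn a fixed number of merges over a
    // tiny corpus. The corpus and merge count here are arbitrary.
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    int main() {
        // Represent each word as a sequence of single-character symbols to start.
        std::vector<std::vector<std::string>> words;
        std::istringstream corpus("the theme there then these");
        for (std::string w; corpus >> w;) {
            std::vector<std::string> symbols;
            for (char c : w) symbols.emplace_back(1, c);
            words.push_back(std::move(symbols));
        }

        // Each merge adds one symbol to the vocabulary; a target vocabulary size
        // would be (number of base symbols + number of merges).
        const int num_merges = 6;
        for (int m = 0; m < num_merges; ++m) {
            // Count adjacent symbol pairs across all words.
            std::map<std::pair<std::string, std::string>, int> pair_counts;
            for (const auto &w : words)
                for (std::size_t i = 0; i + 1 < w.size(); ++i)
                    ++pair_counts[{w[i], w[i + 1]}];
            if (pair_counts.empty()) break;  // every word is already a single symbol

            // Find the most frequent pair...
            auto best = pair_counts.begin();
            for (auto it = pair_counts.begin(); it != pair_counts.end(); ++it)
                if (it->second > best->second) best = it;
            const std::string left = best->first.first, right = best->first.second;
            std::cout << "merge " << m + 1 << ": " << left << " + " << right << '\n';

            // ...and merge it everywhere it occurs.
            for (auto &w : words) {
                std::vector<std::string> merged;
                for (std::size_t i = 0; i < w.size(); ++i) {
                    if (i + 1 < w.size() && w[i] == left && w[i + 1] == right) {
                        merged.push_back(left + right);
                        ++i;
                    } else {
                        merged.push_back(w[i]);
                    }
                }
                w = std::move(merged);
            }
        }
        return 0;
    }

The link to the hypothesis above: a phonemized corpus collapses spelling variants that share a pronunciation onto the same symbol sequences, so frequent pairs concentrate and a given level of coverage may be reachable with fewer merges, i.e. a smaller vocabulary.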

License

GNU General Public License v3.0

