CUNY-CL / yoyodyne

Small-vocabulary sequence-to-sequence generation with optional feature conditioning


ByT5 encoder

kylebgorman opened this issue

As a postlude to #72, I propose we make it possible to use ByT5 as the source (e.g., --source_encoder_arch byt5_base) and/or feature encoder. ByT5 is a byte-based pretrained transformer; in this mode we would be fine-tuning it.

This should become much easier to do upon completion of #72: we'd just implement a new encoder module, ByT5, in yoyodyne/models/modules/byt5.py. In the constructor, you'd use the transformers.T5EncoderModel.from_pretrained class method to instantiate the encoder; there are four sizes (small, base, large, xl) and we could just add all four.
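Roughly, something like the following minimal sketch; the base class and constructor signature here are just placeholders, since the actual module interface depends on how #72 shakes out:

```python
import torch
from torch import nn
from transformers import T5EncoderModel


class ByT5Encoder(nn.Module):
    """Wraps a pretrained ByT5 encoder for fine-tuning."""

    def __init__(self, size: str = "base"):
        super().__init__()
        # Loads one of the pretrained checkpoints, e.g., google/byt5-base.
        self.encoder = T5EncoderModel.from_pretrained(f"google/byt5-{size}")

    def forward(
        self, input_ids: torch.Tensor, attention_mask: torch.Tensor
    ) -> torch.Tensor:
        # Final hidden states: batch_size x seq_len x hidden_size.
        return self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
```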

I don't think I'd go about adding access to just any HuggingFace encoder, though, since their tokenizers will be incompatible with our tokenization. If we think there are going to be a lot more of these, we could add some lightweight (i.e., built-in, not plug-in) registration mechanism that gives you one place to declare that HuggingFace encoder X is compatible with this library.
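If we do go that route, the registry could be as simple as a module-level dict plus a helper for declaring compatibility. The names below are just placeholders for illustration:

```python
# Hypothetical registry: one place to declare that a HuggingFace checkpoint
# is known to work with this library.
_COMPATIBLE_ENCODERS: dict[str, str] = {}


def register_hf_encoder(arch: str, checkpoint: str) -> None:
    """Declares that `checkpoint` backs --source_encoder_arch=arch."""
    _COMPATIBLE_ENCODERS[arch] = checkpoint


def hf_checkpoint(arch: str) -> str:
    """Looks up the checkpoint name for a registered arch."""
    try:
        return _COMPATIBLE_ENCODERS[arch]
    except KeyError:
        raise ValueError(f"Unsupported HuggingFace encoder arch: {arch}")


# The four ByT5 sizes proposed above.
register_hf_encoder("byt5_small", "google/byt5-small")
register_hf_encoder("byt5_base", "google/byt5-base")
register_hf_encoder("byt5_large", "google/byt5-large")
register_hf_encoder("byt5_xl", "google/byt5-xl")
```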

The tricky bit is how the model's tokenizer interacts with our dataset-config tokenization. Maybe we can just bypass theirs and add `byte` as a special-case separator option.
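Bypassing their tokenizer should be straightforward, since ByT5's vocabulary is just the 256 byte values plus a few reserved special tokens (`<pad>` = 0, `</s>` = 1, `<unk>` = 2), so each UTF-8 byte maps to its value plus 3. A rough sketch, with the function name and EOS handling as assumptions:

```python
def byte_tokenize(string: str, add_eos: bool = True) -> list[int]:
    """Maps a string to ByT5 input IDs without the HuggingFace tokenizer."""
    # Byte b maps to token ID b + 3; IDs 0-2 are reserved for special tokens.
    ids = [b + 3 for b in string.encode("utf-8")]
    if add_eos:
        ids.append(1)  # ByT5's </s>.
    return ids


assert byte_tokenize("abc", add_eos=False) == [100, 101, 102]
```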

Here are some early notes on how to do this.