ByT5 encoder
kylebgorman opened this issue · comments
As a postlude to #72, I propose we make it possible to use ByT5 as the source encoder (e.g., `--source_encoder_arch byt5_base`) and/or the feature encoder. ByT5 is a byte-based pretrained transformer; in this mode we would be fine-tuning it.
This should become much easier to do upon completion of #72: we'd just implement a new encoder `ByT5` in `yoyodyne/models/modules/byt5.py`. In the constructor, you'd use the `transformers.T5EncoderModel.from_pretrained` class method to instantiate an encoder; there are four sizes (`small`, `base`, `large`, `xl`) and we could just add all four.
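A minimal sketch of what the new module might look like, assuming the wrapper class name (`ByT5Encoder`), the architecture-string-to-checkpoint mapping, and the constructor signature are all placeholders to be reconciled with the module interface that lands in #72; the `T5EncoderModel.from_pretrained` call is the real transformers API:

```python
import torch
from transformers import T5EncoderModel


class ByT5Encoder(torch.nn.Module):
    """Wraps a pretrained ByT5 encoder for fine-tuning (sketch)."""

    # Hypothetical mapping from --source_encoder_arch values to the
    # published HuggingFace checkpoints for the four sizes.
    CHECKPOINTS = {
        "byt5_small": "google/byt5-small",
        "byt5_base": "google/byt5-base",
        "byt5_large": "google/byt5-large",
        "byt5_xl": "google/byt5-xl",
    }

    def __init__(self, arch: str = "byt5_small"):
        super().__init__()
        # Downloads (or loads from cache) the pretrained encoder weights.
        self.encoder = T5EncoderModel.from_pretrained(self.CHECKPOINTS[arch])

    def forward(self, input_ids, attention_mask=None):
        # Returns the final hidden states, shape (batch, length, hidden).
        return self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
```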
I don't think I'd go about adding access to just any HuggingFace encoder, though, as their tokenizers will be incompatible with ours. If we think there are going to be a lot more of these, we could add some lightweight (i.e., built-in, not plug-in) registration mechanism that gives you one place to declare that HuggingFace encoder X is compatible with this library.
The tricky bit is: how does the model's tokenizer interact with our dataset config tokenization? Maybe we can just bypass theirs and add `byte` as a special-case separator option.
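Bypassing their tokenizer may be feasible because ByT5's scheme is trivial: token ids are just UTF-8 byte values offset by 3, with ids 0-2 reserved for `<pad>`, `</s>`, and `<unk>`. A sketch (the helper name is hypothetical, not part of any existing API):

```python
def byte_encode(text: str) -> list[int]:
    """Maps a string to ByT5 input ids by hand, no HF tokenizer needed.

    Each UTF-8 byte b becomes id b + 3 (ids 0, 1, 2 are reserved for
    <pad>, </s>, <unk>); the </s> id (1) is appended at the end.
    """
    return [b + 3 for b in text.encode("utf-8")] + [1]
```

So a `byte` separator option would only need to split input into bytes and apply this offset, keeping the rest of the dataset config pipeline unchanged.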
Here are some early notes on how to do this.