stanford-crfm / mistral

Mistral: A strong, northwesterly wind: Framework for transparent and accessible large-scale language model training, built with Hugging Face 🤗 Transformers.

Mistral doesn't join docs with the <|endoftext|> separator

nostalgebraist opened this issue

Unlike GPT-2 and other GPT-style LMs, the Mistral codebase and pretrained models do not make use of the special <|endoftext|> token.

Evidence that this is true:

  1. When prompted with only this token, the pretrained models usually begin generating in the middle of a sentence (see the probe sketch after this list).
  2. If I understand correctly, this line in get_auto_dataset concatenates tokenized documents without inserting anything in between them.
  • If this code was used to prepare data for the pretrained models, that would explain the behavior noted in point 1.
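For reference, here's the kind of probe I mean (the "gpt2" id below is just a stand-in so the snippet runs as written; substitute the Hub id or local path of the Mistral checkpoint you want to test):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in checkpoint: replace "gpt2" with the Mistral checkpoint under test.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Prompt with nothing but the <|endoftext|> token and sample a continuation.
prompt_ids = tokenizer(tokenizer.eos_token, return_tensors="pt").input_ids
out = model.generate(prompt_ids, max_new_tokens=40, do_sample=True, top_p=0.95,
                     pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0]))
# A model trained with <|endoftext|> separators tends to start a fresh document
# here; the Mistral checkpoints instead tend to pick up mid-sentence.
```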

Was this a deliberate choice? Mistral follows GPT-2 carefully in other respects, so I'm surprised by this difference.

Also, concatenating documents without inserting a separator token seems sub-optimal from a language modeling perspective. At the boundaries between documents, it produces sudden discontinuities in style and content, and the resulting dataset makes it look to the LM as if such discontinuities were a feature of natural text, which they aren't.
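For concreteness, this is roughly what I'd expect the packing step to do (not the actual get_auto_dataset code; the function name, GPT-2 tokenizer, and block size here are illustrative assumptions):

```python
from transformers import AutoTokenizer

# Illustrative packing step: tokenize each document, append <|endoftext|>,
# then concatenate everything and cut it into fixed-length training blocks.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
block_size = 1024

def pack_documents(documents):
    ids = []
    for doc in documents:
        ids.extend(tokenizer(doc)["input_ids"])
        ids.append(tokenizer.eos_token_id)  # mark the document boundary
    # Drop the ragged tail so every block is exactly block_size tokens long.
    n_blocks = len(ids) // block_size
    return [ids[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]
```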

Hi @nostalgebraist,

Thanks so much for bringing this to our attention. I think this may have been an oversight on our part, and we simply didn't include the <|endoftext|> token. We wrote this pre-processing code a while ago, consulting some of the Hugging Face examples, and I thought we had since changed it (you may also want to open an issue in HF Transformers about this point).

Agreed that this is sub-optimal from an LM perspective. I've opened issue #91 to address it, and it'll get picked up by one of the development team members this week -- unless you'd like to try making the change yourself! It would be a great way to join the Mistral community, and since you spotted the problem, it would be fitting for you to land the fix 🙂.

Regardless, we'll fix this before training new models. For the current models, we hope they're still usable! If you want to use them for unprompted generation, you could probably fine-tune them cheaply on a smaller dataset with the EOS token in place (since it's already in the vocabulary), and the models should learn to use it fairly quickly.
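Something along these lines should work as a rough sketch (the "gpt2" checkpoint id and the WikiText-2 corpus are just placeholders, not Mistral specifics; swap in the released Mistral checkpoint and whatever small dataset you care about):

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, default_data_collator)

block_size = 512

# Placeholder ids: use the released Mistral checkpoint and your own corpus here.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")

def pack_with_eos(batch):
    # Append <|endoftext|> to every document, concatenate, and cut into blocks,
    # so the model sees the separator at real document boundaries during training.
    ids = []
    for text in batch["text"]:
        ids.extend(tokenizer(text)["input_ids"])
        ids.append(tokenizer.eos_token_id)
    n_blocks = len(ids) // block_size
    blocks = [ids[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]
    return {"input_ids": blocks, "labels": [list(b) for b in blocks]}

train_dataset = raw.map(pack_with_eos, batched=True, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="eos-finetune", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=train_dataset,
    data_collator=default_data_collator,
)
trainer.train()
```

Since every block has the same length, no padding is needed and the EOS positions keep their labels, so the model actually gets a training signal for <|endoftext|>.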

Thank you again for raising this! I'll leave the issue open to resolve any remaining questions.

Closing issue (saw that you reacted!) -- feel free to re-open if further questions arise.