bjoernpl / de_instruct


German Language LLM Instruction-Tuning

This README gives an overview of ideas for finetuning LLMs on German instruction data.

Related work

The model

There are a few valid model choices. The most applicable is likely LLaMA in the 65B-parameter variant; the 30B variant might also be sufficient.

Other options include Pythia, StableLM, Cerebras, GPT-NeoX-20B, ...

Finetuning on mixed translation / high-quality data

To best enable knowledge transfer, one idea is to finetune the model on a data mix of approximately the following (a sampling sketch follows the list):

  • 1/3 English -> German translated text (i.e. pairs of English text and the corresponding German translations)
  • 1/3 German -> English translated text
  • 1/3 high-quality German text (German Wikipedia or Project Gutenberg)
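As a rough illustration of that mix, the sketch below samples one training example from the three buckets. The variable names and the way translation pairs are concatenated are assumptions for illustration only, not fixed choices of this repo.

```python
import random

def sample_training_example(en_de_pairs, de_en_pairs, de_texts):
    """Draw one example according to the roughly 1/3 : 1/3 : 1/3 mix.
    `en_de_pairs` and `de_en_pairs` are lists of (source, target) tuples,
    `de_texts` is a list of plain German documents (Wikipedia, Gutenberg).
    All three inputs are placeholders here."""
    bucket = random.choice(["en_to_de", "de_to_en", "de_only"])
    if bucket == "en_to_de":
        en, de = random.choice(en_de_pairs)
        # How a translation pair is formatted into one sequence is an open design choice.
        return f"English: {en}\nGerman: {de}"
    if bucket == "de_to_en":
        de, en = random.choice(de_en_pairs)
        return f"German: {de}\nEnglish: {en}"
    return random.choice(de_texts)
```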

How to obtain this data?

Use an open-source translation model (No Language Left Behind, NLLB) to translate English Wikipedia to German (and perhaps vice versa to ensure high-quality German text). See here for more details.
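A minimal sketch of such a translation step using the Hugging Face transformers implementation of NLLB-200 (the distilled 600M checkpoint is used here for illustration; larger variants exist, and batching over Wikipedia is omitted):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

# Force the decoder to start generating in German (Latin script).
outputs = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("deu_Latn"),
    max_length=128,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```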

Finetuning on an instruct dataset

After language training, finetune the model on a diverse instruct dataset. Some thoughts:

  • meta-tags for conditional generation (funny, sad, pirate, angry, etc.); see the template sketch after this list
  • ...
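For the meta-tag idea, one possible prompt template could prepend a style tag to the instruction. The tag syntax and the German section headers below are purely illustrative assumptions, not a format defined by this repo:

```python
from typing import Optional

def format_example(instruction: str, response: str, style: Optional[str] = None) -> str:
    """Build one training string; the <style:...> tag and the section
    headers are illustrative assumptions."""
    tag = f"<style:{style}> " if style else ""
    return (
        f"### Anweisung:\n{tag}{instruction}\n\n"
        f"### Antwort:\n{response}"
    )

print(format_example(
    "Erkläre, warum der Himmel blau ist.",
    "Arr, das liegt an der Rayleigh-Streuung des Sonnenlichts ...",
    style="pirate",
))
```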

Multimodal support

Inspired by MiniGPT-4, and with architecture/training considerations from Microsoft's KOSMOS-1, adding image support is feasible.

  • Use a pretrained (optionally frozen) LLM (as described above)
  • Use a pretrained (frozen) image patch encoder (ViT, CLIP, DINOv2, etc.)
  • Send image features through a single linear layer and then into the model (see the sketch after this list)
  • Train on interleaved data. Alternatively, use a mix of interleaved, image-text pair, and text-only data as in KOSMOS-1.
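A minimal sketch of the projection step, assuming a frozen patch encoder and a frozen LLM; the module name and the dimensions are placeholders, not values from this repo:

```python
import torch
import torch.nn as nn

class VisualPrefix(nn.Module):
    """Map frozen image-encoder patch features into the LLM embedding space
    with a single linear layer and prepend them to the text embeddings."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)  # the only trainable part in this sketch

    def forward(self, patch_features: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from a frozen ViT/CLIP/DINOv2
        # text_embeds:    (batch, seq_len, llm_dim) from the LLM's input embedding layer
        image_prefix = self.proj(patch_features)
        return torch.cat([image_prefix, text_embeds], dim=1)
```

During training, only the projection (and optionally the LLM) would receive gradients; the image encoder stays frozen.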
