This README gives an overview of ideas for finetuning LLMs on German instruction data.
Existing German models and datasets:
- instruct-igel-001: IGEL is built on top of BigScience BLOOM and finetuned on a naively translated instruct dataset. SnipAId successfully uses IGEL for news article summarization.
- German BLOOM: finetuned on crawled German data.
- Guanaco: Llama 7B finetuned on the Guanaco dataset (chat/instructions in en, de, ja, zh).
- EuroInstructProject: a German instruction dataset.
There are a few valid model choices. The most applicable is likely Llama in its 65B-parameter variant; 30B might also be sufficient.
Other options include Pythia, StableLM, Cerebras-GPT, GPT-NeoX-20B, ...
To best enable knowledge transfer, one idea is to train the model on a data mix of approximately (a minimal sampling sketch follows the list):
- 1/3 English -> German translated text (i.e. pairs of English text and the corresponding German translation)
- 1/3 German -> English translated text
- 1/3 high-quality German text (e.g. German Wikipedia or Project Gutenberg)
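A minimal sketch of how such a mix could be assembled. The file names (`en_de.txt`, `de_en.txt`, `de_quality.txt`) and the one-example-per-line layout are assumptions for illustration, not part of an existing pipeline.

```python
import random

# Hypothetical source files, one training example per line.
SOURCES = {
    "en->de": "en_de.txt",           # English text paired with its German translation
    "de->en": "de_en.txt",           # German text paired with its English translation
    "de-quality": "de_quality.txt",  # high-quality German text (Wikipedia, Gutenberg)
}

# Roughly equal thirds, as proposed above.
WEIGHTS = {"en->de": 1 / 3, "de->en": 1 / 3, "de-quality": 1 / 3}


def sample_training_examples(n_examples, seed=0):
    """Draw a mixed batch of examples according to WEIGHTS."""
    rng = random.Random(seed)
    pools = {
        name: open(path, encoding="utf-8").read().splitlines()
        for name, path in SOURCES.items()
    }
    names = list(WEIGHTS)
    picks = rng.choices(names, weights=[WEIGHTS[n] for n in names], k=n_examples)
    return [rng.choice(pools[name]) for name in picks]


if __name__ == "__main__":
    for example in sample_training_examples(5):
        print(example)
```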
How to obtain this data?
Use an open-source translation model (No Language Left Behind) to translate English Wikipedia to German (and possibly vice versa to ensure high-quality German text). See here for more details.
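As a rough sketch of this step, the NLLB-200 checkpoints on the Hugging Face Hub can be driven like any seq2seq model; the distilled 600M checkpoint and the `max_length` below are illustrative choices, not a tested pipeline.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Distilled 600M checkpoint chosen for illustration; larger NLLB-200 variants exist.
MODEL_NAME = "facebook/nllb-200-distilled-600M"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)


def translate_en_to_de(sentences):
    """Translate a batch of English sentences into German."""
    inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(
        **inputs,
        # Force the decoder to start with the German language token.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("deu_Latn"),
        max_length=256,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)


print(translate_en_to_de(["The hedgehog lives in the garden."]))
```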
After language training, finetune the model on a diverse instruct dataset. Some thoughts:
- meta-tags for conditional generation (funny, sad, pirate, angry, etc.); see the formatting sketch after this list
- ...
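To make the meta-tag idea concrete, one possible (entirely hypothetical) prompt template prepends a style tag to each instruction/response pair before tokenization; the tag names and the template layout are assumptions, not an established format.

```python
# Hypothetical prompt template: a style meta-tag conditions the response.
TEMPLATE = (
    "### Stil: {style}\n"
    "### Anweisung:\n{instruction}\n"
    "### Antwort:\n{response}"
)

# Assumed set of style tags ("neutral", "funny", "sad", "pirate", "angry").
ALLOWED_STYLES = {"neutral", "lustig", "traurig", "pirat", "wütend"}


def format_example(instruction, response, style="neutral"):
    """Render one instruction/response pair with its conditioning meta-tag."""
    if style not in ALLOWED_STYLES:
        raise ValueError(f"unknown style tag: {style}")
    return TEMPLATE.format(style=style, instruction=instruction, response=response)


print(format_example(
    instruction="Erkläre, was ein Igel ist.",
    response="Ein Igel ist ein kleines, stacheliges Säugetier.",
    style="lustig",
))
```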
Inspired by MiniGPT-4, and with architecture/training considerations from Microsoft's KOSMOS-1, adding image support is feasible.
- Use a pretrained (optionally frozen) LLM (as described above)
- Use a pretrained (frozen) image patch encoder (ViT, CLIP, DINOv2, etc.)
- Send the image features through a single linear layer and then into the model (see the projection sketch below)
- Train on interleaved data. Alternatively, use a mix of interleaved, image-text pair, and text data, following KOSMOS-1.
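A minimal PyTorch sketch of the projection idea, assuming a frozen CLIP ViT as the patch encoder and an assumed LLM hidden size of 4096; the module name, checkpoint, and dimensions are illustrative, not a definitive implementation.

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

LLM_HIDDEN_SIZE = 4096  # assumed hidden size of the (frozen) language model


class ImagePrefix(nn.Module):
    """Frozen CLIP patch encoder followed by a single trainable linear projection."""

    def __init__(self, clip_name="openai/clip-vit-large-patch14"):
        super().__init__()
        self.encoder = CLIPVisionModel.from_pretrained(clip_name)
        for p in self.encoder.parameters():
            p.requires_grad = False  # keep the image encoder frozen
        self.proj = nn.Linear(self.encoder.config.hidden_size, LLM_HIDDEN_SIZE)

    def forward(self, pixel_values):
        # (batch, num_patches + 1, clip_hidden) -> (batch, num_patches + 1, llm_hidden)
        patch_states = self.encoder(pixel_values=pixel_values).last_hidden_state
        return self.proj(patch_states)


# The projected patch embeddings would be concatenated with the LLM's token
# embeddings (interleaved with text) before the transformer layers.
prefix = ImagePrefix()
dummy = torch.randn(1, 3, 224, 224)
print(prefix(dummy).shape)  # torch.Size([1, 257, 4096])
```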