This README gives an overview of ideas for finetuning LLMs on German instruction data.
Existing German models and datasets:
- instruct-igel-001: IGEL is built on top of BigScience BLOOM and finetuned on a naively translated instruct dataset. SnipAId successfully uses IGEL for news article summarization.
- German BLOOM: finetuned on crawled German data.
- Guanaco: Llama 7B finetuned on the Guanaco dataset (chat/instructions in en, de, ja, zh).
- EuroInstructProject: a German instruction dataset.
There are a few valid model choices. The most applicable is likely Llama in its 65B-parameter variant; 30B might also be sufficient.
Other options include Pythia, StableLM, Cerebras-GPT, GPT-NeoX-20B, ...
To best enable knowledge transfer, one idea is to train the model on a data mix of approximately (a minimal sampling sketch follows the list):
- 1/3 English -> German translated text (i.e. pairs of English text and the corresponding German translation)
- 1/3 German -> English translated text
- 1/3 high-quality German text (e.g. German Wikipedia or Project Gutenberg)
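A minimal sketch of how such a mix could be assembled. The file names (`en_de.txt`, `de_en.txt`, `de_quality.txt`) and the one-example-per-line layout are assumptions for illustration, not part of an existing pipeline.

```python
import random

# Hypothetical source files, one training example per line.
SOURCES = {
    "en->de": "en_de.txt",           # English text paired with its German translation
    "de->en": "de_en.txt",           # German text paired with its English translation
    "de-quality": "de_quality.txt",  # high-quality German text (Wikipedia, Gutenberg)
}

# Roughly equal thirds, as proposed above.
WEIGHTS = {"en->de": 1 / 3, "de->en": 1 / 3, "de-quality": 1 / 3}


def sample_training_examples(n_examples, seed=0):
    """Draw a mixed batch of examples according to WEIGHTS."""
    rng = random.Random(seed)
    pools = {
        name: open(path, encoding="utf-8").read().splitlines()
        for name, path in SOURCES.items()
    }
    names = list(WEIGHTS)
    picks = rng.choices(names, weights=[WEIGHTS[n] for n in names], k=n_examples)
    return [rng.choice(pools[name]) for name in picks]


if __name__ == "__main__":
    for example in sample_training_examples(5):
        print(example)
```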
How to obtain this data?
Use an open-source translation model (No Language Left Behind) to translate English Wikipedia to German (and possibly vice versa to ensure high-quality German text). See here for more details.
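As a rough sketch of this step, the NLLB-200 checkpoints on the Hugging Face Hub can be driven like any seq2seq model; the distilled 600M checkpoint and the `max_length` below are illustrative choices, not a tested pipeline.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Distilled 600M checkpoint chosen for illustration; larger NLLB-200 variants exist.
MODEL_NAME = "facebook/nllb-200-distilled-600M"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)


def translate_en_to_de(sentences):
    """Translate a batch of English sentences into German."""
    inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(
        **inputs,
        # Force the decoder to start with the German language token.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("deu_Latn"),
        max_length=256,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)


print(translate_en_to_de(["The hedgehog lives in the garden."]))
```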
After language training, finetune the model on a diverse instruct dataset. Some thoughts:
- meta-tags for conditional generation (funny, sad, pirate, angry, etc.); see the formatting sketch after this list
- ...
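To make the meta-tag idea concrete, one possible (entirely hypothetical) prompt template prepends a style tag to each instruction/response pair before tokenization; the tag names and the template layout are assumptions, not an established format.

```python
# Hypothetical prompt template: a style meta-tag conditions the response.
TEMPLATE = (
    "### Stil: {style}\n"
    "### Anweisung:\n{instruction}\n"
    "### Antwort:\n{response}"
)

# Assumed set of style tags ("neutral", "funny", "sad", "pirate", "angry").
ALLOWED_STYLES = {"neutral", "lustig", "traurig", "pirat", "wütend"}


def format_example(instruction, response, style="neutral"):
    """Render one instruction/response pair with its conditioning meta-tag."""
    if style not in ALLOWED_STYLES:
        raise ValueError(f"unknown style tag: {style}")
    return TEMPLATE.format(style=style, instruction=instruction, response=response)


print(format_example(
    instruction="Erkläre, was ein Igel ist.",
    response="Ein Igel ist ein kleines, stacheliges Säugetier.",
    style="lustig",
))
```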
Inspired by MiniGPT-4, and with architecture/training considerations from Microsoft's KOSMOS-1, adding image support is feasible.
- Use a pretrained (optionally frozen) LLM (as described above)
- Use a pretrained (frozen) image patch encoder (ViT, CLIP, DINOv2, etc.)
- Send the image features through a single linear layer and then into the model (see the projection sketch below)
- Train on interleaved data. Alternatively, use a mix of interleaved, image-text pair, and text data, following KOSMOS-1.
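A minimal PyTorch sketch of the projection idea, assuming a frozen CLIP ViT as the patch encoder and an assumed LLM hidden size of 4096; the module name, checkpoint, and dimensions are illustrative, not a definitive implementation.

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

LLM_HIDDEN_SIZE = 4096  # assumed hidden size of the (frozen) language model


class ImagePrefix(nn.Module):
    """Frozen CLIP patch encoder followed by a single trainable linear projection."""

    def __init__(self, clip_name="openai/clip-vit-large-patch14"):
        super().__init__()
        self.encoder = CLIPVisionModel.from_pretrained(clip_name)
        for p in self.encoder.parameters():
            p.requires_grad = False  # keep the image encoder frozen
        self.proj = nn.Linear(self.encoder.config.hidden_size, LLM_HIDDEN_SIZE)

    def forward(self, pixel_values):
        # (batch, num_patches + 1, clip_hidden) -> (batch, num_patches + 1, llm_hidden)
        patch_states = self.encoder(pixel_values=pixel_values).last_hidden_state
        return self.proj(patch_states)


# The projected patch embeddings would be concatenated with the LLM's token
# embeddings (interleaved with text) before the transformer layers.
prefix = ImagePrefix()
dummy = torch.randn(1, 3, 224, 224)
print(prefix(dummy).shape)  # torch.Size([1, 257, 4096])
```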