Japanese LLaMa experiment


Japanese version: README-ja.md

Status

  • Japanese dataset pre-cleaning
  • Japanese dataset quality filtering
  • Japanese dataset dedup
  • Incremental pre-training
  • Fine-tuning with a Japanese fine-tuning dataset

Requirements

  • (Mini)conda
  • Python 3.10+
    • Python 3.8+ may work.
  • CMake and C++17 compiler
    • Install via sudo apt-get install build-essential for Ubuntu, or
    • conda install -c conda-forge cxx-compiler
    • conda install -c conda-forge cmake

Setup

To prepare the Japanese dataset

  • KenLM

Build and install the KenLM Python module.

$ sudo apt install build-essential cmake libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev

$ git clone https://github.com/kpu/kenlm
$ cd kenlm
$ python setup.py bdist_wheel
$ python -m pip install -U dist/kenlm*.whl
  • sentencepiece
$ sudo apt install sentencepiece
  • Download the pretrained SentencePiece and KenLM models (for Japanese)
$ bash download_lm.sh
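
The downloaded models can be used for perplexity-based quality scoring (see the 04_lm_scoring step). Below is a minimal sketch, not the repository's actual scoring code; the file names ja.sp.model and ja.arpa.bin are placeholders for whatever download_lm.sh actually fetches.

# Sketch: score Japanese text with pretrained SentencePiece + KenLM models.
# File names are placeholders; substitute the paths written by download_lm.sh.
import kenlm
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="ja.sp.model")
lm = kenlm.Model("ja.arpa.bin")

def perplexity(text: str) -> float:
    # KenLM expects the same tokenization it was trained on, so feed it
    # space-joined SentencePiece pieces.
    pieces = sp.encode(text, out_type=str)
    return lm.perplexity(" ".join(pieces))

print(perplexity("吾輩は猫である。名前はまだ無い。"))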

Install

Set up the Python environment using conda.

We need to create two conda environments, since the spacy-transformers module (used by the ginza module) requires an older transformers version that does not support the Llama classes (importing LlamaTokenizer from transformers fails).

$ conda create -n jp-llama-experiment python=3.10
$ conda activate jp-llama-experiment
$ python -m pip install -r requirements.txt
$ conda deactivate
$ conda create -n jp-llama-experiment-nlp python=3.10
$ conda activate jp-llama-experiment-nlp
$ python -m pip install -r requirements-ja-nlp.txt

Steps

  1. Download datasets.
  2. Run dataset cleaner
  3. Train Japanese Tokenizer
  4. Merge Japanese Tokenizer into LLaMa Tokenizer
  5. LoRA incremental training using Japanese Tokenizer
  6. Fine-tune with a Japanese dataset (e.g. Alpaca)

Download datasets

This is a required step to train the tokenizer, build the KenLM model, etc.

  • cc100ja
  • mc4 ja
  • OSCAR2301 ja
  • wiki40b/ja

See 00_download_dataset for details.
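
The scripts in 00_download_dataset are the canonical way to fetch these corpora. As a rough illustration only, cc100 ja can also be peeked at by streaming it through the Hugging Face datasets library:

# Illustration only: stream a few cc100 ja records via Hugging Face `datasets`.
# The actual download scripts live in 00_download_dataset.
from datasets import load_dataset

ds = load_dataset("cc100", lang="ja", split="train", streaming=True)
for i, example in enumerate(ds):
    print(example["text"].strip())
    if i >= 4:  # peek at the first few records only
        break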

Run dataset cleaner & dedup

  • 01_prepare_dataset
  • 02_normalize/
  • 03_clean_step1/
  • 04_lm_scoring/
  • 05_dedup/
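
The concrete rules live in the directories above; the snippet below is only a rough sketch of the normalize/clean idea (NFKC normalization plus a toy line-wise filter with made-up thresholds).

# Rough sketch of 02_normalize / 03_clean_step1-style processing.
# The thresholds and rules here are illustrative, not the repository's actual ones.
import re
import unicodedata

def normalize(line: str) -> str:
    # NFKC normalization folds full-width ASCII etc.; then squeeze whitespace.
    line = unicodedata.normalize("NFKC", line)
    return re.sub(r"\s+", " ", line).strip()

def keep(line: str) -> bool:
    # Drop very short lines and lines with few Japanese characters.
    if len(line) < 10:
        return False
    jp = sum(1 for ch in line
             if "\u3040" <= ch <= "\u30ff" or "\u4e00" <= ch <= "\u9fff")
    return jp / len(line) > 0.5

sample = "これはテスト行です。ＡＢＣ１２３も正規化されます。"
cleaned = normalize(sample)
print(cleaned, keep(cleaned))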

Train Japanese Tokenizer

W.I.P.

See the note (in Japanese) on training a Japanese tokenizer on cc100 ja with Hugging Face tokenizers for details: https://zenn.dev/syoyo/articles/8647ae42a3be63

Train the Japanese tokenizer from cc100 ja. This will download 40 GB of the cc100 ja dataset (75 GB uncompressed).

128 GB of CPU memory is required to train the Japanese tokenizer. After downloading the dataset, run train_jp_tokenizer.py.
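
A minimal sketch of such training with the Hugging Face tokenizers library, in the spirit of the linked note; the input file name, vocab size, and special tokens are placeholders (the actual settings are in train_jp_tokenizer.py).

# Sketch: train a Unigram tokenizer on cc100 ja text with Hugging Face `tokenizers`.
# "cc100_ja.txt", the vocab size, and the special tokens are placeholders.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.Unigram())
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()

trainer = trainers.UnigramTrainer(
    vocab_size=32000,
    special_tokens=["<unk>", "<s>", "</s>"],
    unk_token="<unk>",
)

tokenizer.train(files=["cc100_ja.txt"], trainer=trainer)
tokenizer.save("ja_tokenizer.json")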

Merge Japanese Tokenizer vocab into LLaMa tokenizer

T.B.W.
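
Until this section is written up, here is a sketch of the commonly used recipe (e.g. the Chinese-LLaMA tokenizer merge) of appending new SentencePiece pieces to the LLaMA tokenizer's model proto. It assumes the Japanese tokenizer is available as (or converted to) a SentencePiece .model file; all file names are placeholders.

# Sketch of the common SentencePiece-merge recipe, not this repository's exact script.
# Assumes both tokenizers are SentencePiece models; file names are placeholders.
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

llama = sp_pb2.ModelProto()
llama.ParseFromString(open("llama/tokenizer.model", "rb").read())

ja = sp_pb2.ModelProto()
ja.ParseFromString(open("ja_tokenizer.model", "rb").read())

# Append every Japanese piece that the LLaMa tokenizer does not already have.
existing = {p.piece for p in llama.pieces}
for p in ja.pieces:
    if p.piece not in existing:
        new_piece = sp_pb2.ModelProto.SentencePiece()
        new_piece.piece = p.piece
        new_piece.score = 0.0
        llama.pieces.append(new_piece)

with open("merged_tokenizer.model", "wb") as f:
    f.write(llama.SerializeToString())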

Incremental training using Japanese Tokenizer

This training step takes a long time.

T.B.W.
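
Until details are written, below is a minimal sketch of attaching LoRA adapters with the PEFT library for continued pre-training. The model path, target modules, and hyperparameters are placeholders; the embedding resize is only needed if the merged tokenizer enlarged the vocabulary.

# Sketch: LoRA setup with PEFT for incremental pre-training (placeholder values).
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import LoraConfig, get_peft_model

tokenizer = LlamaTokenizer.from_pretrained("path/to/merged_tokenizer")
model = LlamaForCausalLM.from_pretrained("path/to/llama-base")

# If the merged tokenizer added Japanese vocab, grow the embedding table to match.
model.resize_token_embeddings(len(tokenizer))

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable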

Fine-tune with a Japanese dataset (e.g. Alpaca)

T.B.W.
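
As a rough sketch while this is being written, an Alpaca-style Japanese instruction dataset can be flattened into training prompts like this; the file name and prompt template are placeholders.

# Sketch: turn Alpaca-style records into prompt/response training text.
# "japanese_alpaca.json" and the prompt template are placeholders.
import json

PROMPT = (
    "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。\n\n"
    "### 指示:\n{instruction}\n\n### 応答:\n{output}"
)

with open("japanese_alpaca.json", encoding="utf-8") as f:
    records = json.load(f)

texts = [PROMPT.format(instruction=r["instruction"], output=r["output"])
         for r in records]
print(texts[0])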

TODO

  • Japanese specific line-wise filtering
  • Exact Dedup using Suffix Array

License

MIT license unless other licensing terms are explicitly noted. Some scripts are licensed under Apache 2.0 or BSD.

Third party licenses
