Lightning-AI / litgpt

Pretrain, finetune, deploy 20+ LLMs on your own data. Uses state-of-the-art techniques: flash attention, FSDP, 4-bit, LoRA, and more.

Home Page: https://lightning.ai


Continual pretraining on custom data is not working: TextFiles is not recognized as a data argument.

karkeranikitha opened this issue · comments

Hi,

When I try to run the litgpt pretrain command for continued pretraining, I get the error below. For custom data training, the --data parameter should be TextFiles and --data.train_data_path should be a folder containing all the text files, as described in the README.

command: litgpt pretrain --model_name Llama-2-7b-hf --initial_checkpoint_dir checkpoints/meta-llama/Llama-2-7b-hf --data TextFiles --data.train_data_path custom_texts --out_dir out/custom-model

Error:
usage: litgpt [options] pretrain [-h] [-c CONFIG] [--print_config[=flags]]
[--model_name MODEL_NAME]
[--model_config MODEL_CONFIG]
[--out_dir OUT_DIR]
[--initial_checkpoint_dir INITIAL_CHECKPOINT_DIR]
[--resume RESUME]
[--data.help CLASS_PATH_OR_NAME]
[--data DATA] [--train CONFIG]
[--train.save_interval SAVE_INTERVAL]
[--train.log_interval LOG_INTERVAL]
[--train.global_batch_size GLOBAL_BATCH_SIZE]
[--train.micro_batch_size MICRO_BATCH_SIZE]
[--train.lr_warmup_steps LR_WARMUP_STEPS]
[--train.epochs EPOCHS]
[--train.max_tokens MAX_TOKENS]
[--train.max_steps MAX_STEPS]
[--train.max_seq_length MAX_SEQ_LENGTH]
[--train.tie_embeddings {true,false,null}]
[--train.learning_rate LEARNING_RATE]
[--train.weight_decay WEIGHT_DECAY]
[--train.beta1 BETA1] [--train.beta2 BETA2]
[--train.max_norm MAX_NORM]
[--train.min_lr MIN_LR] [--eval CONFIG]
[--eval.interval INTERVAL]
[--eval.max_new_tokens MAX_NEW_TOKENS]
[--eval.max_iters MAX_ITERS]
[--devices DEVICES]
[--tokenizer_dir TOKENIZER_DIR]
[--logger_name {wandb,tensorboard,csv}]
[--seed SEED]
error: Parser key "data":
Does not validate against any of the Union subtypes
Subtypes: (<class 'litgpt.data.base.DataModule'>, <class 'NoneType'>)
Errors:
- Expected a dot import path string: TextFiles
- Expected a <class 'NoneType'>
Given value type: <class 'str'>
Given value: TextFiles

Reference:
https://github.com/Lightning-AI/litgpt?tab=readme-ov-file#continue-pretraining-an-llm
https://lightning.ai/lightning-ai/studios/litgpt-continue-pretraining?tab=files&layout=column&path=cloudspaces%2F01hvpn545vfd8615mxjf3zsbgh&y=4&x=0

Can someone please help with this issue?

Thanks in advance.

Hi there,

The only issue I see in your command is that it is missing --tokenizer_dir. You can set it to

--tokenizer_dir checkpoints/meta-llama/Llama-2-7b-hf

assuming that your checkpoints/meta-llama/Llama-2-7b-hf folder has a tokenizer (it should be downloaded by default).
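Putting that together with the command from the original report, the full invocation would look like this (all flags and paths are taken from the report above; adjust them to your own setup):

```shell
# Continued pretraining on plain-text files, with --tokenizer_dir added.
# Paths assume the default checkpoint download location used in the report.
litgpt pretrain \
  --model_name Llama-2-7b-hf \
  --initial_checkpoint_dir checkpoints/meta-llama/Llama-2-7b-hf \
  --tokenizer_dir checkpoints/meta-llama/Llama-2-7b-hf \
  --data TextFiles \
  --data.train_data_path custom_texts \
  --out_dir out/custom-model
```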

However, a missing --tokenizer_dir produces a different, separate error:

[Screenshot: the error raised when --tokenizer_dir is missing]

Otherwise, the code you have there looks fine. I just tested it and it works without a problem:

[Screenshot: the command running successfully]

My best guess is that you have an older version of LitGPT installed that doesn't support TextFiles yet. I recommend installing LitGPT directly from GitHub:

pip install -U git+https://github.com/Lightning-AI/litgpt.git
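After upgrading, one quick way to confirm the installed build exposes the TextFiles data module is to import it directly (a sketch; the litgpt.data import path is inferred from the parser error above, which references litgpt.data.base.DataModule):

```shell
# Verify that the installed LitGPT package provides the TextFiles data module
python -c "from litgpt.data import TextFiles; print('TextFiles available')"
```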

@rasbt Thanks a lot! It's working now. Installing from GitHub directly resolved the issue.

Awesome, that's great to hear!