Lightning-AI / litgpt

Pretrain, finetune, deploy 20+ LLMs on your own data. Uses state-of-the-art techniques: flash attention, FSDP, 4-bit, LoRA, and more.

Home Page: https://lightning.ai

Make `save_hyperparameters()` robust against different CLI entry points

awaelchli opened this issue

If you run with

litgpt finetune ...

then, when a checkpoint is saved, we hit this line:

save_hyperparameters(setup, save_path.parent)

which re-runs the CLI and re-parses the arguments that were passed. This no longer works, because the parser it builds is not the same parser the litgpt entry point used:

Saving LoRA weights to 'out/finetune/lora-llama2-7b/step-000200/lit_model.pth.lora'
usage: litgpt [-h] [--config CONFIG] [--print_config[=flags]] [--precision PRECISION] [--quantize QUANTIZE] [--devices DEVICES] [--seed SEED] [--lora_r LORA_R]
              [--lora_alpha LORA_ALPHA] [--lora_dropout LORA_DROPOUT] [--lora_query {true,false}] [--lora_key {true,false}] [--lora_value {true,false}]
              [--lora_projection {true,false}] [--lora_mlp {true,false}] [--lora_head {true,false}] [--data.help CLASS_PATH_OR_NAME] [--data DATA]
              [--checkpoint_dir CHECKPOINT_DIR] [--out_dir OUT_DIR] [--logger_name {wandb,tensorboard,csv}] [--train CONFIG] [--train.save_interval SAVE_INTERVAL]
              [--train.log_interval LOG_INTERVAL] [--train.global_batch_size GLOBAL_BATCH_SIZE] [--train.micro_batch_size MICRO_BATCH_SIZE]
              [--train.lr_warmup_steps LR_WARMUP_STEPS] [--train.epochs EPOCHS] [--train.max_tokens MAX_TOKENS] [--train.max_steps MAX_STEPS]
              [--train.max_seq_length MAX_SEQ_LENGTH] [--train.tie_embeddings {true,false,null}] [--train.learning_rate LEARNING_RATE]
              [--train.weight_decay WEIGHT_DECAY] [--train.beta1 BETA1] [--train.beta2 BETA2] [--train.max_norm MAX_NORM] [--train.min_lr MIN_LR] [--eval CONFIG]
              [--eval.interval INTERVAL] [--eval.max_new_tokens MAX_NEW_TOKENS] [--eval.max_iters MAX_ITERS]
error: Unrecognized arguments: finetune lora
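
The mismatch can be reduced to re-parsing the unmodified sys.argv with a parser that only knows the setup function's signature. A minimal sketch of that (the setup signature below is a stand-in for illustration; the real code path goes through capture_parser inside save_hyperparameters):

import sys
from jsonargparse import CLI

def setup(precision: str = "bf16-true", devices: int = 1) -> None:
    """Stand-in for a script-level entry point such as litgpt.finetune.lora.setup."""

if __name__ == "__main__":
    # The top-level `litgpt` parser consumed the `finetune lora` subcommand tokens,
    # but sys.argv still contains them when a CLI built from `setup` alone re-parses:
    sys.argv = ["litgpt", "finetune", "lora", "--devices", "2"]
    CLI(setup)  # -> error: Unrecognized arguments: finetune lora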

An initial hack to fix this was done in #1103.
Comment by @carmocca
#1103 (comment)

How do you think this could be done? Do we need to choose between jsonargparse.CLI and the CLI in main, and then pass the correct one to capture_parser?

We could also simplify this by not having a CLI in the scripts themselves.

We need to make it more robust.

It's not clear how to make this more robust. Perhaps the best way is to drop support for running
python litgpt/finetune/lora.py, since we no longer advertise it.
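
One possible direction, sketched purely as an illustration (this is not what #1103 does, and the helper names below are hypothetical): have whichever parser actually handled the command line record the parsed config once, and have save_hyperparameters serialize that record, so nothing re-parses sys.argv at checkpoint time.

from pathlib import Path
import yaml

_HPARAMS: dict = {}

def capture_hyperparameters(config: dict) -> None:
    # Called exactly once by the entry point that parsed the arguments,
    # whether that was the `litgpt` subcommand parser or a script-level CLI.
    _HPARAMS.clear()
    _HPARAMS.update(config)

def save_hyperparameters(checkpoint_dir: Path) -> None:
    # No parser is rebuilt here; we only dump what was captured up front.
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    with open(checkpoint_dir / "hyperparameters.yaml", "w") as f:
        yaml.safe_dump(_HPARAMS, f)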

commented

This is also an issue with

litgpt pretrain -c config_hub/pretrain/tinystories.yaml

...
...
Total parameters: 15,192,288
/graft3/datasets/user/tinystories/TinyStories_all_data already exists, skipping unpacking...
Validating ...
Measured TFLOPs: 3.16
Epoch 1 | iter 80 step 20 | loss train: 10.311, val: n/a | iter time: 298.13 ms (step) remaining time: 1 day, 19:48:16
Saving checkpoint to '/graft3/checkpoints/user/ctt/out/pretrain/stories15M/step-00000020/lit_model.pth'
usage: litgpt [-h] [--config CONFIG] [--print_config[=flags]] [--model_name MODEL_NAME]
              [--model_config MODEL_CONFIG] [--out_dir OUT_DIR]
              [--initial_checkpoint_dir INITIAL_CHECKPOINT_DIR] [--resume RESUME]
              [--data.help CLASS_PATH_OR_NAME] [--data DATA] [--train CONFIG]
              [--train.n_ciphers N_CIPHERS] [--train.save_interval SAVE_INTERVAL]
              [--train.log_interval LOG_INTERVAL] [--train.global_batch_size GLOBAL_BATCH_SIZE]
              [--train.micro_batch_size MICRO_BATCH_SIZE]
              [--train.lr_warmup_steps LR_WARMUP_STEPS] [--train.epochs EPOCHS]
              [--train.max_tokens MAX_TOKENS] [--train.max_steps MAX_STEPS]
              [--train.max_seq_length MAX_SEQ_LENGTH] [--train.tie_embeddings {true,false,null}]
              [--train.learning_rate LEARNING_RATE] [--train.weight_decay WEIGHT_DECAY]
              [--train.beta1 BETA1] [--train.beta2 BETA2] [--train.max_norm MAX_NORM]
              [--train.min_lr MIN_LR] [--eval CONFIG] [--eval.interval INTERVAL]
              [--eval.max_new_tokens MAX_NEW_TOKENS] [--eval.max_iters MAX_ITERS]
              [--devices DEVICES] [--tokenizer_dir TOKENIZER_DIR]
              [--logger_name {wandb,tensorboard,csv}] [--seed SEED]
error: Unrecognized arguments: -c config_hub/pretrain/tinystories.yaml