AkariAsai / self-rag

This includes the original implementation of SELF-RAG: Learning to Retrieve, Generate and Critique through self-reflection by Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi.

Home Page: https://selfrag.github.io/


Is the FT script correct?

emrgnt-cmplxty opened this issue

Hi,

I ran the fine-tune script on the Mistral base model but got rather poor results on ARC Challenge (<50% with retrieval). Any ideas why? I will repeat with Mistral Instruct to see if it helps, but I am not optimistic, as I have seen similarly poor results when fine-tuning this model with the self-rag dataset and script.

MODEL_SIZE=7B
NUM_GPUS=8
BATCH_SIZE_PER_GPU=1
TOTAL_BATCH_SIZE=128
GRADIENT_ACC_STEPS=$(($TOTAL_BATCH_SIZE/$NUM_GPUS/$BATCH_SIZE_PER_GPU))
echo "Training llama model ${MODEL_SIZE} using $NUM_GPUS GPUs, $BATCH_SIZE_PER_GPU batch size per GPU, $GRADIENT_ACC_STEPS gradient accumulation steps"

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
    --mixed_precision bf16 \
    --num_machines 1 \
    --num_processes $NUM_GPUS \
    --use_deepspeed \
    --deepspeed_config_file stage3_no_offloading_accelerate.conf \
    finetune.py \
    --model_name_or_path mistralai/Mistral-7B-v0.1 \
    --use_flash_attn \
    --tokenizer_name mistralai/Mistral-7B-v0.1 \
    --use_slow_tokenizer \
    --train_file full_output_1005.jsonl \
    --max_seq_length 2048 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size $BATCH_SIZE_PER_GPU \
    --gradient_accumulation_steps $GRADIENT_ACC_STEPS \
    --learning_rate 2e-5 \
    --lr_scheduler_type linear \
    --warmup_ratio 0.03 \
    --weight_decay 0. \
    --num_train_epochs 5 \
    --output_dir output/mistral_root_${MODEL_SIZE}/ \
    --with_tracking \
    --report_to tensorboard \
    --logging_steps 1 \
    --use_special_tokens

EDIT: I had a chance to look into this today; I am fairly confident the issue is that this script will NOT work for a model whose tokenizer has not been independently prepared. Will confirm and close the issue - it might be nice to add some information on how to independently replicate the result when fine-tuning from scratch.

Hi @emrgnt-cmplxty, thank you for trying it out! I haven't tried Mistral myself yet, so I am not sure how it handles new special tokens... For Llama2-7B/13B or Llama1, we didn't have any issue adding special tokens.
Yes, I'd recommend double-checking whether the special tokens are properly added and used during fine-tuning (e.g., print out the tokenized output and see if the special tokens appear in the processed output).
Also, at one point my co-author @yizhongw and I found that the Llama2 tokenizer was inserting a lot of [UNK] tokens and hurting performance significantly when Yizhong was using a slightly older version of huggingface transformers. It might also help to double-check that the tokenizer output does not contain any stray [UNK] tokens.
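For example, a quick check along these lines (a rough sketch that uses only a handful of the Self-RAG reflection tokens; the full list and the exact loading code are in retrieval_lm/finetune.py) makes that kind of problem visible right away:

# Sanity-check sketch: verify the Self-RAG special tokens survive tokenization
# as single pieces and that nothing maps to [UNK]. The token list here is a
# subset; see retrieval_lm/finetune.py for the full set used in training.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", use_fast=False)
tokenizer.add_special_tokens({
    "additional_special_tokens": ["[Retrieval]", "[No Retrieval]", "[Relevant]",
                                  "[Irrelevant]", "<paragraph>", "</paragraph>"]
})

sample = "[Retrieval]<paragraph>a retrieved passage</paragraph>[Relevant]the answer"
ids = tokenizer(sample, add_special_tokens=False)["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))

# Each reflection/markup token should come back as exactly one piece, and no
# position should be the unknown-token id.
assert tokenizer.unk_token_id not in ids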

@AkariAsai Would it be a large hassle to outline how to extend the tokenizer with the process that you used? I think this would be very helpful for myself and for others. This would also allow us to use other training software, like Axolotl.

If you are loading model checkpoints from huggingface transformers, it only requires two lines of code.

  1. tokenizer.add_special_tokens to expand the special tokens - https://github.com/AkariAsai/self-rag/blob/main/retrieval_lm/finetune.py#L460
  2. model.resize_token_embeddings to expand the embedding size - https://github.com/AkariAsai/self-rag/blob/main/retrieval_lm/finetune.py#L481
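Roughly, that looks like the following (a sketch, not the exact code from finetune.py; the real script registers the full set of reflection tokens):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# 1. Register the Self-RAG reflection/markup tokens as special tokens
#    (subset shown here; finetune.py defines the full list).
tokenizer.add_special_tokens({
    "additional_special_tokens": ["[Retrieval]", "[No Retrieval]", "[Relevant]",
                                  "[Irrelevant]", "<paragraph>", "</paragraph>"]
})

# 2. Grow the embedding and LM-head matrices so the new vocabulary entries
#    get (randomly initialized) rows that are learned during fine-tuning.
model.resize_token_embeddings(len(tokenizer))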

I am not sure why the Mistral-7B fine-tuning got lower scores, though... I can take a look at fine-tuning Mistral early next week.

I could not get the finetune script in the directory to work on Mistral.

However, I ran the steps above to update the tokenizer of the model I wanted to further fine-tune (a Mistral already fine-tuned on textbooks) and then trained over the self-rag dataset with Axolotl. With my config I was able to complete two epochs in just a few hours. The outputted model is here. I measured an ARC Challenge score of 75% using the same args as described in the repo.

Great work self-rag team, this looks really impressive. I will have the full pipeline online and easy to access shortly.

EDIT: Doing two more epochs now, to see how further tuning impacts the scores.

Cool, congrats!! Thank you so much for all of the help & contributions!

If you are implementing your own fine-tuning script, another key thing for Self-RAG is to add context markup: the retrieved context should be surrounded by <paragraph> & </paragraph>.
I found that even without this our model often performs fine on open-domain QA or classification tasks, but for long-form QA it might be crucial (e.g., without it, the model starts generating paragraphs by itself).
https://github.com/AkariAsai/self-rag/blob/main/retrieval_lm/finetune.py#L274
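To illustrate what that markup looks like in a training example (the prompt template, the placement of the reflection tokens, and the example values below are all illustrative, not the repo's exact formatting):

# Illustrative only: wrap each retrieved passage in <paragraph> ... </paragraph>
# before splicing it into the sequence the model is trained on.
def add_context_markup(instruction: str, passage: str, answer: str) -> str:
    prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    target = f"[Retrieval]<paragraph>{passage}</paragraph>[Relevant]{answer}[Utility:5]"
    return prompt + target

print(add_context_markup(
    "What causes ocean tides?",
    "Tides are driven mainly by the gravitational pull of the Moon and Sun.",
    "Ocean tides are caused primarily by the Moon's gravity.",
))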

Oh, I see. I did not add this into my FT logic. I have set my completion rules to stop on <paragraph> tokens and everything seems to be working as expected. What is the impact of missing this logic? Is there any way to pre-compute this and then re-upload the data?

Either way, I will try to integrate the proper functionality for this into my workflow with Axolotl, though this is a framework I am still picking up.

Lastly, one thing I am noticing is that my FT'ed model attempts to retrieve after every completion when writing long-form content. Have you seen this before? Is it likely to be related to the failure to use the logic you outlined above?

EDIT - Disregard the first question. Reading through the code a second time, I now see that failing to mask the paragraph tokens will mean that the model is trained to predict them.
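For reference, the masking idea boils down to something like this (a hedged sketch, not the repo's exact implementation): label positions inside the <paragraph> ... </paragraph> span are set to -100, the ignore index of the cross-entropy loss, so the model learns to emit [Retrieval] but is never trained to generate the retrieved passage itself.

# Sketch: mask the retrieved-context span (including the markup tokens) out of
# the labels so it does not contribute to the LM loss. -100 is the default
# ignore_index used by HuggingFace causal-LM loss computation.
def mask_paragraph_spans(input_ids, labels, tokenizer):
    start_id = tokenizer.convert_tokens_to_ids("<paragraph>")
    end_id = tokenizer.convert_tokens_to_ids("</paragraph>")
    masked = list(labels)
    in_context = False
    for i, tok in enumerate(input_ids):
        if tok == start_id:
            in_context = True
        if in_context:
            masked[i] = -100
        if tok == end_id:
            in_context = False
    return masked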

First pass is online now - https://www.reddit.com/r/LocalLLaMA/comments/17knjfz/update_from_sciphi_introducing/?rdt=55834.

The model is looking quite powerful for the size. I am hopeful that more people will continue to build on the self-rag work.

This is fantastic news! Thank you so much for all the work! I'll add a mention of this model to our README.

I am closing this issue now, but feel free to reopen it!