YuanGongND / ltu

Code, Dataset, and Pretrained Models for Audio and Speech Large Language Model "Listen, Think, and Understand".

LTU_AS ASR Task

dingdongwang opened this issue

Hello, thank you for providing such a good research idea on audio question answering. I have some questions about LTU-AS:

  1. For the ASR task: during inference (see inference_batch.py), why does the input contain both the acoustic information (cur_audio_input) and the ground-truth transcription as part of the input_ids generated by prompter.generate_prompt()? In that case, the LLM only needs to repeat the transcription contained in the prompt to generate the answer; it does not need to use the acoustic information at all.

  2. Which parts differ from the official Hugging Face Transformers, Hugging Face PEFT, and openai-whisper? And if I want to make some modifications based on this codebase, which sub-folders of hf-dev/transformers-main/, peft-main/, and whisper/ should I look at?

  3. During inference I used ltuas_long_noqa_a6.bin (the default option in inference_batch.py); is this the original base model? I still don't understand what the different model checkpoint files mean, or the different versions of the LTU-AS models mentioned in the README. Could you please give me more information?

Thank you so much for your time!

For the ASR task: during inference (see inference_batch.py), why does the input contain both the acoustic information (cur_audio_input) and the ground-truth transcription as part of the input_ids generated by prompter.generate_prompt()? In that case, the LLM only needs to repeat the transcription contained in the prompt to generate the answer; it does not need to use the acoustic information at all.

This is correct. However, the model also needs to follow the ASR instruction (i.e., know it needs to do ASR when asked) in addition to remembering the input text. In practice, the WER of LTU-AS is higher (i.e., worse) than that of its internal Whisper model, because it occasionally does not follow the instruction or adds comments to the transcribed text. Please check Section 5.1 of the paper.
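As a rough illustration of the situation (the template below is a stand-in, not LTU-AS's actual prompt format; only cur_audio_input and prompter.generate_prompt are names from the repo), the transcript is indeed already in the text prompt, but the model still has to follow the ASR instruction:

  # Minimal, self-contained sketch; the template is NOT the exact LTU-AS prompt.

  def generate_prompt(instruction: str, text_input: str) -> str:
      # stand-in for prompter.generate_prompt() in inference_batch.py
      return (
          "Below is an instruction that describes a task, paired with an input.\n"
          f"### Instruction:\n{instruction}\n"
          f"### Input:\n{text_input}\n"
          "### Response:\n"
      )

  instruction = "Please transcribe the speech."        # the ASR instruction the model must follow
  whisper_transcript = "the quick brown fox jumps"     # decoded by the internal Whisper model

  text_prompt = generate_prompt(instruction, whisper_transcript)
  print(text_prompt)

  # In inference_batch.py the model additionally receives cur_audio_input, so the LLM sees
  # roughly [audio embeddings] + [tokenized text_prompt]. It could in principle copy the
  # transcript straight out of the prompt, but it still has to recognize the ASR instruction
  # and avoid adding commentary -- the failure modes mentioned above that raise the WER.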

Which parts differ from the official Hugging Face Transformers, Hugging Face PEFT, and openai-whisper? And if I want to make some modifications based on this codebase, which sub-folders of hf-dev/transformers-main/, peft-main/, and whisper/ should I look at?

Please check https://github.com/YuanGongND/ltu#important-code. The changes are scattered throughout, and some are not documented at that link, but every change has a purpose.
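If you want to locate the undocumented changes yourself, one option is to diff the bundled copies against clean checkouts of the matching upstream releases. A minimal sketch (the upstream paths and exact sub-folders are assumptions; make sure the upstream versions match the ones the repo was forked from):

  # Sketch: recursively report files that differ between a bundled copy and upstream.
  import filecmp

  # (bundled copy in this repo, clean upstream checkout) -- paths are assumptions
  PAIRS = [
      ("hf-dev/transformers-main/src/transformers", "/path/to/upstream/transformers/src/transformers"),
      ("peft-main/src/peft", "/path/to/upstream/peft/src/peft"),
      ("whisper/whisper", "/path/to/upstream/whisper/whisper"),
  ]

  def report_diffs(bundled: str, upstream: str) -> None:
      cmp = filecmp.dircmp(bundled, upstream)
      for name in cmp.diff_files:
          print(f"modified: {bundled}/{name}")
      for name in cmp.left_only:
          print(f"only in bundled copy: {bundled}/{name}")
      for sub in cmp.common_dirs:
          report_diffs(f"{bundled}/{sub}", f"{upstream}/{sub}")

  for bundled, upstream in PAIRS:
      report_diffs(bundled, upstream)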

During inference I used ltuas_long_noqa_a6.bin (the default option in inference_batch.py); is this the original base model? I still don't understand what the different model checkpoint files mean, or the different versions of the LTU-AS models mentioned in the README. Could you please give me more information?

No. long means a longer sequence length was used in training (so the model is likely to give longer answers); noqa means questions that do not have a conclusive answer were excluded (so the model is less likely to say "I don't know"). We provide an "Original in Paper" model that is exactly the same as the model described in the paper and is guaranteed to have the same performance as reported.

However, ltuas_long_noqa_a6.bin might be a better model in practice. We document the sequence length and settings of each model at https://github.com/YuanGongND/ltu#pretrained-models.
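As a quick sanity check on what a given checkpoint file contains, the released .bin files are ordinary PyTorch state dicts (see the "contains state dict" comment in finetune.py quoted below) and can be inspected directly; the path here is just an example:

  # Sketch: inspect a released checkpoint to see which modules it covers and how large it is.
  import torch

  state_dict = torch.load("../../../pretrained_mdls/ltuas_long_noqa_a6.bin", map_location="cpu")

  # Group parameter names by their top-level module prefix.
  prefixes = sorted({k.split(".")[0] for k in state_dict})
  print(prefixes)
  print(f"{len(state_dict)} tensors, "
        f"{sum(v.numel() for v in state_dict.values()) / 1e6:.1f}M parameters in this file")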

-Yuan

Thanks so much for your reply!

I have another question, about finetune.py around line 91:

  # trick to load checkpoints correctly from HF
  if '../../../pretrained_mdls/vicuna_ltuas/' not in base_model:
      # start from a different model with original vicuna
      # temporally first load the original vicuna, then load the actual checkpoint
      start_model = base_model # need to point to a specific bin file that contains state dict.
      # TODO: change to your vicuna_tltr path
      base_model = '../../../pretrained_mdls/vicuna_ltuas/'
      print('Will load from {:s} later, for implementation purpose, first load from {:s}'.format(start_model, base_model))
  else:
      start_model = None

If the base_model is not the original vicuna_ltuas, why is base_model changed to vicuna_ltuas instead of to the actual checkpoint?

Also, for stage 1 training, why should base_model be vicuna_ltuas/ instead of the original official Vicuna model?

Thank you again!

If the base_model is not the original vicuna_ltuas, why is base_model changed to vicuna_ltuas instead of to the actual checkpoint?

Please check #14.

Also, for stage 1 training, why should base_model be vicuna_ltuas/ instead of the original official Vicuna model?

vicuna_ltuas also includes the pretrained audio encoder weights, whereas the original Vicuna is a pure text model. I also changed some settings in the vicuna_ltuas directory.
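For reference, the two-step trick in the finetune.py snippet above amounts to something like the following. This is an illustrative sketch only: the real code uses the repo's bundled, modified transformers/peft copies, and the start_model path is just an example.

  # Step 1 builds the model from vicuna_ltuas (which has the right architecture, config,
  # and pretrained audio-encoder weights); step 2 overwrites it with the checkpoint you
  # actually want to start from, which is a single state-dict .bin file.
  import torch
  from transformers import LlamaForCausalLM   # the repo actually uses its bundled, modified copy

  base_model = "../../../pretrained_mdls/vicuna_ltuas/"             # HF-style model directory
  start_model = "../../../pretrained_mdls/ltuas_long_noqa_a6.bin"   # example .bin state dict

  # Step 1: instantiate from vicuna_ltuas; plain Vicuna would lack the audio-encoder
  # weights and the adjusted settings mentioned above.
  model = LlamaForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16)

  # Step 2: load the real checkpoint on top; strict=False tolerates keys that exist
  # on only one side of the two weight sets.
  state_dict = torch.load(start_model, map_location="cpu")
  model.load_state_dict(state_dict, strict=False)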

P.S. I also suggest creating a new issue for questions unrelated to the title, so other people can find them more easily.

Thank you so much for your reply! Really appreciate it!