QwenLM / Qwen-Audio

The official repo of Qwen-Audio (通义千问-Audio) chat & pretrained large audio language model proposed by Alibaba Cloud.


Question about Output Instruction

jwang1993 opened this issue · comments

Hello, the paper mentions: "Output Instruction: Lastly, we provide output instruction to further specify the task and desired format for different subtasks, and then the text output begins."

How are the following Output Instructions used during training and inference?
My understanding is that the Output Instruction is placed at the end of the prompt, for example:
query = f"{audio_url}{sp_prompt}"
where sp_prompt is "<|startofanalysis|><|unknown|><|keyword|><|zh|><|notimestamps|><|wo_itn|><|audioset_ontology|>"
Is this understanding correct?
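A minimal sketch of the prompt assembly described above, based only on the example in this question; the token order and the audio-tag format are assumptions, not confirmed by the Qwen-Audio authors:

```python
def build_query(audio_url: str, sp_prompt: str) -> str:
    """Concatenate the audio tag and the task/output-instruction tokens,
    with the Output Instruction tokens at the end of the prompt."""
    return f"{audio_url}{sp_prompt}"

# Hypothetical keyword-extraction prompt, as in the question above:
sp_prompt = ("<|startofanalysis|><|unknown|><|keyword|><|zh|>"
             "<|notimestamps|><|wo_itn|><|audioset_ontology|>")
audio_url = "<audio>assets/test.wav</audio>"  # placeholder path, illustrative only
query = build_query(audio_url, sp_prompt)
```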

Output Instruction

        "<|caption_audiocaps|>",  # Audiocaps caption style
        "<|caption_clotho|>",  # Clotho caption style
        "<|audioset_ontology|>",  # Audioset ontology style
        "<|caption_plain|>",  # plain caption
        "<|itn|>",  # inversed text normalized
        "<|wo_itn|>",  # without inversed text normalized
        "<|startofentityvalue|>",
        "<|endofentityvalue|>",
        "<|startofentitytype|>",
        "<|endofentitytype|>",
        "<|named_entity_recognition|>",  # named entity recognition task
        "<|audio_grounding|>",
        "<|startofword|>",
        "<|endofword|>",
        "<|delim|>",  # delimiter of timestamps pair in audio grounding
        "<|emotion_recognition|>",  # emotion recognition
        "<|music_description|>",  # music description
        "<|note_analysis|>",  # note analysis
        "<|pitch|>",  # note analysis: pitch
        *[f"<|midi_pitch_{i}|>" for i in range(128)],  # midi pitch 0-127
        "<|velocity|>",  # note analysis: velocity
        *[f"<|midi_velocity_{i}|>" for i in range(128)],  # midi velocity 0-127
        "<|sonic|>",  # note analysis:  sonic
        "<|instrument|>",  # note analysis:  instrument
        "<|speaker_meta|>",  # meta information of speaker
        "<|song_meta|>",  # meta information of song
        "<|question|>",  # AQA: question
        "<|answer|>",  # AQA: answer
        "<|choice|>",  # AQA: answer choice
        "<|scene|>",  # scene recognition
        "<|event|>",  # sound event
        "<|vocal_classification|>",  # vocal classification
        "<|speech_understanding|>",  # speech language understanding
        "<|scenario|>",  # speech language understanding: scenario
        "<|action|>",  # speech language understanding: action
        "<|entities|>",  # speech language understanding: entities
        "<|speech_edit|>",  # speech edit

'{}<|startofanalysis|><|unknown|><|caption|><|en|><|notimestamps|><|caption_{}|>'
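The caption template above has two fill-in slots: the audio tag and the caption style (e.g. `audiocaps` or `clotho`, matching the `<|caption_audiocaps|>` / `<|caption_clotho|>` tokens in the list). A minimal sketch of filling it, with a hypothetical audio path:

```python
# Template quoted from the snippet above; two "{}" slots are filled via str.format.
template = ('{}<|startofanalysis|><|unknown|><|caption|><|en|>'
            '<|notimestamps|><|caption_{}|>')

audio_tag = "<audio>assets/example.wav</audio>"  # hypothetical path, illustrative only
prompt = template.format(audio_tag, "audiocaps")  # selects the Audiocaps caption style
```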