QwenLM / Qwen-Audio

The official repo of Qwen-Audio (通义千问-Audio) chat & pretrained large audio language model proposed by Alibaba Cloud.

low gender classification accuracy

yl4579 opened this issue

The model does not even seem able to get the gender right. A few samples:

# Same question as on the project homepage.
question = 'Recognize the gender, age, accent, emotion, and speaking content of the person in the audio, and combine these to answer his/her questions while explaining the reasons for these answers.'
query = tokenizer.from_list_format([
    {'audio': audio},
    {'text': question},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
  • https://vocaroo.com/11gfffDeXNmQ output: The speaker of this audio is a man speaking, in a English, saying, "you know as well as i do the kind of life you offer her.".
  • https://vocaroo.com/12xbA5EZX60M output: The audio is of a man speaking, in a neutral emotion, saying, "he says no word of happiness.".
  • https://vocaroo.com/19cMpEhrfHye output: The audio is of a man speaking, in a neutral emotion, saying, "the boy‘s face was very pale as he dropped his hands from penny’s shoulders ; but dundee, from behind the portieres, was not troubling to spy for the moment.".
  • https://vocaroo.com/12hdkCS6fhYx output: The audio is of a woman speaking, in a neutral emotion, saying, "when zarathustra once told this to his disciples they asked him, and what, o zarathustra, is the moral of thy story? and zarathustra answered them thus.".

The gender classification accuracy with this model, which is around 75%, is lower than that of a simple F0 cutoff.
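
For anyone who wants to reproduce the comparison, here is a rough sketch of what I mean by a "simple F0 cutoff" baseline. It is only an illustration, not the exact script behind the number above: the autocorrelation pitch estimator, the 165 Hz cutoff, the 75-400 Hz search range, and the helper names (`sine_frame`, `estimate_f0`, `classify_gender_by_f0`) are all assumptions chosen for the example. On real recordings you would run this over voiced frames and take something like the median F0 per utterance.

```python
import math


def sine_frame(freq_hz, sample_rate=16000, n_samples=2048):
    """Hypothetical helper: a pure tone standing in for one voiced speech frame."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n_samples)]


def estimate_f0(frame, sample_rate=16000, fmin=75.0, fmax=400.0):
    """Estimate F0 in Hz from one frame using the autocorrelation peak.

    Returns 0.0 if no positive autocorrelation peak is found.
    """
    n = len(frame)
    min_lag = max(1, int(sample_rate / fmax))
    max_lag = min(int(sample_rate / fmin), n - 2)

    # Autocorrelation at each candidate lag, averaged over the overlap length.
    scores = {}
    for lag in range(min_lag, max_lag + 1):
        total = sum(frame[i] * frame[i + lag] for i in range(n - lag))
        scores[lag] = total / (n - lag)
    if not scores:
        return 0.0

    peak = max(scores.values())
    if peak <= 0.0:
        return 0.0

    # Take the shortest lag that is a local maximum close to the global peak.
    # This avoids picking a multiple of the true period (octave error).
    for lag in range(min_lag + 1, max_lag):
        s = scores[lag]
        if s >= 0.9 * peak and s >= scores[lag - 1] and s >= scores[lag + 1]:
            return sample_rate / lag

    # Fallback: the lag with the highest score.
    best_lag = max(scores, key=scores.get)
    return sample_rate / best_lag


def classify_gender_by_f0(f0_hz, cutoff_hz=165.0):
    """Label by a fixed F0 threshold. The 165 Hz cutoff is an assumed value."""
    return "man" if f0_hz < cutoff_hz else "woman"


if __name__ == "__main__":
    for freq in (120.0, 220.0):
        f0 = estimate_f0(sine_frame(freq))
        print(freq, round(f0, 1), classify_gender_by_f0(f0))
```

The cutoff is the only tunable part of the baseline, so in practice it would be picked on a small labeled set rather than hard-coded.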