FlagAI-Open / FlagAI

Description

FlagAI/flagai/model/predictor/aquila_server.py

Line 89 in c006db0

if any('�' != c for c in tmp):

Is the condition on this line wrong?
Token id 41209 decodes to " �" (it contains a space, so it slips past this check).
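To make the report concrete, here is a minimal sketch of why the quoted condition lets " �" through. The helper names original_check and stricter_check are hypothetical and exist only for this illustration; stricter_check is one possible alternative, not the fix adopted in the repository.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, the character a decoder emits for undecodable bytes


def original_check(tmp: str) -> bool:
    # Condition quoted from the issue: true if ANY character is not U+FFFD.
    return any(REPLACEMENT != c for c in tmp)


def stricter_check(tmp: str) -> bool:
    # Hypothetical alternative: true only if NO character is U+FFFD.
    return REPLACEMENT not in tmp


partial = " " + REPLACEMENT  # what id 41209 is reported to decode to
print(original_check(partial))  # the leading space alone satisfies any(...)
print(stricter_check(partial))  # the replacement character is still detected
```

Because any(...) only needs one ordinary character, a leading space is enough to make the text look fully decoded.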

Alternatives

No response

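Hello, we took a look and there is indeed a problem. This if check is meant to handle the case where the tokenizer cannot decode the output, so the garbled characters should be replaced with an empty string, as follows: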

if '�' in tmp:
    next_token_list.append(next_token.cpu().numpy()[0])
    tmp = tokenizer.decode(next_token_list)
    if any('�' != c for c in tmp):
        next_token_list = []
        res_list += tmp.replace("�", "")
        if len(res_list) >= 10:
            print(res_list)
            yield res_list
            res_list = ""

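My understanding of the garbled characters is that some characters are rare, so several ids together decode into one normal character, while decoding a single id on its own yields a garbled character. For example, [41209, 240, 236] finally decodes to " 👍" (with a leading space). So I think the code should not enter the if statement and clear next_token_list at this point. The odd part is that 41209 decodes to a space plus �, so if any(...) evaluates to true and the if branch is entered. That is the source of the problem.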
@920232796
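The effect described above can be reproduced without any tokenizer. The sketch below assumes a byte-level vocabulary, where one character may be spread over several ids, and uses raw UTF-8 bytes as a stand-in for those ids. It is not the Aquila tokenizer, and the byte values are not the ids from the thread.

```python
# One space followed by a 4-byte character: b' \xf0\x9f\x91\x8d'
thumbs_up = " 👍".encode("utf-8")

# Decode growing prefixes, the way a streaming loop sees the output arrive.
prefixes = [
    thumbs_up[:n].decode("utf-8", errors="replace")
    for n in range(1, len(thumbs_up) + 1)
]

for n, text in enumerate(prefixes, start=1):
    # Show each partial result and whether it still contains U+FFFD.
    print(n, repr(text), "\ufffd" in text)
```

Every prefix that stops in the middle of the emoji contains U+FFFD after the space, and only the full sequence decodes cleanly, which is why the pending ids must be kept rather than cleared.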

Thanks. Your case is indeed one where three ids are needed before decoding works. What do you think of changing it like this?

if len(next_token_list) == 0:
    tmp = tokenizer.decode(next_token.tolist())
else:
    next_token_list.append(next_token.cpu().numpy()[0])
    tmp = tokenizer.decode(next_token_list)

if '�' in tmp and len(next_token_list) < 5:
    if len(next_token_list) == 0:
        next_token_list.append(next_token.cpu().numpy()[0])
else:
    # print(tmp)
    next_token_list = []
    res_list += tmp
    if len(res_list) >= 10:
        print(res_list)
        yield res_list
        res_list = ""

tokens[:, cur_pos] = next_token
prev_pos = cur_pos

The main idea is to use next_token_list for control: if tmp contains a garbled character, keep appending to next_token_list until tmp no longer contains one.
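The buffering idea can be sketched as a standalone generator. This is a simplified illustration under stated assumptions, not the code in aquila_server.py: stream_decode, toy_decode, and max_pending are hypothetical names, the toy tokenizer treats each id as one UTF-8 byte, and the 10-character batching of res_list from the snippet above is left out.

```python
from typing import Callable, Iterable, Iterator, List

REPLACEMENT = "\ufffd"  # U+FFFD marks output that is not yet decodable


def stream_decode(
    token_ids: Iterable[int],
    decode: Callable[[List[int]], str],
    max_pending: int = 5,
) -> Iterator[str]:
    """Yield decoded text, holding back ids whose decoding is still incomplete."""
    pending: List[int] = []
    for token_id in token_ids:
        pending.append(token_id)
        text = decode(pending)
        if REPLACEMENT in text and len(pending) < max_pending:
            continue  # wait for more ids before emitting anything
        yield text  # clean text, or we stopped waiting after max_pending ids
        pending = []
    if pending:
        yield decode(pending)  # flush whatever is left when the stream ends


def toy_decode(ids: List[int]) -> str:
    # Stand-in tokenizer (assumption): each id is one UTF-8 byte.
    return bytes(ids).decode("utf-8", errors="replace")


ids = list("ok 👍!".encode("utf-8"))
chunks = list(stream_decode(ids, toy_decode))
print(chunks)
```

The max_pending cap mirrors the len(next_token_list) < 5 guard in the proposal: without it, an id sequence that never becomes decodable would stall the stream indefinitely.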

[Question]: Garbled characters in streaming generation
