Text recognition: RecLmdbDataset has serious bugs
wkailiu opened this issue
wkailiu commented
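Bug 1: The alphabet file is never read; the characters of the file name are used as the alphabet instead
Replace the existing code
with: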
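# config.alphabet is a path: read the file contents, one character per line
with open(config.alphabet, 'r', encoding='utf-8') as file:
    alphabet = ''.join([s.strip('\n') for s in file.readlines()])
alphabet += ' '  # add the space character
self.str2idx = {c: i for i, c in enumerate(alphabet)}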
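To make Bug 1 concrete, here is a minimal sketch that compares enumerating the path string with enumerating the file contents. It assumes an alphabet file with one character per line; `build_str2idx` and the temporary `alphabet.txt` are hypothetical names used only for this illustration and are not part of PytorchOCR.

```python
import os
import tempfile


def build_str2idx(alphabet_path):
    """Read one character per line from alphabet_path and map each to an index."""
    with open(alphabet_path, 'r', encoding='utf-8') as file:
        alphabet = ''.join(s.strip('\n') for s in file.readlines())
    alphabet += ' '  # append the space character, as in the fix above
    return {c: i for i, c in enumerate(alphabet)}


# Hypothetical alphabet file, created only for this demonstration.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'alphabet.txt')
    with open(demo_path, 'w', encoding='utf-8') as f:
        f.write('a\nb\nc\n')

    # Buggy behaviour: enumerating the path string indexes the
    # characters of the file name, not the alphabet.
    buggy_str2idx = {c: i for i, c in enumerate(demo_path)}

    # Fixed behaviour: the mapping is built from the file contents.
    fixed_str2idx = build_str2idx(demo_path)
```

With the buggy mapping, any label character that does not happen to appear in the path cannot be looked up at all, and those that do appear get indices unrelated to the alphabet.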
Bug 2: self.dict is missing the space character
Replace
PytorchOCR/torchocr/utils/label_convert.py
Lines 20 to 25 in 8089322
with:
dict_character.append(" ")
self.dict = {}
for i, char in enumerate(dict_character):
    # NOTE: 0 is reserved for 'blank' token required by CTCLoss
    self.dict[char] = i + 1
# TODO: replace ' ' with a special symbol
self.character = ['[blank]'] + dict_character  # dummy '[blank]' token for CTCLoss (index 0)
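The effect of the Bug 2 fix can be checked with a small round trip. This is only a sketch: `DemoLabelConverter` and its `encode` method are hypothetical stand-ins, not the actual class in label_convert.py, and the three-letter alphabet is made up for the example.

```python
class DemoLabelConverter:
    """Hypothetical stand-in for the converter in label_convert.py."""

    def __init__(self, characters):
        dict_character = list(characters)
        dict_character.append(" ")  # the fix: make the space character encodable
        self.dict = {}
        for i, char in enumerate(dict_character):
            # Index 0 is reserved for the CTC 'blank' token.
            self.dict[char] = i + 1
        # Position in this list equals the index used in self.dict.
        self.character = ['[blank]'] + dict_character

    def encode(self, text):
        """Map each character of text to its index (raises KeyError if unknown)."""
        return [self.dict[char] for char in text]


converter = DemoLabelConverter("abc")
encoded = converter.encode("a b")
decoded = ''.join(converter.character[i] for i in encoded)
```

Without the appended space, `encode("a b")` in this sketch would raise a `KeyError` for the space character, which is the symptom described in Bug 2.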