KichangKim / DeepDanbooru

AI based multi-label girl image classification system, implemented by using TensorFlow.

Error reading tags with Unicode in them

Kayliii opened this issue

The dataset I am using to build the tag database and tags.txt contains some characters that DeepDanbooru crashes on. In my case it is the letter ō, which produces the following error (abbreviated to show the relevant part):

  File "C:\Users\Kayli\AppData\Local\Programs\Python\Python310\lib\site-packages\deepdanbooru\data\dataset.py", line 7, in <genexpr>
    tags = [tag for tag in (tag.strip() for tag in tags_stream) if tag]
  File "C:\Users\Kayli\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 6620: character maps to <undefined>

ō is a single character encoded in UTF-8 as the two bytes c5 8d. If the decoder gets to 8d without understanding that it is part of the previous character, something has already gone wrong.
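
To make the byte-level picture concrete, here is a minimal sketch that decodes those two bytes with both codecs. The helper name `try_decode` is made up for illustration and is not part of DeepDanbooru.

```python
# ō is U+014D; written as an escape so the snippet does not depend on
# the encoding of the source file itself.
encoded = "\u014d".encode("utf-8")
print(encoded.hex(" "))


def try_decode(data: bytes, encoding: str) -> str:
    """Hypothetical helper: decode, or report why decoding failed."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"failed: {exc.reason}"


# UTF-8 reads both bytes as one character.
utf8_result = try_decode(encoded, "utf-8")

# cp1252 treats each byte as its own character and has no mapping for 0x8d.
cp1252_result = try_decode(encoded, "cp1252")

print(utf8_result)
print(cp1252_result)
```

So the file itself is probably fine. The failure comes from reading UTF-8 bytes with a single-byte codec.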

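It may be a text file encoding issue. The traceback shows Python decoding the file with cp1252, which is what open() falls back to on many Windows setups when no encoding is given, rather than UTF-8.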

If you can modify the Python code, test this fix. In deepdanbooru/data/dataset.py, change

def load_tags(tags_path):
    with open(tags_path, "r") as tags_stream:
        tags = [tag for tag in (tag.strip() for tag in tags_stream) if tag]
        return tags

to

def load_tags(tags_path):
    with open(tags_path, "r", encoding="utf-8") as tags_stream:
        tags = [tag for tag in (tag.strip() for tag in tags_stream) if tag]
        return tags
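
A quick way to check the fix without touching a real dataset is to write a small UTF-8 tags file and load it both ways. This is only a sketch: the sample tags are made up, and the `encoding` parameter on `load_tags` is an illustrative addition, not something DeepDanbooru provides.

```python
import os
import tempfile


def load_tags(tags_path, encoding="utf-8"):
    # Same logic as the fixed function; the `encoding` parameter is an
    # assumption added here so both codecs can be tried.
    with open(tags_path, "r", encoding=encoding) as tags_stream:
        return [tag for tag in (line.strip() for line in tags_stream) if tag]


# Hypothetical tags, one of them containing ō (U+014D).
sample_tags = ["1girl", "sh\u014djo", "solo"]

with tempfile.TemporaryDirectory() as tmp_dir:
    tags_path = os.path.join(tmp_dir, "tags.txt")
    with open(tags_path, "w", encoding="utf-8") as tags_file:
        tags_file.write("\n".join(sample_tags) + "\n")

    # Reading as UTF-8 returns the tags unchanged.
    utf8_tags = load_tags(tags_path)

    # Reading as cp1252 reproduces the error from the traceback.
    try:
        load_tags(tags_path, encoding="cp1252")
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True

print(utf8_tags, cp1252_failed)
```

If the tags file was saved in some other encoding, the explicit UTF-8 read will fail too, so it is worth confirming how the file was written.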