KichangKim / DeepDanbooru

The dataset I am using to build the tag database and tags.txt contains some characters that DeepDanbooru crashes on. In my case it fails on the letter ō, which produces the following error (abbreviated to show the relevant part):

  File "C:\Users\Kayli\AppData\Local\Programs\Python\Python310\lib\site-packages\deepdanbooru\data\dataset.py", line 7, in <genexpr>
    tags = [tag for tag in (tag.strip() for tag in tags_stream) if tag]
  File "C:\Users\Kayli\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 6620: character maps to <undefined>

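ō is a single character that UTF-8 encodes as the two bytes c5 8d. If the decoder gets to 8d without understanding that it is part of the previous character, something has already gone wrong: the traceback shows the file being decoded as cp1252 rather than UTF-8.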
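To make the byte-level point concrete, here is a minimal sketch that reproduces the same failure without DeepDanbooru or any file at all. It only uses built-in string methods; the variable names are mine, not from the project.

```python
# "\u014d" is the letter o with macron (the character from the report).
encoded = "\u014d".encode("utf-8")
print(encoded)  # the two bytes c5 8d

# cp1252 decodes one byte at a time: 0xc5 maps to a character,
# but 0x8d has no mapping, so decoding fails on the second byte.
try:
    encoded.decode("cp1252")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc
    print(exc)

# Decoding with the encoding the bytes were written in works fine.
decoded = encoded.decode("utf-8")
print(decoded)
```

The error printed here should match the last line of the traceback apart from the byte position, which depends on where the character sits in the file.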

It may be a text file encoding issue.
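One way to check that, as a rough sketch: print the encoding Python falls back to when open() gets no encoding argument, and test whether the file's raw bytes are valid UTF-8. `find_utf8_error` is a hypothetical helper written for this comment, not part of DeepDanbooru, and the two throwaway files only stand in for a real tags.txt.

```python
import locale
import os
import tempfile

# What open() uses in Python 3.10 when no encoding argument is given.
print(locale.getpreferredencoding(False))

def find_utf8_error(path):
    """Hypothetical helper: return None if the file is valid UTF-8,
    otherwise the UnicodeDecodeError describing the first bad byte."""
    with open(path, "rb") as stream:
        data = stream.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc
    return None

# Demo on two throwaway files; point the helper at your own tags.txt instead.
with tempfile.TemporaryDirectory() as tmp_dir:
    good_path = os.path.join(tmp_dir, "good.txt")
    bad_path = os.path.join(tmp_dir, "bad.txt")
    with open(good_path, "wb") as f:
        f.write(b"t\xc5\x8dky\xc5\x8d\n")  # valid UTF-8
    with open(bad_path, "wb") as f:
        f.write(b"caf\xe9\n")              # Latin-1 bytes, not valid UTF-8
    good_result = find_utf8_error(good_path)
    bad_result = find_utf8_error(bad_path)

print(good_result)
print(bad_result)
```

If the helper returns None for tags.txt, the file itself is fine and the problem is only the encoding used to read it.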

If you can modify the Python code, try this fix:

In DeepDanbooru/deepdanbooru/data/dataset.py (line 6 in 05eb3c3), change

def load_tags(tags_path):
    with open(tags_path, "r") as tags_stream:
        tags = [tag for tag in (tag.strip() for tag in tags_stream) if tag]
        return tags

to

def load_tags(tags_path):
    with open(tags_path, "r", encoding="utf-8") as tags_stream:
        tags = [tag for tag in (tag.strip() for tag in tags_stream) if tag]
        return tags
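A quick way to try the patched function in isolation, assuming tags.txt is saved as UTF-8. The tag names and the temporary file below are made up for the test; only `load_tags` mirrors the code above.

```python
import os
import tempfile

def load_tags(tags_path):
    # Same logic as the patched function: decode explicitly as UTF-8
    # instead of relying on the platform default (cp1252 in the traceback).
    with open(tags_path, "r", encoding="utf-8") as tags_stream:
        tags = [tag for tag in (tag.strip() for tag in tags_stream) if tag]
        return tags

with tempfile.TemporaryDirectory() as tmp_dir:
    tags_path = os.path.join(tmp_dir, "tags.txt")
    # Write a small UTF-8 tag file containing the problem character
    # and one blank line, which load_tags should skip.
    with open(tags_path, "w", encoding="utf-8") as f:
        f.write("1girl\nt\u014dky\u014d\n\nsolo\n")
    loaded = load_tags(tags_path)

print(loaded)
```

If this works but your real tags.txt still fails with a UTF-8 decode error, the file was probably saved in some other encoding and needs to be re-saved as UTF-8.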

Error reading tags with Unicode in them