Error reading tags with Unicode in them
Kayliii opened this issue · comments
Kayliii commented
The dataset I am using to build the tag database and tags.txt has some letters that deepdanbooru crashes on. Specifically in my case, it does not like the letter ō, which produces the following error (abbreviated to show the relevant part):
  File "C:\Users\Kayli\AppData\Local\Programs\Python\Python310\lib\site-packages\deepdanbooru\data\dataset.py", line 7, in <genexpr>
    tags = [tag for tag in (tag.strip() for tag in tags_stream) if tag]
  File "C:\Users\Kayli\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 6620: character maps to <undefined>
ō is a single character, encoded in UTF-8 as the two bytes c5 8d. The traceback shows the file being decoded with cp1252 instead. If the decoder gets to 8d without understanding that it is part of the previous character, something has already gone wrong.
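To make the byte-level picture concrete, here is a minimal sketch that reproduces the same failure outside deepdanbooru. It only uses Python's built-in codecs; the variable names are illustrative and not taken from the project.

```python
# "ō" (U+014D) is two bytes in UTF-8.
encoded = "ō".encode("utf-8")
print(encoded.hex(" "))  # c5 8d

# Decoding those bytes as UTF-8 round-trips correctly.
decoded_utf8 = encoded.decode("utf-8")

# cp1252 decodes one byte at a time: 0xc5 maps to "Å",
# but 0x8d is unassigned in cp1252, so decoding fails there.
try:
    encoded.decode("cp1252")
    cp1252_error = None
except UnicodeDecodeError as exc:
    cp1252_error = exc

print(cp1252_error)
```

The second print shows the same 'charmap' error as the traceback, which suggests the file itself is fine and only the codec used to read it is wrong.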
Kichang Kim commented
It may be a text file encoding issue.
If you can modify the Python code, try this fix:
def load_tags(tags_path):
    with open(tags_path, "r") as tags_stream:
        tags = [tag for tag in (tag.strip() for tag in tags_stream) if tag]
        return tags
to
def load_tags(tags_path):
    with open(tags_path, "r", encoding="utf-8") as tags_stream:
        tags = [tag for tag in (tag.strip() for tag in tags_stream) if tag]
        return tags
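A quick way to check the change locally is the sketch below. It assumes tags.txt is actually saved as UTF-8; the sample tags and the temporary file are hypothetical and only there to exercise the function. Without an explicit `encoding`, `open()` falls back to the locale's preferred encoding, which is cp1252 on many Windows setups, and that matches the traceback above.

```python
import os
import tempfile


def load_tags(tags_path):
    # Explicit encoding, so the result does not depend on the OS locale.
    with open(tags_path, "r", encoding="utf-8") as tags_stream:
        return [tag for tag in (line.strip() for line in tags_stream) if tag]


# Hypothetical sample data: one tag contains "ō", and one line is blank.
sample = "1girl\n\nryōko\nsolo\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    tags_path = os.path.join(tmp_dir, "tags.txt")
    with open(tags_path, "w", encoding="utf-8") as f:
        f.write(sample)
    tags = load_tags(tags_path)

print(tags)  # ['1girl', 'ryōko', 'solo']
```

If the file was written by another tool in a different encoding, this will still fail or produce wrong characters, so it is worth confirming how tags.txt was saved.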