quodlibet / mutagen

Python module for handling audio metadata

Home Page: https://mutagen.readthedocs.io

[bug] Encoding.LATIN1 returns wrong text with polish letters

superpawko opened this issue

Maybe I'm wrong; I'm not very good at coding. Could you help me with this?
After running:
from mutagen.id3 import ID3
tags = ID3(mp3, v2_version=3)
print(tags.getall("TIT2"))
I got this:
[TIT2(encoding=<Encoding.LATIN1: 0>, text=['Uciekaj¹ca ska³a'])]

In the Mp3tag program everything looks fine (I see the Polish characters: Uciekająca skała).
The tag is ID3v2.3 (ID3v1 ID3v2.3).

'TPE1': TPE1(encoding=<Encoding.LATIN1: 0>, text=['Roman Felczyñski']) is also broken. I have many files like this and I have no clue how to fix them. Thank you for your help.

Full tags object:
{'TIT2': TIT2(encoding=<Encoding.LATIN1: 0>, text=['Uciekaj¹ca ska³a']), 'PRIV:WM/MediaClassPrimaryID:¼}Ñ#ãâK\x86¡H¤*(D\x1e': PRIV(owner='WM/MediaClassPrimaryID', data=b'\xbc}\xd1#\xe3\xe2K\x86\xa1H\xa4*(D\x1e'), 'PRIV:WM/MediaClassSecondaryID:\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00': PRIV(owner='WM/MediaClassSecondaryID', data=b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'), 'TCON': TCON(encoding=<Encoding.LATIN1: 0>, text=['Przygodowy']), 'POPM:Windows Media Player 9 Series': POPM(email='Windows Media Player 9 Series', rating=255), 'TPE1': TPE1(encoding=<Encoding.LATIN1: 0>, text=['Roman Felczyñski'])}

"ą" and "ł" cannot actually be encoded in latin-1 / ISO-8859-1, see https://en.wikipedia.org/wiki/ISO/IEC_8859-1 . I don't know which encoding Mp3tag is using there; I could not reproduce the exact outcome. Logical choices for Polish letters would be ISO-8859-2 or, on Windows, maybe Windows-1250. But these would give:

>>> s = "Uciekająca skała"
>>> s.encode('iso-8859-2').decode('latin-1')
'Uciekaj±ca ska³a'
>>> s.encode('windows-1250').decode('latin-1')
'Uciekaj¹ca ska³a'

So a slightly different result from yours. Either way, neither of them is latin-1.
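To make the byte-level picture concrete, here is a minimal sketch that lists which single byte each codec uses for "ą" and "ł". The helper name `byte_table` is made up for this illustration; it only uses the standard `str.encode` call.

```python
def byte_table(chars, codecs):
    """Map each character to its single-byte value per codec (None if unencodable)."""
    table = {}
    for ch in chars:
        row = {}
        for codec in codecs:
            try:
                row[codec] = ch.encode(codec)[0]
            except UnicodeEncodeError:
                # The codec has no byte for this character (the latin-1 case).
                row[codec] = None
        table[ch] = row
    return table


table = byte_table("ął", ["latin-1", "iso-8859-2", "windows-1250"])
for ch, row in table.items():
    print(ch, {codec: (None if b is None else hex(b)) for codec, b in row.items()})
```

When a byte such as 0xB9 or 0xB3 is later read back as latin-1, it shows up as "¹" or "³", which is the kind of garbling visible in the tag dump.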

Is there any specific reason you can't use a Unicode encoding for the files?

I tried decoding and encoding myself before posting, and I got some errors. I don't know how to load it correctly or fix it. I have a database of a few TB, and a lot of files have this problem. Mp3tag reads it correctly. I thought maybe something in tags = ID3(mp3, v2_version=3) is not correct. Or can I fix it somehow afterwards?

The Windows MP3 details view also shows the proper title and album name with Polish letters.

edit: is this the same problem as #354?

I found that this is not Latin-1 but Windows-1250.
I'm able to fix it with this code:
utitle = tags["TIT2"][0].encode('utf-8').decode('windows-1250').replace(u"Â", "")
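A possibly more robust variant of this repair, sketched under the assumption that the bytes in the file really are Windows-1250: encode the string back with latin-1 (which recovers the original bytes exactly) and decode those bytes as Windows-1250. The function names `redecode` and `utf8_detour` are made up for this comparison; the second one reproduces the UTF-8 one-liner from the post.

```python
def redecode(text, actual="windows-1250", assumed="latin-1"):
    """Undo a wrong decode: recover the original bytes, then decode them properly."""
    return text.encode(assumed).decode(actual)


def utf8_detour(text):
    """The workaround from the post, kept only for comparison."""
    return text.encode("utf-8").decode("windows-1250").replace("Â", "")


for garbled in ("Uciekaj¹ca ska³a", "Roman Felczyñski"):
    print(repr(redecode(garbled)), repr(utf8_detour(garbled)))
```

Both functions agree on the title, but the UTF-8 detour only works for characters below U+00C0, whose UTF-8 form starts with the byte 0xC2 (shown as "Â"). For "ñ" the lead byte is 0xC3 instead, so the artist name stays garbled with that approach, while the latin-1 round trip handles it.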

But I have no clue how to detect it for the rest of the files, because it only affects ID3v1 files with Latin-1 encoding. How can I check whether TIT2 is encoded as Latin-1?

edit2: maybe this code:
str(tags.getall("TIT2")).find("encoding=<Encoding.LATIN1")

But I still think mutagen could handle this better; Mp3tag does.

edit: is this the same problem as #354?

It's not. ID3v2 stores the encoding of each text frame in the file, and that stored encoding is likely wrong in your case.

edit2: maybe this code:
str(tags.getall("TIT2")).find("encoding=<Encoding.LATIN1")

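tags["TIT2"].encoding == id3.Encoding.LATIN1 should work (after "from mutagen import id3"). Note that the encoding belongs to the frame; tags["TIT2"][0] is already a plain string.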
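Putting the encoding check and the re-decoding together, a batch repair could look roughly like the sketch below. It is deliberately written against any object with `encoding` and `text` attributes, which is how the text frames appear in the dump above, and it uses small stand-in objects instead of real mutagen frames so it runs without any files. The function name `repair_latin1_frames`, the stand-ins, and the choice of Windows-1250 as the real encoding are all assumptions of this example.

```python
from types import SimpleNamespace

LATIN1 = 0  # integer value shown in the reprs above: <Encoding.LATIN1: 0>
UTF16 = 1   # ID3v2 text-encoding byte for UTF-16, which ID3v2.3 can store


def repair_latin1_frames(frames, actual="windows-1250", new_encoding=UTF16):
    """Re-decode the text of frames labelled Latin-1; return the frames changed."""
    changed = []
    for frame in frames:
        if getattr(frame, "encoding", None) != LATIN1:
            continue  # no encoding attribute (e.g. PRIV, POPM) or not Latin-1
        texts = getattr(frame, "text", None)
        if not isinstance(texts, list) or not all(isinstance(t, str) for t in texts):
            continue  # only handle plain lists of strings
        try:
            fixed = [t.encode("latin-1").decode(actual) for t in texts]
        except UnicodeError:
            continue  # bytes that are not valid in the assumed real encoding
        if fixed != texts:
            frame.text = fixed
            frame.encoding = new_encoding  # Latin-1 cannot hold the repaired text
            changed.append(frame)
    return changed


# Stand-ins mimicking the frames from the tag dump (not real mutagen objects).
frames = [
    SimpleNamespace(encoding=LATIN1, text=["Uciekaj¹ca ska³a"]),
    SimpleNamespace(encoding=LATIN1, text=["Przygodowy"]),
    SimpleNamespace(encoding=LATIN1, text=["Roman Felczyñski"]),
    SimpleNamespace(owner="WM/MediaClassPrimaryID", data=b"\xbc}"),
]
changed = repair_latin1_frames(frames)
print([f.text for f in changed])
```

With real tags, the frames would presumably come from tags.values(), new_encoding would be id3.Encoding.UTF16, and the result would be written back with tags.save(v2_version=3); that wiring is not tested here, so try it on copies of the files first.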

But I still think mutagen could handle this better; Mp3tag does.

mutagen currently doesn't second-guess encodings.

For starters, we could add something to the docs with some examples, though.
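As an illustration of what second-guessing an encoding would involve, and why it is risky, here is a small heuristic sketch. It is not something mutagen does; `guess_cp1250` and the `POLISH` letter set are invented for this example. It re-reads a Latin-1 string as Windows-1250 and accepts the result only if that turns some characters into Polish letters.

```python
# Polish letters whose byte differs between the two codecs; "ó" is left out
# because it is the same byte (0xF3) in latin-1 and windows-1250.
POLISH = set("ąćęłńśźżĄĆĘŁŃŚŹŻ")


def guess_cp1250(text):
    """Return the Windows-1250 reading of text if it gains Polish letters, else None."""
    try:
        candidate = text.encode("latin-1").decode("windows-1250")
    except UnicodeError:
        return None  # not pure latin-1, or bytes undefined in windows-1250
    # Both codecs are single-byte, so the two strings line up character by character.
    gained = sum(1 for a, b in zip(text, candidate) if a != b and b in POLISH)
    return candidate if gained else None


for sample in ("Uciekaj¹ca ska³a", "Przygodowy", "Señor"):
    print(repr(sample), "->", repr(guess_cp1250(sample)))
```

The last sample shows the catch: a genuinely Latin-1 Spanish title is "repaired" into the wrong text, so a guess like this is only reasonable when you already know the collection is Polish.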