idlesign / torrentool

The tool to work with torrent files.

Home Page: https://github.com/idlesign/torrentool

Incorrectly parsed Unicode char

euamotubaina opened this issue

This torrent file from a private tracker has a file path that includes a Unicode character which is being parsed incorrectly:

U+00BD, chr(189), Vulgar Fraction One Half (½)

I noticed it because after loading the file with the Torrent class, the calculated info_hash was different from the original torrent.

Screenshots, taken in a hex editor, of the original torrent file and of a new one created with Torrent.to_file from the same data:

Original:
Screenshot 2023-09-26 142545

Created with the Torrent class:
Screenshot 2023-09-26 142612

When using the Bencode class to read and write the torrent, the char is correctly parsed and the hashes match.
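
The hash mismatch described above can be reproduced without the original file. The following is a minimal sketch using only the standard library, not torrentool's own code: the `bencode`, `make_info` and `info_hash` helpers and the sample info dictionary are hypothetical, and the path bytes `1\xbd.txt` are an assumption about what the torrent contains. It relies only on the fact that an info hash is the SHA-1 of the bencoded info dictionary, so any change to the path bytes changes the hash.

```python
import hashlib


def bencode(value):
    """Minimal bencode encoder (hypothetical helper, not torrentool's)."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(item) for item in value) + b"e"
    if isinstance(value, dict):
        # Bencoded dictionaries store their keys in sorted order.
        pairs = sorted(value.items())
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in pairs) + b"e"
    raise TypeError("unsupported type: %r" % type(value))


def make_info(path_bytes):
    """Build a made-up info dictionary around a single file path."""
    return {
        b"name": b"demo",
        b"piece length": 16384,
        b"pieces": b"\x00" * 20,
        b"files": [{b"length": 1, b"path": [path_bytes]}],
    }


def info_hash(info):
    """SHA-1 of the bencoded info dictionary, as a hex string."""
    return hashlib.sha1(bencode(info)).hexdigest()


# Assumed raw path bytes: "1", a lone 0xBD byte, then ".txt".
raw_name = b"1\xbd.txt"

# Decoding as UTF-8 with replacement and encoding again alters the bytes.
lossy_name = raw_name.decode("utf-8", errors="replace").encode("utf-8")

print(len(raw_name), len(lossy_name))
print(info_hash(make_info(raw_name)) == info_hash(make_info(lossy_name)))
```

Keeping the path as raw bytes from read to write leaves the bencoded info dictionary, and therefore the hash, unchanged.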

Here's a version of the original torrent without the tracker URL:

431f76f60e05250df162c90a73ab8377dc4ca9c8.zip

Screenshot of the terminal output when reading the file with the Torrent class (the file name is the correct SHA-1 hash):
Screenshot 2023-09-26 151205

EF BF BD means the filename contains a byte that is not valid UTF-8: we tried to parse it as UTF-8, and the invalid byte was replaced with U+FFFD, which is encoded as EF BF BD.
What's the encoding used in your filesystem for filenames?
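
The EF BF BD observation can be checked in a few lines. This is a small sketch that assumes the path in the torrent stores the character as the single byte 0xBD; the variable names are only illustrative.

```python
# Assumption: the torrent stores the character as one byte, 0xBD.
raw = b"\xbd"

# A lone 0xBD is not valid UTF-8, so decoding with replacement yields U+FFFD.
as_utf8 = raw.decode("utf-8", errors="replace")

# Latin-1 maps every byte to the code point with the same value.
as_latin1 = raw.decode("latin-1")

print(hex(ord(as_utf8)), as_utf8.encode("utf-8").hex(" "))      # 0xfffd ef bf bd
print(hex(ord(as_latin1)), as_latin1.encode("utf-8").hex(" "))  # 0xbd c2 bd
```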

I'm on Windows 11, which uses Unicode to encode file paths, if I understood correctly.

I think this specific torrent used latin-1 encoding for the file paths, so I guess this is very much a corner case.
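
One way such a corner case could be handled is a fallback decoder. The sketch below is only an illustration, not what torrentool does: `decode_path` is a hypothetical helper that tries UTF-8 first and falls back to latin-1, returning the encoding it used so the original bytes can be written back unchanged.

```python
from typing import Tuple


def decode_path(raw: bytes) -> Tuple[str, str]:
    """Hypothetical helper: return (text, encoding used) for a path component."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Latin-1 accepts every byte value, so this branch cannot fail.
        return raw.decode("latin-1"), "latin-1"


# Assumed path bytes with a lone 0xBD, as latin-1 would store the character.
text, encoding = decode_path(b"1\xbd.txt")
print(encoding)

# Encoding with the remembered codec restores the original bytes exactly.
print(text.encode(encoding) == b"1\xbd.txt")
```

A fallback like this is a guess: bytes written in another single-byte encoding would also decode as latin-1 without error, just to different characters than the author intended.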

Screenshot 2023-09-27 123011

Hm, latin-1... This comment seems to be relevant
#2 (comment)