rtf reader: unichr() causes ValueError
joka opened this issue · comments
I have an RTF file with strange Unicode strings (I've sent it to you by email).
This causes the rtf reader to throw a ValueError:
* Module pyth.plugins.rtf15.reader, line 93, in read
* Module pyth.plugins.rtf15.reader, line 113, in go
* Module pyth.plugins.rtf15.reader, line 141, in parse
* Module pyth.plugins.rtf15.reader, line 369, in handle
* Module pyth.plugins.rtf15.reader, line 476, in handle_u
ValueError: unichr() arg not in range(0x10000) (narrow Python build)
The reason is that my Python was built without support for "wide" Unicode characters (see http://www.python.org/dev/peps/pep-0261/). Still, some exception handling would be nice.
This is a legitimate bug but I'm not sure what the correct fix is. It's possible to construct a surrogate pair to represent the character, with something like:
struct.pack('<L', 0x10000).decode('utf-32-le')
Which will "work", but perhaps cause other bugs down the line with e.g. slicing. So maybe we shouldn't do it.
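For reference, here is a rough sketch of what the surrogate-pair route amounts to, and why len() and slicing become unreliable. The helper names to_surrogate_pair and utf16_length are made up for illustration and are not part of pyth:

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %#x" % code_point)
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def utf16_length(text):
    """Count UTF-16 code units, which is what len() reports on a narrow build."""
    return len(text.encode('utf-16-le')) // 2


high, low = to_surrogate_pair(0x10000)
print(hex(high), hex(low))
print(utf16_length(u'\U00010000'))
```

One character above U+FFFF takes two code units, so on a narrow build a slice can cut between the two halves and leave a lone surrogate behind.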
Other alternatives include replacing it with '?', or just raising a different exception type. I haven't decided.
I like the solution: replacing with ? + log message
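Something along these lines might do for the "? + log message" variant. This is only a sketch: safe_unichr is a hypothetical helper, not an existing pyth function, and how handle_u would call it is left open:

```python
import logging

logger = logging.getLogger(__name__)

try:
    _unichr = unichr  # Python 2
except NameError:
    _unichr = chr     # Python 3


def safe_unichr(code, replacement=u'?'):
    """Return the character for `code`, or `replacement` if it cannot be built.

    Hypothetical helper: on a narrow Python 2 build, unichr() raises
    ValueError for anything above 0xFFFF, so we log and substitute instead.
    """
    try:
        return _unichr(code)
    except (ValueError, OverflowError):
        logger.warning("Cannot represent code point %r, using %r instead",
                       code, replacement)
        return replacement


print(safe_unichr(0x41))
print(safe_unichr(0x110000))
```

The reader keeps going on odd input, and the warning still leaves a trace of which code point was dropped.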
I've tried the struct trick above. It means that plugins should never trust len() of unicode strings, or slice them. But that's probably true anyway.