brendonh / pyth

Python text markup and conversion

rtf reader: unichr() causes ValueError ()

joka opened this issue · comments

I have an RTF file with strange Unicode strings (I've sent it to you by email).

This causes the RTF reader to throw a ValueError:

* Module pyth.plugins.rtf15.reader, line 93, in read
* Module pyth.plugins.rtf15.reader, line 113, in go
* Module pyth.plugins.rtf15.reader, line 141, in parse
* Module pyth.plugins.rtf15.reader, line 369, in handle
* Module pyth.plugins.rtf15.reader, line 476, in handle_u
ValueError: unichr() arg not in range(0x10000) (narrow Python build) 

The reason is that my Python was built without support for "wide" Unicode characters (http://www.python.org/dev/peps/pep-0261/). Still, handling the exception would be nice.

This is a legitimate bug but I'm not sure what the correct fix is. It's possible to construct a surrogate pair to represent the character, with something like:

struct.pack('<L', 0x10000).decode('utf-32-le')

Which will "work", but perhaps cause other bugs down the line with e.g. slicing. So maybe we shouldn't do it.
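For reference, the surrogate pair can also be computed by hand with the standard UTF-16 arithmetic, which makes it clearer why len() and slicing become unreliable: one character turns into two code units. This is only a sketch; `to_surrogate_pair` is a hypothetical helper name, not something that exists in pyth.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 (high, low) surrogates.

    Hypothetical helper for illustration; not part of pyth.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the supplementary range")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


high, low = to_surrogate_pair(0x10000)
print(hex(high), hex(low))
```

On a narrow build, the two values could then be passed to unichr() separately and concatenated, giving a string of length 2 for a single character.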

Other alternatives include replacing it with '?', or just raising a different exception type. I haven't decided.

I like the solution of replacing it with '?' and logging a message.
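A rough sketch of what "replace with '?' and log" could look like, assuming a small wrapper around unichr(). The name `safe_unichr` and the logger setup are made up for illustration and are not pyth's actual handle_u code. Note that on Python 3 (or a wide Python 2 build) the fallback only triggers for values outside the Unicode range, since 0x10000 and above are representable there.

```python
import logging

logger = logging.getLogger(__name__)

try:
    _unichr = unichr  # Python 2
except NameError:
    _unichr = chr     # Python 3


def safe_unichr(code_point, replacement=u"?"):
    """Return the character for code_point, or a placeholder if unrepresentable.

    Hypothetical wrapper for illustration; not part of pyth.
    """
    try:
        return _unichr(code_point)
    except (ValueError, OverflowError):
        # Narrow builds raise ValueError for code points above 0xFFFF.
        logger.warning("Replacing unrepresentable code point %r with %r",
                       code_point, replacement)
        return replacement
```

The reader would then call this wrapper instead of unichr() directly, so a bad \u value degrades to a placeholder rather than aborting the whole parse.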

I've tried the struct trick above. It means that plugins should never trust len() of unicode strings, or slice them. But that's probably true anyway.