rtf reader: non-ASCII metadata causes UnicodeDecodeError (OpenOffice RTF files)
joka opened this issue
Joscha Krutzki commented
I have OpenOffice RTF files with non-ASCII metadata (author):
{\info{\author Claudia Jürgens}{\creatim\yr2010\mo7\dy19\hr12\min45}{\author Claudia Jürgens}
{\revtim\yr2010\mo7\dy28\hr13\min27}{\printim\yr0\mo0\dy0\hr0\min0}{\comment
StarWriter}{\vern3000}}\deftab709
This causes UnicodeDecodeError:
Module pyth.plugins.rtf15.reader, line 93, in read
Module pyth.plugins.rtf15.reader, line 113, in go
Module pyth.plugins.rtf15.reader, line 147, in parse
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 0: ordinal not in range(128)
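For context, Python 2's unicode() applied to a byte string decodes it with the default ASCII codec, so any byte above 0x7f fails. A minimal sketch of the same failure, written for Python 3 with an explicit decode (the variable names are illustrative only and are not taken from pyth):

```python
# The byte 0xfc is how "ü" is stored in a single-byte Western code page.
raw = b"\xfc"

try:
    # Equivalent to what Python 2's unicode(raw) does implicitly.
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    # Keep the message so it can be compared with the traceback above.
    message = str(exc)

print(message)
```

The captured message should match the last line of the traceback.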
This patch just catches the error and substitutes '?' for the offending character:
*** reader.py 2010-05-04 21:48:14.000000000 +0200
--- reader.py 2010-08-04 21:47:10.000000000 +0200
***************
*** 140,146 ****
                  control, digits = self.getControl()
                  self.group.handle(control, digits)
              else:
!                 self.group.char(unicode(next))


      def getControl(self):
--- 140,149 ----
                  control, digits = self.getControl()
                  self.group.handle(control, digits)
              else:
!                 try:
!                     self.group.char(unicode(next))
!                 except UnicodeDecodeError, e:
!                     self.group.char('?')


      def getControl(self):
Brendon Hogger commented
Hi joka,
As with the \f0 issue, please send me a full RTF file to reproduce this, and I'll see if I can figure out the best fix.
Brendon Hogger commented
Fixed (in trunk) by decoding the char in the current group using its charset (i.e. the doc default charset for metadata), rather than blindly unicode()ing it.
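The idea behind that fix can be sketched roughly as follows. This is not the code committed to trunk: the function name decode_rtf_bytes is hypothetical, and it assumes the document declares a Windows code page via \ansicpgN (1252 is assumed as the default here), which may not hold for every file.

```python
# Hypothetical helper, not pyth's actual implementation.
def decode_rtf_bytes(raw, codepage=1252):
    # Map the RTF code page number (the N in \ansicpgN) to a Python codec name.
    codec = "cp%d" % codepage
    try:
        # Undefined bytes become U+FFFD instead of raising.
        return raw.decode(codec, errors="replace")
    except LookupError:
        # Unknown code page: fall back to the assumed Western default.
        return raw.decode("cp1252", errors="replace")


author = decode_rtf_bytes(b"Claudia J\xfcrgens")
print(author)
```

Unlike catching the error and emitting '?', this keeps the original character whenever the charset is known.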