rtf reader: non-ASCII metadata causes UnicodeDecodeError (OpenOffice RTF files)

Question

joka opened this issue 14 years ago

I have OpenOffice RTF files with non-ASCII metadata (author):

{\info{\author Claudia Jürgens}{\creatim\yr2010\mo7\dy19\hr12\min45}{\author Claudia Jürgens}
{\revtim\yr2010\mo7\dy28\hr13\min27}{\printim\yr0\mo0\dy0\hr0\min0}{\comment    
StarWriter}{\vern3000}}\deftab709

This causes a UnicodeDecodeError:

Module pyth.plugins.rtf15.reader, line 93, in read
Module pyth.plugins.rtf15.reader, line 113, in go                                           
Module pyth.plugins.rtf15.reader, line 147, in parse
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 0: ordinal not in range(128)
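
The failure can be reproduced in isolation. The following is a minimal Python 3 sketch, not pyth's code: the helper name `ascii_decode_error` is made up for illustration, and it assumes the "ü" in the author field is stored as the single byte 0xfc (as in Latin-1 or Windows-1252), which matches the byte named in the traceback.

```python
# Minimal sketch (Python 3): ASCII-decoding the byte 0xfc fails the same way
# as the traceback above. `ascii_decode_error` is a hypothetical helper.

def ascii_decode_error(data):
    """Return the UnicodeDecodeError message for ASCII-decoding data, or None."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


raw = b"\xfc"  # assumed encoding of "ü" in the RTF file (Latin-1 / cp1252)
message = ascii_decode_error(raw)
print(message)
```

Plain ASCII input such as `b"Claudia J"` returns `None` here, which is why only files with non-ASCII metadata trigger the error.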

This patch just catches the error and substitutes '?' for the character that could not be decoded:

*** reader.py  2010-05-04 21:48:14.000000000 +0200 
--- reader.py   2010-08-04 21:47:10.000000000 +0200 
***************       
*** 140,146 ****      
                  control, digits = self.getControl() 
                  self.group.handle(control, digits) 
              else:   
!                 self.group.char(unicode(next)) 


      def getControl(self): 
--- 140,149 ----      
                  control, digits = self.getControl() 
                  self.group.handle(control, digits) 
              else:   
!                 try: 
!                     self.group.char(unicode(next)) 
!                 except UnicodeDecodeError, e: 
!                     self.group.char('?') 


      def getControl(self):

Brendon Hogger · Answer 1 · Sat Aug 14 2010 23:09:12 GMT+0800 (China Standard Time)

Hi joka,

As with the \f0 issue, please send me a full RTF file to reproduce this, and I'll see if I can figure out the best fix.

Brendon Hogger · Answer 2 · Thu Aug 19 2010 05:31:46 GMT+0800 (China Standard Time)

Fixed (in trunk) by decoding the char in the current group using its charset (i.e. the doc default charset for metadata), rather than blindly unicode()ing it.
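
The idea behind that fix can be sketched roughly as follows. This is not the code committed to trunk: `decode_char` and `DEFAULT_CHARSET` are hypothetical names, the sketch is written for Python 3, and it assumes the document default charset is Windows-1252 (in a real file this would come from the RTF header, for example the `\ansicpg` control word).

```python
# Rough sketch of "decode with the group's charset" (Python 3).
# Hypothetical names; not pyth's actual implementation.

DEFAULT_CHARSET = "cp1252"  # assumption: document default charset


def decode_char(raw, charset=None):
    """Decode raw bytes using the group's charset, else the document default.

    Bytes that are invalid in the chosen charset become U+FFFD rather than
    raising UnicodeDecodeError.
    """
    codec = charset or DEFAULT_CHARSET
    return raw.decode(codec, errors="replace")


# Metadata groups have no font of their own, so the document default applies.
author = decode_char(b"Claudia J\xfcrgens")
print(author)
```

Compared with the patch in the original report, this keeps the real character instead of replacing it with '?', and only falls back to a replacement character when the byte is not valid in the chosen charset.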