writing documents fails because of encoding errors when writing the log file

Question

fbergmann opened this issue 2 years ago · comments

I currently have the issue that the code does not allow writing out documents because writing the history failed:

enzymemlwriter.py:179, in EnzymeMLWriter._createArchive(self, enzmldoc, listofPaths, name)
    177 history_path = f"{self.path}/history.log"
    178 with open(history_path, "w") as f:
--> 179     f.write(enzmldoc.log.getvalue())
    181 self.addFileToArchive(
    182     archive=archive,
    183     file_path=history_path,
   (...)
    186     description="History of the EnzymeML document",
    187 )
    189 # add metadata to the experiment file

File cp1252.py:19, in IncrementalEncoder.encode(self, input, final)
     18 def encode(self, input, final=False):
---> 19     return codecs.charmap_encode(input,self.errors,encoding_table)[0]

UnicodeEncodeError: 'charmap' codec can't encode character '\udcb7' in position 4532: character maps to <undefined>

It is important that the encoding is defined when writing the history. If you don't, open() uses the platform's locale encoding (cp1252 in the traceback above), so different operating systems will write the file with different encodings.

Frank Bergmann · Answer 1 · Fri Mar 11 2022 18:25:20 GMT+0800 (China Standard Time)

Actually, after trying out different encodings, only to have them all fail, I think it would be best to just write binary (using open mode "wb"). That way, no matter what was written into the StringIO buffer, you can write it out.

Jan Range · Answer 2 · Fri Mar 11 2022 19:43:19 GMT+0800 (China Standard Time)

Could you provide an example to reproduce the error? On which OS did it happen? Want to make sure it's not an error from somewhere else.

Frank Bergmann · Answer 3 · Fri Mar 11 2022 20:07:45 GMT+0800 (China Standard Time)

The file I sent earlier (the one I mailed around) has the issue. The problem came about because, when I created the file earlier by running Cephalexin_Synthesis_Model4.ipynb, the history.log was written out in the ANSI code page: since the EnzymeML writer did not specify an encoding, it used the native Windows encoding in this case.

If that file is then read again by the EnzymeML reader and subsequently written out by the EnzymeML writer, the issue occurs.
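A minimal sketch of how such a round trip can end in the error above. It rests on an assumption not confirmed in this thread: that the log bytes are decoded with `errors="surrogateescape"` somewhere on the reading side. Under that assumption, the cp1252 byte 0xB7 (a middle dot) turns into the lone surrogate '\udcb7' from the traceback, which neither cp1252 nor strict UTF-8 can encode. The helper `can_encode` is hypothetical and only there for illustration.

```python
# Byte 0xB7 is the middle dot in cp1252; on its own it is not valid UTF-8.
raw = "a·b".encode("cp1252")

# ASSUMPTION: the reading side decodes with errors="surrogateescape",
# which maps the undecodable byte 0xB7 to the lone surrogate U+DCB7.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))


def can_encode(s: str, encoding: str) -> bool:
    """Hypothetical helper: True if `s` encodes under `encoding` in strict mode."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Strict encoding fails for both the Windows code page and UTF-8.
print(can_encode(text, "cp1252"), can_encode(text, "utf-8"))

# Only the matching error handler restores the original bytes.
print(text.encode("utf-8", errors="surrogateescape") == raw)
```

If this is what happens, it would explain why switching the encoding alone does not help for files that already contain such characters.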

Frank Bergmann · Answer 4 · Fri Mar 11 2022 20:21:41 GMT+0800 (China Standard Time)

So I recreated the file using the Python notebook mentioned above with:

            with open(history_path, "w", encoding='utf-8') as f:
                f.write(enzmldoc.log.getvalue())

in the writer. Then I can round-trip the files (it will still fall over on the old file that I mailed around earlier). So the only alternative is to write the file as binary and manually encode the string returned from the StringIO object. There you then have the option of specifying what should happen in case of an error:

strict - default response, which raises a UnicodeEncodeError exception on failure
ignore - drops the unencodable characters from the result
replace - replaces each unencodable character with a question mark ?
xmlcharrefreplace - inserts an XML character reference instead of the unencodable character
backslashreplace - inserts a backslash escape sequence (such as \uNNNN) instead of the unencodable character
namereplace - inserts a \N{...} escape sequence instead of the unencodable character
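The binary-write approach with an explicit error handler could look roughly like the sketch below. The function name `write_history`, its parameters, and the choice of `backslashreplace` as the default are assumptions made for illustration, not the code that went into the writer. The `io.StringIO` object stands in for `enzmldoc.log`.

```python
import io


def write_history(path, text, errors="backslashreplace"):
    """Hypothetical helper: encode `text` as UTF-8 and write the raw bytes.

    `errors` is passed to str.encode and decides what happens to
    characters that cannot be encoded (e.g. lone surrogates).
    """
    data = text.encode("utf-8", errors=errors)
    with open(path, "wb") as f:
        f.write(data)
    return data


# Stand-in for enzmldoc.log, containing the character from the traceback.
log = io.StringIO("a\udcb7b")

# Compare what the non-strict handlers produce for the same input.
for handler in ("ignore", "replace", "xmlcharrefreplace", "backslashreplace"):
    print(handler, log.getvalue().encode("utf-8", errors=handler))
```

Calling `write_history("history.log", log.getvalue())` then writes the file without raising, at the cost of an escaped or dropped character depending on the handler chosen.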

Jan Range · Answer 5 · Wed Mar 16 2022 18:13:00 GMT+0800 (China Standard Time)

Thanks for fixing that! I would propose that on failure an empty history file be written. Otherwise the reader might throw an error if there is none.
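A rough sketch of that fallback, assuming the history is encoded as UTF-8 in strict mode and that an empty file is acceptable to the reader. The name `write_history_or_empty` is hypothetical and does not come from the library.

```python
def write_history_or_empty(path, text):
    """Hypothetical helper: write `text` as UTF-8, or an empty file on failure.

    Returns True if the history was written, False if the fallback was used.
    """
    try:
        data = text.encode("utf-8")  # strict mode: raises on lone surrogates
        ok = True
    except UnicodeEncodeError:
        data = b""  # fall back to an empty history so the file still exists
        ok = False
    with open(path, "wb") as f:
        f.write(data)
    return ok
```

The file is always created, so a reader that expects history.log to be present should not fail on a missing file. The trade-off is that the history content is lost whenever encoding fails.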