EnzymeML / PyEnzyme

🧬 - Data management and modeling framework based on EnzymeML.

Writing documents fails because of encoding errors when writing the log file

fbergmann opened this issue

I currently have the issue that the code does not allow writing out documents because writing the history failed:

enzymemlwriter.py:179, in EnzymeMLWriter._createArchive(self, enzmldoc, listofPaths, name)
    177 history_path = f"{self.path}/history.log"
    178 with open(history_path, "w") as f:
--> 179     f.write(enzmldoc.log.getvalue())
    181 self.addFileToArchive(
    182     archive=archive,
    183     file_path=history_path,
   (...)
    186     description="History of the EnzymeML document",
    187 )
    189 # add metadata to the experiment file

File cp1252.py:19, in IncrementalEncoder.encode(self, input, final)
     18 def encode(self, input, final=False):
---> 19     return codecs.charmap_encode(input,self.errors,encoding_table)[0]

UnicodeEncodeError: 'charmap' codec can't encode character '\udcb7' in position 4532: character maps to <undefined>

It is important that the encoding is defined when writing the history. If you don't, different operating systems will use different encodings.

Actually, after trying out different encodings, only to have them all fail, I think it would be best to just write binary (using open mode "wb"). That way, no matter what was written into the StringIO buffer, you can write it out.
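
A minimal sketch of that idea, assuming the log is an `io.StringIO` as the traceback suggests. The helper name `write_history`, the sample log text, and the choice of UTF-8 with `backslashreplace` are illustrative assumptions, not PyEnzyme code:

```python
import io
import os
import tempfile


def write_history(log: io.StringIO, history_path: str) -> None:
    """Hypothetical helper: write the log as bytes so the platform default encoding is never used."""
    # Encode explicitly; backslashreplace keeps unencodable characters visible as escapes.
    data = log.getvalue().encode("utf-8", errors="backslashreplace")
    with open(history_path, "wb") as f:
        f.write(data)


# Usage: a made-up log entry holding the lone surrogate from the traceback.
log = io.StringIO()
log.write("added reactant \udcb7 to reaction")
history_path = os.path.join(tempfile.mkdtemp(), "history.log")
write_history(log, history_path)
```

Because the string is encoded by hand, the error handler is chosen in one visible place rather than inherited from the operating system.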

Could you provide an example to reproduce the error? On which OS did it happen? Want to make sure it's not an error from somewhere else.

The file I mailed around earlier has the issue. The problem arose because, when I created that file by running Cephalexin_Synthesis_Model4.ipynb, the history.log was written in an ANSI encoding (since the EnzymeML writer did not specify an encoding, it used the native one, in this case the Windows default).

If that file is then read again by the EnzymeML reader and subsequently written out by the EnzymeML writer, the issue occurs.
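
One way such a round trip could produce the '\udcb7' from the traceback is sketched below. This is an assumption: the thread does not show how the reader decodes history.log, and the middle dot is only a guess at the offending character.

```python
# Assumption: the old history.log contained a middle dot (U+00B7),
# which cp1252 stores as the single byte 0xB7.
raw = "\u00b7".encode("cp1252")

# Assumption: the bytes are later decoded as UTF-8 with surrogateescape,
# which maps the invalid byte 0xB7 to the lone surrogate U+DCB7.
text = raw.decode("utf-8", errors="surrogateescape")


def can_encode(s: str, encoding: str) -> bool:
    """Return True if s can be encoded with the default (strict) error handler."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# A lone surrogate is rejected by every strict encoder tried here.
results = {enc: can_encode(text, enc) for enc in ("cp1252", "latin-1", "utf-8")}
print(repr(text), results)
```

If this is what happens, it would also explain why switching to another encoding alone does not help with the old file.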

So I recreated the file using the Python notebook mentioned above with:

            with open(history_path, "w", encoding='utf-8') as f:
                f.write(enzmldoc.log.getvalue())

in the reader. Then I can round-trip the files. (It will still fall over on the old file that I mailed around earlier.) So the only alternative is to write the file as binary and manually encode the string returned from the StringIO object. There you have the option of specifying what should happen in case of an error:

  • strict - default response, which raises a UnicodeEncodeError exception on failure
  • ignore - drops the unencodable character from the result
  • replace - replaces the unencodable character with a question mark (?)
  • xmlcharrefreplace - inserts an XML character reference instead of the unencodable character
  • backslashreplace - inserts a \uNNNN escape sequence instead of the unencodable character
  • namereplace - inserts a \N{...} escape sequence instead of the unencodable character
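
To compare the handlers listed above on the character from the traceback, a short sketch follows. The sample string is made up, and `namereplace` is left out because a lone surrogate has no Unicode name.

```python
# Made-up sample text containing the lone surrogate from the traceback.
problem = "pH \udcb7 7"

# Encode the same string with several of the error handlers listed above.
handlers = ["ignore", "replace", "xmlcharrefreplace", "backslashreplace"]
encoded = {name: problem.encode("utf-8", errors=name) for name in handlers}
for name, data in encoded.items():
    print(f"{name:18} {data!r}")

# The default handler (strict) raises instead of producing output.
try:
    problem.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

Of these, `backslashreplace` and `xmlcharrefreplace` keep a visible trace of the original code point, while `ignore` and `replace` discard it.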

Thanks for fixing that! I would propose that on failure an empty history file is written. Otherwise the reader might throw an error if there is none.
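
A rough sketch of that fallback, assuming the log is an `io.StringIO`. The function name `write_history_or_empty` and its return value are hypothetical and not part of PyEnzyme:

```python
import io
import os
import tempfile


def write_history_or_empty(log: io.StringIO, history_path: str) -> bool:
    """Hypothetical helper: write the log, or an empty file if it cannot be encoded.

    Returns True if the log was written, False if the empty fallback was used.
    """
    try:
        data = log.getvalue().encode("utf-8")  # strict: may raise UnicodeEncodeError
        ok = True
    except UnicodeEncodeError:
        data = b""
        ok = False
    # The file is created in both cases, so a reader always finds a history.log.
    with open(history_path, "wb") as f:
        f.write(data)
    return ok


# Usage: one log that encodes cleanly and one containing a lone surrogate.
tmp_dir = tempfile.mkdtemp()
good_path = os.path.join(tmp_dir, "good.log")
bad_path = os.path.join(tmp_dir, "bad.log")
good_ok = write_history_or_empty(io.StringIO("created document"), good_path)
bad_ok = write_history_or_empty(io.StringIO("broken \udcb7 entry"), bad_path)
```

The trade-off is that the history is lost silently on failure, so a caller may want to log or warn when the function returns False.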