choderalab / perses

Experiments with expanded ensembles to explore chemical space

Home Page: http://perses.readthedocs.io

Serialized XML objects in small molecule pipeline have the wrong extension

ijpulidos opened this issue

When we serialize objects in the small molecule pipeline with

data.serialize(htf[phase].hybrid_system, trajectory_directory / "xml" / f"{phase}-hybrid-system.gz")
data.serialize(htf[phase]._old_system, trajectory_directory / "xml" / f"{phase}-old-system.gz")
data.serialize(htf[phase]._new_system, trajectory_directory / "xml" / f"{phase}-new-system.gz")

the output files are supposed to be gzipped, but inspecting them shows that they are plain, uncompressed XML:

❯ file *
complex-hybrid-system.gz: XML 1.0 document, ASCII text
complex-new-system.gz:    XML 1.0 document, ASCII text
complex-old-system.gz:    XML 1.0 document, ASCII text
solvent-hybrid-system.gz: XML 1.0 document, ASCII text
solvent-new-system.gz:    XML 1.0 document, ASCII text
solvent-old-system.gz:    XML 1.0 document, ASCII text

The expected output for each file is "gzip compressed data".
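The same check can be done from Python by looking at the first two bytes of each file, since every gzip stream starts with the magic number 0x1f 0x8b. This is only a minimal sketch: is_gzip and the demo file names are hypothetical helpers made up for illustration and are not part of perses.

```python
import gzip
import tempfile
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def is_gzip(path):
    """Return True if the file at `path` starts with the gzip magic number.

    Hypothetical helper for illustration; not part of perses.
    """
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


# Small demo: two files with a .gz extension, only one actually compressed.
with tempfile.TemporaryDirectory() as tmp:
    xml_text = '<?xml version="1.0" ?><System/>'

    plain = Path(tmp) / "plain-system.gz"
    plain.write_text(xml_text)  # plain XML despite the extension

    packed = Path(tmp) / "packed-system.gz"
    with gzip.open(packed, "wt") as fh:  # really gzip-compressed
        fh.write(xml_text)

    print(is_gzip(plain), is_gzip(packed))  # False True
```

A check like this looks only at the file contents, so it is not fooled by the extension.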

Do we want to have them compressed, or do we want to keep them uncompressed?

We do want to compress them.

Okay this was a fun one: https://github.com/choderalab/perses/blob/main/perses/utils/data.py#L114-L127

We do save the XML with gzip, but the bz2 check is a separate if instead of an elif. For a gzipped file the bz2 check fails, so we fall through to its else block and overwrite the file with an uncompressed version. I've got a PR incoming with some extra debug output that will help troubleshoot issues like this.
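The control-flow problem described above can be reproduced in a few lines. The following is a sketch under stated assumptions, not the actual code in perses/utils/data.py: the function names write_buggy, write_fixed, and starts_with_gzip_magic are hypothetical, and the sketch assumes the format is chosen from the file extension.

```python
import bz2
import gzip
import tempfile
from pathlib import Path


def write_buggy(text, path):
    """Hypothetical reproduction of the bug: two independent if statements."""
    path = str(path)
    if path.endswith(".gz"):
        with gzip.open(path, "wt") as fh:
            fh.write(text)
    if path.endswith(".bz2"):  # bug: this should be elif
        with bz2.open(path, "wt") as fh:
            fh.write(text)
    else:
        # Also runs for .gz paths, overwriting the gzipped file with plain text.
        with open(path, "w") as fh:
            fh.write(text)


def write_fixed(text, path):
    """Same logic with a single if/elif/else chain, so only one branch runs."""
    path = str(path)
    if path.endswith(".gz"):
        with gzip.open(path, "wt") as fh:
            fh.write(text)
    elif path.endswith(".bz2"):
        with bz2.open(path, "wt") as fh:
            fh.write(text)
    else:
        with open(path, "w") as fh:
            fh.write(text)


def starts_with_gzip_magic(path):
    """Return True if the file begins with the gzip magic bytes 0x1f 0x8b."""
    with open(path, "rb") as fh:
        return fh.read(2) == b"\x1f\x8b"


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "complex-hybrid-system.gz"
    xml_text = '<?xml version="1.0" ?><System/>'

    write_buggy(xml_text, target)
    print(starts_with_gzip_magic(target))  # False

    write_fixed(xml_text, target)
    print(starts_with_gzip_magic(target))  # True
```

With the elif in place, exactly one branch writes the file, so a gzipped file is no longer clobbered by the uncompressed fallback.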