biopython / biopython

Official git repository for Biopython (originally converted from CVS)

Home Page:http://biopython.org/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

two test failures affecting Bio.SeqIO.SeqXmlIO in Debian sid

emollier opened this issue · comments

Greetings,

This issue may not be an easy one, and even risks being Debian
specific, so if you are busy somewhere else, then I won't mind.
:)

I'm having great difficulties pinpointing a change between
Debian testing and sid which is causing Debian bug #1064147,
affecting both Biopython 1.81 and 1.83 in sid, but none of them
in testing (as of today). The error output comes from the test
suite, and gives in Debian sid:

======================================================================
ERROR: test_embl7 (test_SeqIO.TestSeqIO.test_embl7)
----------------------------------------------------------------------
Traceback (most recent call last):
 File "/tmp/autopkgtest.OslsYr/autopkgtest_tmp/Tests/test_SeqIO.py", line 3406, in test_embl7
   self.perform_test(
 File "/tmp/autopkgtest.OslsYr/autopkgtest_tmp/Tests/test_SeqIO.py", line 625, in perform_test
   self.check_simple_write_read(
 File "/tmp/autopkgtest.OslsYr/autopkgtest_tmp/Tests/test_SeqIO.py", line 362, in check_simple_write_read
   records2 = list(SeqIO.parse(handle=handle, format=fmt))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/lib/python3/dist-packages/Bio/SeqIO/Interfaces.py", line 85, in __next__
   return next(self.records)
          ^^^^^^^^^^^^^^^^^^
 File "/usr/lib/python3/dist-packages/Bio/SeqIO/SeqXmlIO.py", line 482, in iterate
   parser.close()
 File "/usr/lib/python3.11/xml/sax/expatreader.py", line 240, in close
   self.feed(b"", isFinal=True)
 File "/usr/lib/python3.11/xml/sax/expatreader.py", line 217, in feed
   self._parser.Parse(data, isFinal)
 File "../Modules/pyexpat.c", line 416, in StartElement
 File "/usr/lib/python3.11/xml/sax/expatreader.py", line 369, in start_element_ns
   self._cont_handler.startElementNS(pair, None,
 File "/usr/lib/python3/dist-packages/Bio/SeqIO/SeqXmlIO.py", line 184, in startEntryFieldElement
   return self.startPropertyElement(attrs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/lib/python3/dist-packages/Bio/SeqIO/SeqXmlIO.py", line 374, in startPropertyElement
   record = self.records[-1]
            ~~~~~~~~~~~~^^^^
IndexError: list index out of range

======================================================================
ERROR: test_genbank8 (test_SeqIO.TestSeqIO.test_genbank8)
----------------------------------------------------------------------
Traceback (most recent call last):
 File "/tmp/autopkgtest.OslsYr/autopkgtest_tmp/Tests/test_SeqIO.py", line 2803, in test_genbank8
   self.perform_test(
 File "/tmp/autopkgtest.OslsYr/autopkgtest_tmp/Tests/test_SeqIO.py", line 625, in perform_test
   self.check_simple_write_read(
 File "/tmp/autopkgtest.OslsYr/autopkgtest_tmp/Tests/test_SeqIO.py", line 362, in check_simple_write_read
   records2 = list(SeqIO.parse(handle=handle, format=fmt))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/lib/python3/dist-packages/Bio/SeqIO/Interfaces.py", line 85, in __next__
   return next(self.records)
          ^^^^^^^^^^^^^^^^^^
 File "/usr/lib/python3/dist-packages/Bio/SeqIO/SeqXmlIO.py", line 482, in iterate
   parser.close()
 File "/usr/lib/python3.11/xml/sax/expatreader.py", line 240, in close
   self.feed(b"", isFinal=True)
 File "/usr/lib/python3.11/xml/sax/expatreader.py", line 217, in feed
   self._parser.Parse(data, isFinal)
 File "../Modules/pyexpat.c", line 416, in StartElement
 File "/usr/lib/python3.11/xml/sax/expatreader.py", line 369, in start_element_ns
   self._cont_handler.startElementNS(pair, None,
 File "/usr/lib/python3/dist-packages/Bio/SeqIO/SeqXmlIO.py", line 184, in startEntryFieldElement
   return self.startPropertyElement(attrs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/lib/python3/dist-packages/Bio/SeqIO/SeqXmlIO.py", line 374, in startPropertyElement
   record = self.records[-1]
            ~~~~~~~~~~~~^^^^
IndexError: list index out of range

Version comparison of Biopython direct dependencies suggest they
are not involved in the test regression I observe:

  • python3-numpy version in sid: 1:1.24.2-2
  • python3-numpy version in testing: 1:1.24.2-2
  • python3 version in sid: 3.11.6-1
  • python3 version in testing: 3.11.6-1
  • libc6 version in sid: 2.37-15
  • libc6 version in testing: 2.37-15
  • python3-reportlab version in sid: 4.1.0-1
  • python3-reportlab version in testing: 4.1.0-1
  • w3c-sgml-lib version in sid: 1.3-3
  • w3c-sgml-lib version in testing: 1.3-3

This means that the issue is caused by a transitive dependency
(which I have not managed to identify yet), or something else
entirely. I also reproduced the problem on Debian Salsa CI,
which strongly hints that I haven't mishandled my build
environment. Besides, if one of the direct dependencies had
triggered the bug, then I would have expected the test run
triggered by the package migration CI to have tripped, which
has not happened (which hints that the situation did not appear
before 2024 February 3rd in sid).

I wouldn't be surprised to learn that this isn't be a problem in
Biopython per se, but I'm running out of options without your
thoughts upstream (maybe the combination of these two particular
tests failing and the others passing hints to something in
particular?). Do you per chance have an idea of what I am
missing that could cause the SeqXmlIO records to be empty and
cause these test failures?

Thank you for your time,
Étienne.

PS: here below, the mandatory form for your convenience:

Setup

I am reporting a problem with Biopython version, Python version,
and operating system as follows. In Debian sid:

>>> import sys; print(sys.version)
3.11.8 (main, Feb  7 2024, 21:52:08) [GCC 13.2.0]
>>> import platform; print(platform.python_implementation()); print(platform.platform())
CPython
Linux-6.6.15-amd64-x86_64-with-glibc2.37
>>> import Bio; print(Bio.__version__)
1.83

In Debian testing (no changes visibly):

>>> import sys; print(sys.version)
3.11.8 (main, Feb  7 2024, 21:52:08) [GCC 13.2.0]
>>> import platform; print(platform.python_implementation()); print(platform.platform())
CPython
Linux-6.6.15-amd64-x86_64-with-glibc2.37
>>> import Bio; print(Bio.__version__)
1.83

This is also valid with Biopython 1.81.

Expected behaviour

I would like to make all test items of the test suite pass on
Debian sid.

Actual behaviour

All test items pass on Debian testing. However on Debian sid, I
observe the following test items failing:

======================================================================
ERROR: test_embl7 (test_SeqIO.TestSeqIO.test_embl7)
----------------------------------------------------------------------
Traceback (most recent call last):
 File "/tmp/autopkgtest.OslsYr/autopkgtest_tmp/Tests/test_SeqIO.py", line 3406, in test_embl7
   self.perform_test(
 File "/tmp/autopkgtest.OslsYr/autopkgtest_tmp/Tests/test_SeqIO.py", line 625, in perform_test
   self.check_simple_write_read(
 File "/tmp/autopkgtest.OslsYr/autopkgtest_tmp/Tests/test_SeqIO.py", line 362, in check_simple_write_read
   records2 = list(SeqIO.parse(handle=handle, format=fmt))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/lib/python3/dist-packages/Bio/SeqIO/Interfaces.py", line 85, in __next__
   return next(self.records)
          ^^^^^^^^^^^^^^^^^^
 File "/usr/lib/python3/dist-packages/Bio/SeqIO/SeqXmlIO.py", line 482, in iterate
   parser.close()
 File "/usr/lib/python3.11/xml/sax/expatreader.py", line 240, in close
   self.feed(b"", isFinal=True)
 File "/usr/lib/python3.11/xml/sax/expatreader.py", line 217, in feed
   self._parser.Parse(data, isFinal)
 File "../Modules/pyexpat.c", line 416, in StartElement
 File "/usr/lib/python3.11/xml/sax/expatreader.py", line 369, in start_element_ns
   self._cont_handler.startElementNS(pair, None,
 File "/usr/lib/python3/dist-packages/Bio/SeqIO/SeqXmlIO.py", line 184, in startEntryFieldElement
   return self.startPropertyElement(attrs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/lib/python3/dist-packages/Bio/SeqIO/SeqXmlIO.py", line 374, in startPropertyElement
   record = self.records[-1]
            ~~~~~~~~~~~~^^^^
IndexError: list index out of range

======================================================================
ERROR: test_genbank8 (test_SeqIO.TestSeqIO.test_genbank8)
----------------------------------------------------------------------
Traceback (most recent call last):
 File "/tmp/autopkgtest.OslsYr/autopkgtest_tmp/Tests/test_SeqIO.py", line 2803, in test_genbank8
   self.perform_test(
 File "/tmp/autopkgtest.OslsYr/autopkgtest_tmp/Tests/test_SeqIO.py", line 625, in perform_test
   self.check_simple_write_read(
 File "/tmp/autopkgtest.OslsYr/autopkgtest_tmp/Tests/test_SeqIO.py", line 362, in check_simple_write_read
   records2 = list(SeqIO.parse(handle=handle, format=fmt))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/lib/python3/dist-packages/Bio/SeqIO/Interfaces.py", line 85, in __next__
   return next(self.records)
          ^^^^^^^^^^^^^^^^^^
 File "/usr/lib/python3/dist-packages/Bio/SeqIO/SeqXmlIO.py", line 482, in iterate
   parser.close()
 File "/usr/lib/python3.11/xml/sax/expatreader.py", line 240, in close
   self.feed(b"", isFinal=True)
 File "/usr/lib/python3.11/xml/sax/expatreader.py", line 217, in feed
   self._parser.Parse(data, isFinal)
 File "../Modules/pyexpat.c", line 416, in StartElement
 File "/usr/lib/python3.11/xml/sax/expatreader.py", line 369, in start_element_ns
   self._cont_handler.startElementNS(pair, None,
 File "/usr/lib/python3/dist-packages/Bio/SeqIO/SeqXmlIO.py", line 184, in startEntryFieldElement
   return self.startPropertyElement(attrs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/lib/python3/dist-packages/Bio/SeqIO/SeqXmlIO.py", line 374, in startPropertyElement
   record = self.records[-1]
            ~~~~~~~~~~~~^^^^
IndexError: list index out of range

Steps to reproduce

Run the test suite.

That is odd. I've only skimmed the tracebacks, however I can give some context.

Both the failures are within check_simple_write_read which takes an existing sequence file (here one in EMBL format, and one in the related GenBank format), exports it as another format, and confirms we can read it back again with no unexpected data loss.

This suggests to me the issue is either in writing out the SeqXML format, or in reading SeqXML format. Libraries this code imports to do that:

from xml import sax
from xml.sax import handler
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl

Are there any likely looking XML SAX libraries whose version changed?

It might be helpful to try the SeqXML output directly, and validate the XML file. e.g.

$ python -c "from Bio import SeqIO; SeqIO.convert('Tests/EMBL/SC10H5.embl', 'embl', 'check_me.xml', 'seqxml')"

The other thing I would double check is the system locale (just because that has caused oddities before).

Hi Peter,

Are there any likely looking XML SAX libraries whose version changed?

I looked there first, but the xml.sax looks to be part of the standard library, which is tightly coupled to the python3 interpreter:

$ apt-file search /xml/sax/xmlreader.py
libpython3.10-stdlib: /usr/lib/python3.10/xml/sax/xmlreader.py
libpython3.11-stdlib: /usr/lib/python3.11/xml/sax/xmlreader.py
libpython3.12-stdlib: /usr/lib/python3.12/xml/sax/xmlreader.py
[... snipped various irrelevant stdlib embeddings ...]

So the issue would have been in an external dependency of the module. Skimming through python3.11 shows nothing standing out; dependencies are merely the compiler and libc.

The libc hasn't changed, so, urgh, I'm hoping I'm not running into a long winded compiler bug, the only difference I see that might affect xml.sax operation between testing and sid is gcc (through libgcc_s.so.1):

  • testing has gcc 14-20240201 (beginning of the month)
  • sid has gcc 14-20240221 (end of the month)

Possibly relevant milestone would be gcc 14-20240207, as I know for sure I witnessed the issue before Berlin sprint on 16th and 17th. If further testing prove this happens to be the relevant change, then I will move back the discussion in Debian and close the present issue.

It might be helpful to try the SeqXML output directly, and validate the XML file. e.g.

$ python -c "from Bio import SeqIO; SeqIO.convert('Tests/EMBL/SC10H5.embl', 'embl', 'check_me.xml', 'seqxml')"

Thanks for the snippet, for what it's worth I obtained the same output from either testing or sid. I guess this might point towards a problem with reading back the SeqXML? The file obtained out of testing is in attachment if need be.

testing.xml.gz

The other thing I would double check is the system locale (just because that has caused oddities before).

I think we've been bitten by that in the past. Nowadays we run the test suite enforcing LC_ALL=C.UTF-8. This is probably not the issue.

Thank you for your time,
Étienne.

Re,

If further testing prove [libgcc_s.so.1] happens to be the relevant change, then I will move back the discussion in Debian and close the present issue.

I reran the test suite in Debian testing with a targeted upgrade of the libgcc_s.so.1 from sid, and tests passed, so this is not related to the compiler library.

If you have a snippet to reread the output of the snippet that produced the xml, I would be taker.

Have a nice day, :)
Étienne.

Well a one line to parse the XML file back again could be:

$ python -c "from Bio import SeqIO; print(SeqIO.read('check_me.xml', 'seqxml').id)"
AL031232

Also worth a try is:

$ cd Tests/
$ python test_SeqIO.py

If that works (outside of run_tests.py) that is rather puzzling. If it fails, we might start cutting test_SeqIO.py down to a smaller and smaller test.

Thanks for the reader, it looks like the issue stems solely from the reading, I see on sid:

$ python3 -c "from Bio import SeqIO; print(SeqIO.read('check_me.xml', 'seqxml').id)"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/Bio/SeqIO/__init__.py", line 658, in read
    next(iterator)
  File "/usr/lib/python3/dist-packages/Bio/SeqIO/Interfaces.py", line 72, in __next__
    return next(self.records)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/Bio/SeqIO/SeqXmlIO.py", line 447, in iterate
    parser.close()
  File "/usr/lib/python3.11/xml/sax/expatreader.py", line 240, in close
    self.feed(b"", isFinal=True)
  File "/usr/lib/python3.11/xml/sax/expatreader.py", line 217, in feed
    self._parser.Parse(data, isFinal)
  File "../Modules/pyexpat.c", line 416, in StartElement
  File "/usr/lib/python3.11/xml/sax/expatreader.py", line 369, in start_element_ns
    self._cont_handler.startElementNS(pair, None,
  File "/usr/lib/python3/dist-packages/Bio/SeqIO/SeqXmlIO.py", line 163, in startEntryFieldElement
    return self.startPropertyElement(attrs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/Bio/SeqIO/SeqXmlIO.py", line 339, in startPropertyElement
    record = self.records[-1]
             ~~~~~~~~~~~~^^^^
IndexError: list index out of range

while on testing I have the expected result:

$ python3 -c "from Bio import SeqIO; print(SeqIO.read('check_me.xml', 'seqxml').id)"
AL031232

Ouch, by the way, I had a look back at my build log from yesterday, and I realized my gcc upgrade attempt didn't apply cleanly, so this is still a variable to consider.

Progress for sure - and the SAX expat C code is involved somehow.

Looks like the NixOS people ran into this as well: NixOS/nixpkgs#292138

It took me a while, but I finally managed to run the test in a Debian testing environment with upgraded libgcc_s.so.1. The recipe was to upgrade the following packages in a lock step:

  • libgcc-s1
  • gcc-14-base
  • libstdc++6
  • libgomp1
  • libgfortran5

They all stem from Gcc 14 version 14-20240303. In the end, the SeqIO test passed in testing with upgraded Gcc libraries.

Just in case, I rechecked the status in Debian sid, in which I still witness the failure. Now we know the issue is not from Gcc 14-20240303.

I must be very tired for not having looked at libexpat1 earlier: the package looks pulled by the "essential set" and thus does not need explicit dependency set (which is why I completely missed it); also Python source code embeds a copy of expat, but Debian builds Python against the operating system's libexpat.so.1.

I do reproduce the issue by upgrading testing's libexpat.so.1 from 2.5.0 in testing to sid's 2.6.1. I believe the issue is not in Biopython, but in the interaction between Python and the newer expat library.

Debian CI is tripping on Python test errors following libexpat 2.6.1 upgrade in sid, suggesting it won't make it to testing until tests are resolved in Python, fixing possible the Biopython issue at the same time. From Python test attempt against the newer expat:

2444s 0:25:01 load avg: 0.96 [414/427/1] test_xml_etree failed (2 failures)
2444s test test_xml_etree failed -- multiple errors occurred; run in verbose mode for details
2446s 0:25:03 load avg: 0.97 [415/427/2] test_xml_etree_c failed (2 failures)
2446s test test_xml_etree_c failed -- multiple errors occurred; run in verbose mode for details

I think it is safe to close the present issue as it is not a problem in Biopython per se.

Impressive detective work there. We can hope that Python's libexpat code gets fixed before too long (given you said the Python tests are failing), so hopefully nothing for the Biopython to do here. Thank you all!