nikitakit / self-attentive-parser

High-accuracy NLP parser with models for 11 languages.

Home Page: https://parser.kitaev.io/

SEGV with spacy 3.0.3

bitPogo opened this issue

Hey,

First off, thanks for this awesome package and for the work that went into building it.
However, I am currently working on a small annotation project that uses spaCy (3.0.3) together with this repo (0.2.0), and I am experiencing segfaults. torch is up to date (1.7.1) and I am running Python 3.9.2.
I did not experience this behavior on spaCy 2.x.
It also affects the nltk variant; my usage is roughly what is sketched below.
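
For reference, this follows the standard benepar integration for spaCy 3 as documented in the README (a minimal sketch, not my exact project code; the small English pipeline is just a stand-in):

import benepar
import spacy

# spaCy 3 style: benepar is registered as a pipeline component by name.
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("benepar", config={"model": "benepar_en3"})

doc = nlp("Fly safely.")
sent = list(doc.sents)[0]
print(sent._.parse_string)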

I can't reproduce this:

$ conda create -n benepar-debug-env python=3.9
$ conda activate benepar-debug-env
$ pip install benepar spacy==3.0.3 torch==1.7.1
$ python
Python 3.9.1 (default, Dec 11 2020, 06:28:49)
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import benepar
>>> parser = benepar.Parser('benepar_en3')
>>> parser.parse('"Fly safely."')
Tree('TOP', [Tree('S', [Tree('``', ['``']), Tree('VP', [Tree('VB', ['Fly']), Tree('ADVP', [Tree('RB', ['safely'])])]), Tree('.', ['.']), Tree("''", ["''"])])])

Could you post a sequence of commands like the ones above that reproduces the issue, starting from a clean python install? Anaconda preferred if possible, but instructions for downloading python binaries from elsewhere are fine too.

I'll also need to know what OS you're on, and whether you have a GPU or not.

> uname -a
Linux xxx 5.10.0-1-amd64 #1 SMP Debian 5.10.4-1 (2020-12-31) x86_64 GNU/Linux
> echo $CUDA_VISIBLE_DEVICES
0

Actually your example fails for me.
I have a Docker container if it helps:

FROM debian:bullseye-slim

# ensure local python is preferred over distribution python
ENV PATH /usr/local/bin:$PATH
ENV LANG C.UTF-8
ENV CUDA_VISIBLE_DEVICES 0

# run install
RUN apt-get update
RUN apt-get -y dist-upgrade \
    && apt-get -y install build-essential checkinstall \
    && apt-get -y install libncursesw5-dev libssl-dev \
        libsqlite3-dev tk-dev libgdbm-dev libc6-dev libbz2-dev libffi-dev zlib1g-dev \
        wget liblzma-dev icu-devtools libicu-dev libicu67 git \
    && apt-get -y install python3.9 python3-pip python3-icu
COPY . /apc
WORKDIR /apc
RUN /usr/bin/python3 -m pip install --upgrade pip
RUN pip install --ignore-installed -r requirements.txt
RUN python3 -m spacy download en_core_web_sm
CMD ["python"]

requirements.txt:

pandas
numpy
cpython
pytest
parameterized
coverage
pylint
nltk
spacy>=3.0.3
PyICU>=2.6
mypy
pydantic
benepar>=0.2.0

I can also provide the valgrind output... but it's very verbose.
My guess is that something is not right with torch (but I haven't figured out what so far).

Have you tried using Anaconda instead of system (apt) python? Or running on CPU by setting something like CUDA_VISIBLE_DEVICES=-1?
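
For example, something along these lines should force the CPU path (just a sketch; the key point is that the environment variable must be set before torch is imported):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"  # hide all GPUs so torch falls back to CPU

import benepar

parser = benepar.Parser("benepar_en3")
print(parser.parse("Fly safely."))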

I don't use docker or have it installed, so it will take me some time to replicate what you posted above. In the meantime it would be helpful to know if there's a specific aspect of the configuration that's causing issues.

Also python 3.9.2 is less than a week old, so I wouldn't be surprised if the torch ecosystem has some issues with it. Actually I just did a search for torch segfaults on python 3.9, and found issues like this one. I suspect that dropping to python 3.8 or switching to the torch 1.8 release candidates might make the segfault go away.
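
As a quick sanity check, it would also help to see which interpreter and torch build the crashing process actually uses (a small diagnostic sketch):

import sys

import torch

print(sys.version)             # e.g. 3.9.2 vs. 3.8.x
print(torch.__version__)       # 1.7.1 vs. a 1.8 release candidate
print(torch.version.cuda)      # None for CPU-only builds
print(torch.cuda.is_available())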

Anaconda is not an option for several reasons. Playing around with CUDA did not help.

In the meantime I downgraded Python to 3.9.1+, but it also affects Python 3.8.7, since my CI is crashing too.
I also recall experiencing random core dumps while testing with spaCy and benepar, but I suspected pytest at the time.

I will try the torch RC later today. Also, the way the error occurs is fairly specific: it looks like resources are not freed correctly (this includes a double free). From the valgrind output, this part looks very suspicious:

==1847526== Invalid read of size 8
==1847526==    at 0x623CD1: PyThreadState_Clear (in /usr/bin/python3.9)
==1847526==    by 0x2F8C555B: pybind11::gil_scoped_acquire::dec_ref() (in /usr/local/lib/python3.9/dist-packages/torch/lib/libtorch_python.so)
==1847526==    by 0x2F8C5598: pybind11::gil_scoped_acquire::~gil_scoped_acquire() (in /usr/local/lib/python3.9/dist-packages/torch/lib/libtorch_python.so)
==1847526==    by 0x2FBED818: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) (in /usr/local/lib/python3.9/dist-packages/torch/lib/libtorch_python.so)
==1847526==    by 0x140FDECF: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==1847526==    by 0x4875EA6: start_thread (pthread_create.c:477)
==1847526==    by 0x4B29DEE: clone (clone.S:95)
==1847526==  Address 0x42c2f208 is 3,752 bytes inside an unallocated block of size 3,824 in arena "client"

But it is hard to tell, since:

==1847526== HEAP SUMMARY:
==1847526==     in use at exit: 21,573,170 bytes in 28,321 blocks
==1847526==   total heap usage: 1,151,331 allocs, 1,123,010 frees, 1,309,228,117 bytes allocated
==1847526== 
==1847526== LEAK SUMMARY:
==1847526==    definitely lost: 0 bytes in 0 blocks
==1847526==    indirectly lost: 0 bytes in 0 blocks
==1847526==      possibly lost: 3,283,716 bytes in 350 blocks
==1847526==    still reachable: 18,281,918 bytes in 27,961 blocks
==1847526==                       of which reachable via heuristic:
==1847526==                         stdstring          : 483,533 bytes in 6,447 blocks
==1847526==                         newarray           : 48,152 bytes in 91 blocks
==1847526==         suppressed: 7,536 bytes in 10 blocks
==1847526== Rerun with --leak-check=full to see details of leaked memory
==1847526== 
==1847526== Use --track-origins=yes to see where uninitialised values come from
==1847526== For lists of detected and suppressed errors, rerun with: -s
==1847526== ERROR SUMMARY: 157907 errors from 694 contexts (suppressed: 0 from 0)
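
For completeness, a Python-level traceback at the moment of the crash can also be captured with the standard-library faulthandler module (a sketch; the parse call just mirrors the example above):

import faulthandler
faulthandler.enable()  # dump the Python traceback of all threads on SIGSEGV

import benepar

parser = benepar.Parser("benepar_en3")
print(parser.parse("Fly safely."))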