castorini / pyserini

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.

Home Page: http://pyserini.io/

Feature request: docker build for portability

sueszli opened this issue

Machine specifications

  • Model Name: MacBook Pro
  • Model Identifier: Mac14,10
  • Chip: Apple M2 Pro
  • Total Number of Cores: 12 (8 performance and 4 efficiency)
  • Memory: 16 GB
  • System Firmware Version: 10151.1.1
  • OS Loader Version: 10151.1.1
  • openjdk 20.0.1 2023-04-18, OpenJDK Runtime Environment (build 20.0.1+9-29), OpenJDK 64-Bit Server VM (build 20.0.1+9-29, mixed mode, sharing)
  • Python 3.11.6
  • conda 23.7.4

Bug

The following were my unsuccessful attempts at installing Pyserini.

  1. Direct install:
    • “nmslib” doesn’t support Apple Silicon / ARM chips. None of the workarounds I found (e.g. CFLAGS="-mavx -DWARN(a)=(a)" pip install --use-pep517 nmslib) worked for me.
    • My manual tests with Pyserini were always correct, but not reproducible.
    • When running the unit tests without the “nmslib” library, the JVM-Python interop broke. I ran 298 tests in 187.622s (failures=6, errors=67).
  2. Anaconda environment:
    • I managed to reproduce the majority of manual tests from the docs, but the unit test cases (python -m unittest) always either failed or timed out.

    • I get two mysterious stack guard page warnings from the JVM every single time I run any Pyserini command:

      [0.002s][warning][os,thread] Attempt to protect stack guard pages failed (0x0000000169000000-0x000000016900c000).
      [0.002s][warning][os,thread] Attempt to deallocate stack guard pages failed.
      

To clarify: I'm able to install and use Pyserini, but it frequently crashes and always prints the stack guard warnings.

Feature request

The current build process relies on very specific:

  • language runtimes (both Java and Python)
  • environments (Conda for Python; otherwise it doesn't build on ARM-based Macs)
  • compiled binaries from external libraries that you have to manually move (Anserini)

I believe that writing a single Dockerfile could resolve all these issues.

-> Feel free to assign the implementation to me if you think it's reasonable.

Update: Most errors seem to be from "pyserini/encode":

Traceback (most recent call last):
  File "/opt/homebrew/anaconda3/envs/pyserini/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/homebrew/anaconda3/envs/pyserini/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/sueszli/dev/pyserini/pyserini/encode/__main__.py", line 21, in <module>
    from pyserini.encode import DprDocumentEncoder, TctColBertDocumentEncoder, AnceDocumentEncoder, AggretrieverDocumentEncoder, AutoDocumentEncoder, CosDprDocumentEncoder
ImportError: cannot import name 'CosDprDocumentEncoder' from 'pyserini.encode' (/Users/sueszli/dev/pyserini/pyserini/encode/__init__.py)
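An ImportError like this one means the module that got imported does not define one of the requested names. A quick way to see which names are missing is to import the module and check each attribute. The following is a minimal sketch using only the standard library; `missing_exports` is a hypothetical helper, not a Pyserini function.

```python
import importlib


def missing_exports(module_name, names):
    """Return the names that are not attributes of the imported module."""
    module = importlib.import_module(module_name)
    return [name for name in names if not hasattr(module, name)]


# Demonstration against a standard-library module:
print(missing_exports("json", ["dumps", "loads", "not_a_real_name"]))

# For the traceback above, the call would look like this (requires a working
# Pyserini install, so it is left commented out):
# missing_exports("pyserini.encode", ["DprDocumentEncoder", "CosDprDocumentEncoder"])
```

If the list is non-empty, the `__init__.py` that Python actually loaded (its path is printed in the ImportError) does not export those names.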

i taught myself docker to test this.

using docker turned out to be a bad idea because:

  • docker is not free in a commercial setting

  • learning docker takes time and it adds additional overhead to the team

  • it's difficult to interact with the container:

    • you either have to be in the docker shell inside the isolated container: docker exec -it
    • or you have to build a dedicated network API to send requests over localhost
  • you can't compile anserini inside docker with java 11 (i tried all kinds of jdks) – see: castorini/anserini#2282 (comment)
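To give a sense of the overhead behind the "dedicated network API" option above, here is a minimal sketch of a localhost search endpoint built only on the Python standard library. Everything in it is an assumption for illustration: `fake_search` is a hypothetical stand-in for whatever searcher would run inside the container, and the `/search` path with `q` and `k` parameters is made up.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import parse_qs, quote, urlparse
from urllib.request import urlopen


def fake_search(query, k=10):
    """Hypothetical stand-in for a real search call inside the container."""
    return [
        {"docid": f"doc{i}", "score": 1.0 / (i + 1), "query": query}
        for i in range(k)
    ]


class SearchHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        url = urlparse(self.path)
        if url.path != "/search":
            self.send_error(404)
            return
        params = parse_qs(url.query)
        query = params.get("q", [""])[0]
        k = int(params.get("k", ["10"])[0])
        body = json.dumps(fake_search(query, k)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging


def start_server(host="127.0.0.1", port=0):
    """Start the server in a background thread; port=0 picks a free port."""
    server = ThreadingHTTPServer((host, port), SearchHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def query_server(server, q, k=3):
    """Client side: send one search request and decode the JSON reply."""
    host, port = server.server_address[:2]
    with urlopen(f"http://{host}:{port}/search?q={quote(q)}&k={k}") as resp:
        return json.loads(resp.read())
```

Inside a container the server would have to bind to 0.0.0.0 instead of 127.0.0.1 to be reachable through a published port, and every operation you want to expose needs its own endpoint, which is the extra work the bullet above refers to.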

Also, if you're going to do development on Pyserini, it makes little sense to do it inside Docker. You can't avoid going through the pain of setting up your local Python env.