rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics

Home Page: https://rapidfuzz.github.io/RapidFuzz/


Issues packaging with cx_freeze

NaaN108 opened this issue

I'm currently trying to package an application that uses rapidfuzz with cx_freeze. The packaging is successful, but when I try to run the application I get the following error.

implementation requires numpy to be installed

I'm using process.cdist, and I understand numpy is required for the matrix output. However, I also import numpy at the beginning of the module that calls rapidfuzz, without any issue, so the dependency must be there.

Basically, what I would like to understand is whether there is a way to test which submodules of numpy are required but are failing to import.

Below is my setup.py

import sys
from cx_Freeze import setup, Executable
from setuptools import find_packages

options = {
    "build_exe": {
        "zip_include_packages": ["*"],
        "zip_exclude_packages": [],
        "build_exe": "dist\\",
        "includes": [
            "numpy",
            "numpy.int16",
            "numpy.int64",
            "_pytest._argcomplete",
            "_pytest._code.code",
            "_pytest._code.source",
            "_pytest._io.saferepr",
            "_pytest._io.terminalwriter",
            "_pytest._io.wcwidth",
            "_pytest._version",
            "_pytest.assertion.rewrite",
            "_pytest.assertion.truncate",
            "_pytest.assertion.util",
            "_pytest.cacheprovider",
            "_pytest.capture",
            "_pytest.compat",
            "_pytest.config.argparsing",
            "_pytest.config.compat",
            "_pytest.config.exceptions",
            "_pytest.config.findpaths",
            "_pytest.debugging",
            "_pytest.deprecated",
            "_pytest.doctest",
            "_pytest.faulthandler",
            "_pytest.fixtures",
            "_pytest.freeze_support",
            "_pytest.helpconfig",
            "_pytest.hookspec",
            "_pytest.junitxml",
            "_pytest.legacypath",
            "_pytest.logging",
            "_pytest.main",
            "_pytest.mark.expression",
            "_pytest.mark.structures",
            "_pytest.monkeypatch",
            "_pytest.nodes",
            "_pytest.nose",
            "_pytest.outcomes",
            "_pytest.pastebin",
            "_pytest.pathlib",
            "_pytest.pytester",
            "_pytest.pytester_assertions",
            "_pytest.python",
            "_pytest.python_api",
            "_pytest.python_path",
            "_pytest.recwarn",
            "_pytest.reports",
            "_pytest.runner",
            "_pytest.scope",
            "_pytest.setuponly",
            "_pytest.setupplan",
            "_pytest.skipping",
            "_pytest.stash",
            "_pytest.stepwise",
            "_pytest.terminal",
            "_pytest.threadexception",
            "_pytest.timing",
            "_pytest.tmpdir",
            "_pytest.unittest",
            "_pytest.unraisableexception",
            "_pytest.warning_types",
            "_pytest.warnings",
            "py._builtin",
            "py._path.local",
            "py._io.capture",
            "py._io.saferepr",
            "py._io.terminalwriter",
            "py._xmlgen",
            "py._error",
            "py._std",
            # builtin files imported by pytest using py.std implicit mechanism
            "argparse",
            "shlex",
            "warnings",
            "types",
            "rapidfuzz.utils_cpp",
            "rapidfuzz.utils_py",
            "rapidfuzz.process_py",
            "rapidfuzz.fuzz_py",
            "rapidfuzz.distance.Hamming_py",
            "rapidfuzz.process_cpp",
            "rapidfuzz.fuzz_cpp",
            "rapidfuzz.distance.Levenshtein_cpp",
            "rapidfuzz.distance.Levenshtein_py",
            "rapidfuzz.string_metric_cpp",
            "rapidfuzz.string_metric_py",
            "jinja2.ext",
            "jinja2",
        ],
        "include_files": ["tests/"],
    }
}


with open("README.md", "r") as f:
    LONG_DESCRIPTION = f.read()

setup(
    name="centralized_integrations",
    version="0.1",
    description="xxx",
    long_description=LONG_DESCRIPTION,
    long_description_content_type="text/markdown",
    author="xxx",
    author_email="xxx",
    url="xxx",
    license="BSD 3-Clause License",
    packages=find_packages(exclude=["ez_setup"]),
    options=options,
    include_package_data=True,
    executables=[Executable("cli/main.py", base=None)],
    entry_points="""
        [console_scripts]
        cli = cli.main:main
    """,
)

Just an update: after downgrading rapidfuzz to version 2.0.0, everything ran successfully. I'm not closing the issue in case anyone has an idea of what the cause is, in which case I'd be happy to try and fix it. However, if the maintainer feels it should be closed, I fully understand.

Since this is an issue with the latest version, it should not be closed. There are two things you could do to understand the issue better:

  1. instead of from rapidfuzz.process import cdist, use from rapidfuzz.process_cdist_cpp import cdist on the latest version to see the exact error it raises when importing (I should really add this to the error message as well); a small probe sketch follows this list
  2. try the versions between 2.0.0 and the latest to find out which one introduced the issue
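
For the first point, something along these lines, run inside the frozen application, should surface the underlying error. This is only a minimal probe sketch, and the module names are the ones mentioned in this thread (they may differ between rapidfuzz versions):

import importlib
import traceback

# probe the modules rapidfuzz chooses between at import time; the names below
# are examples taken from this thread, not a definitive list
for name in ["numpy", "rapidfuzz.process_cdist_cpp", "rapidfuzz.process_cdist_py"]:
    try:
        importlib.import_module(name)
        print(f"{name}: imported successfully")
    except ImportError:
        print(f"{name}: failed to import")
        traceback.print_exc()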

OK, so I've tracked down the breaking change to version 2.1.1. I then tried changing the import to from rapidfuzz.process_cdist_cpp import cdist, which actually worked. I also tried from rapidfuzz.process_cdist_py import cdist, which worked as well. So errors only seem to be thrown when running from rapidfuzz.process import cdist, which would suggest the issue has to do with the fallback_import function.

OK, I've found the issue. There's nothing wrong with the codebase, but the addition of fallback_import breaks cx_freeze's dynamic import detection, meaning the imports need to be specified manually, in this case rapidfuzz.process_cdist_py and rapidfuzz.process_cdist_cpp. However, since the default import error message is "implementation requires numpy to be installed", this is what was returned even though numpy was available. After specifying the two aforementioned modules in setup.py everything was successful.
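
For anyone hitting the same problem, the fix boils down to listing the dynamically loaded implementation modules explicitly in the cx_freeze options. A minimal sketch, using the module names from this thread (they may differ between rapidfuzz versions):

# cx_freeze cannot follow rapidfuzz's fallback_import, so the implementation
# modules it selects at runtime have to be listed manually
options = {
    "build_exe": {
        "includes": [
            "rapidfuzz.process_cdist_cpp",
            "rapidfuzz.process_cdist_py",
        ],
    }
}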

This part of the code will need to be changed again in one of the next versions, since there is still an issue with the pure Python fallback: currently the library would still fail to install if one of the C dependencies fails to install. The current plan is to split the package into two separate packages: rapidfuzz and rapidfuzz-cpp. It will then be possible to install either rapidfuzz or rapidfuzz[speedup]. rapidfuzz[speedup] means it should always install the faster version, and it is an error if it fails to install. rapidfuzz alone means it should install the faster implementation if possible, but fall back to the slower version if this is not possible. Unfortunately there is no feature like this in Python packaging, so the best I can do for this version is to install rapidfuzz-cpp when I know it provides wheels for the platform.
Afterwards I would like to remove this fallback_import behavior again, since it should be better for a user to specify rapidfuzz[speedup] as a dependency instead of setting a random environment variable. @Rongronggg9 what is your take on this, since you're likely the only user of this environment variable. Would this be sufficient?
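
To make the proposal concrete, here is a rough sketch of what the split could look like in setup.py terms. The package and extra names follow the plan above, but the exact layout is an assumption, not the shipped configuration:

from setuptools import setup

setup(
    name="rapidfuzz",
    # the pure Python implementation works without the compiled package
    install_requires=[],
    extras_require={
        # "pip install rapidfuzz[speedup]" would fail loudly if the faster
        # C++ implementation cannot be installed
        "speedup": ["rapidfuzz-cpp"],
    },
)

With this layout, a plain pip install rapidfuzz would never pull in rapidfuzz-cpp on its own, which is exactly the gap the "optional but default dependency" discussion below is about.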

In addition it would probably make sense to change the cdist import to something like:

try:
    from rapidfuzz.process_cdist_cpp import cdist
except ImportError as e:
    # stub keeps the import working but surfaces the original error on call
    def cdist(*args, **kwargs):
        raise NotImplementedError() from e

> Afterwards I would like to remove this fallback_import behavior again, since it should be better for a user to specify rapidfuzz[speedup] as a dependency instead of setting a random environment variable.

I am not opposed to this. But I doubt whether pure-Python rapidfuzz still fits its name... It is a huge breaking change. If someone never specifies a specific version of rapidfuzz in their dependencies, then after they set up a new environment, a "rapid" dependency slows things down...

> after they set up a new environment, a "rapid" dependency slows things down...

On platforms like Linux x86_64 I know that there is a wheel for rapidfuzz-cpp, so I will always install it there. This only affects more niche platforms without official wheel support, e.g. OpenBSD. However, the build currently requires e.g. oldest-supported-numpy, and I do not know whether pip install oldest-supported-numpy even works on OpenBSD.
I am open to different suggestions though.

The general goals with this split are:

  1. currently, the installation fails when the installation of the build dependencies (cmake, ninja, oldest-supported-numpy) fails, even though those dependencies are only needed to build the faster implementation
  2. make the forced installation of the faster version simpler

My preferred solution would be something along the lines of:

optional_install_required = ["rapidfuzz-cpp"]

which would try to install the faster implementation but ignore any installation failure. Maybe something like https://stackoverflow.com/a/53568049/11335032 would work better.
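
For context, the linked answer amounts to a post-install hook that shells out to pip for the optional dependency and ignores failures. A rough illustrative sketch of that idea (not something rapidfuzz ships today):

import subprocess
import sys

from setuptools import setup
from setuptools.command.install import install


class InstallWithOptionalSpeedup(install):
    """Install command that tries to add the optional C++ package afterwards."""

    def run(self):
        install.run(self)
        try:
            # nested pip run; failures are swallowed so the pure Python
            # implementation remains usable
            subprocess.check_call(
                [sys.executable, "-m", "pip", "install", "rapidfuzz-cpp"]
            )
        except subprocess.CalledProcessError:
            pass


setup(
    name="rapidfuzz",
    cmdclass={"install": InstallWithOptionalSpeedup},
)

The nested pip run in the middle is precisely the part criticized below.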

OK then.

There is still something that can be improved. Practically speaking, speedup is not a good extras_require name, since the only difference it makes is to throw an error if the cpp implementation fails to install; it itself does not speed anything up. Maybe ensure_speed or ensure_cpp would be better.
Pip comes with no "optional but default dependency" support, and the workarounds in https://stackoverflow.com/a/53568049 are all dirty hacks that should not be taken into production. In production, all dependencies should be installed at once, since: 1) pip cannot ensure that parallel or nested pip runs won't break anything, and 2) pip would not fail if the current installation breaks the dependency relations of previously installed packages. Even though both the package and its "optional but default dependency" are under your control, so the dirty-hacking behavior would probably not break anything, I am still not in favor of this workaround.

Yes, you are probably right. I really hope pip adds "optional but default dependency" support at some point. The best we can do without this hack is to ensure the faster implementation on major platforms, which have official wheel support, and on other platforms rely on users either using something like ensure_cpp or installing from a different package manager that provides them with wheels / the dependency (e.g. the packages listed at https://pkgs.org/search/?q=rapidfuzz could add a dependency on python-rapidfuzz-cpp).

One solution that should work (but would cause quite a bit of extra work for me) would be the addition of optional packages like e.g. optional-cmake, which would just install an empty package if the installation fails. This would be reasonably simple for the cmake and ninja packages, but would be a major pain with the numpy dependency. For cmake + ninja it would probably be possible to install them if wheels are available and otherwise rely on system cmake, since installing cmake from an sdist requires a system cmake anyway.

I am generally pretty unhappy with the numpy dependency and am still contemplating whether I should write my own matrix implementation in v3.0.0, which a user could then convert using np.asarray.
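
For illustration, a result type can stay numpy-free while remaining convertible with np.asarray by implementing the __array__ protocol. A minimal sketch of the idea, not rapidfuzz's actual implementation:

class ScoreMatrix:
    """Plain Python container for a score matrix."""

    def __init__(self, rows):
        self._rows = rows  # list of lists of scores

    def __array__(self, dtype=None):
        # numpy is only imported when a caller actually converts, so the
        # library itself does not depend on it
        import numpy as np
        return np.array(self._rows, dtype=dtype)


# caller side: np.asarray picks up the __array__ protocol automatically
import numpy as np

m = ScoreMatrix([[100.0, 50.0], [25.0, 0.0]])
print(np.asarray(m))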

Edit: removing the numpy build dependency in a backwards compatible manner proves to be very simple: #239. This has the additional advantage that I can finally release wheels for musllinux.

@Rongronggg9 now that I have removed the Cython build dependency and made the installation of cmake/ninja more optional, I think we should be fine.
@NaaN108 Can you retry this with the latest release? It still uses the fallback, but gets rid of the try/except around it for cdist. Does this improve things?

This should produce a more obvious error now.