polm / fugashi

A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Clean up setup.py

garfieldnate opened this issue · comments

Hi,

this might be the same as #42 and #44, but the platform is different. I'm unable to load the .so file through fugashi when running from the base Docker image for Python 3.10. Minimal breaking example below:

Dockerfile:

FROM python:3.10.1-bullseye
RUN pip install -vvv fugashi
RUN echo "import fugashi" | python

Output of running docker build 2>&1 ..

If you know a quick workaround for this, I'd be happy to use that for now. I'd be happy to help if you need a hand with this, as well.

Quick workaround:

RUN apt-get update

This ensures that libmecab-dev can be downloaded directly instead of needing to be compiled.

That fixes the issue for me, though I still wonder why compiling from source did not work properly.

Actually, I think the more important issue is that installing the module didn't fail; it would have been better if this were an install-time failure, rather than a runtime failure.

I see the error you're getting but I'm also not sure what would cause it. It might be something simple like needing to run ldconfig after compiling the library.

Because shared libs are resolved at runtime in general, there's no reasonable way to make this an install failure instead of a runtime one.

Sorry to bother you with this; I think now that I was simply running out of memory during the compilation phase. But I'm still unsure why the build completed successfully without finding mecab. Wouldn't it be beneficial to raise an error at the end of check_libmecab in fugashi_util.py instead of returning empty lists?

Wouldn't it be beneficial to raise an error at the end of check_libmecab in fugashi_util.py instead of returning empty lists?

I guess, but I'm not sure because that means you couldn't ever run setup.py without a working installation.

Sorry to bother you with this if this seems like a silly issue 😅

you couldn't ever run setup.py without a working installation.

Exactly. Maybe I want it to be more black-boxy than you intended, but the way I expect pip install fugashi to work is that 1) before install I do not have mecab and have no idea how to get mecab, and 2) after install everything works and I don't have to worry about whether mecab is installed. I definitely do not expect 3) fugashi installed successfully but actually the mecab install failed so it doesn't work.

Exactly. Maybe I want it to be more black-boxy than you intended, but the way I expect pip install fugashi to work is that 1) before install I do not have mecab and have no idea how to get mecab, and 2) after install everything works and I don't have to worry about whether mecab is installed. I definitely do not expect 3) fugashi installed successfully but actually the mecab install failed so it doesn't work.

Fugashi already functions this way IMO, this is a bug in the installation process.

In theory, all routes seen in fugashi_util.py attempt to ensure a running configuration is produced upon install. mecab_config_linux_build is passing an rpath arg to extra_link_args which attempts to ensure that the built python extension will be able to find libmecab.so.

It seems broken however, and it looks like mecab_config_linux_build() is a rarely used and untested older codepath, which is why it's only shown up now (in the gap between python 3.10 release and a fugashi 3.10 wheel being released).

Some notes after digging into this issue:

  • In python 3.9.9, this Dockerfile installs libmecab in ~/.local/lib/mecab (with ~ being /root)
  • In python 3.10.1, this Dockerfile installs libmecab in /usr/local/lib/python3.10/site-packages/root/.local/lib/mecab
    • When pip is run with --user, this becomes ~/.local/lib/python3.10/site-packages/root/.local/lib/mecab
    • When pip is run as non-root with --user, this becomes ~./local/lib/python3.10/site-packages/home/$USER/.local/lib/mecab

So mecab_config_linux_build()'s behavior on 3.10 is apparently extra broken, and its behavior prior to 3.10 was to install a library on the user's behalf in non-python-managed directories, plus since the extra-link-args are erroneous in some way, it doesn't even produce a working configuration. I'm unsure if this has always been the case since its introduction or if it broke on a python or setuptools update at some point.

mecab_config_debian_user() does the same thing, just taking advantage of apt-get download rather than building the library from source.

And finally, mecab_config_debian_root() quite literally just installs a library on the user's machine using apt-get (this was what made it finally work when @garfieldnate ran apt-get update in the Dockerfile).

In my mind, there are two issues here:

  1. mecab_config_linux_build() and mecab_config_debian_user() don't produce working configurations, despite allowing the install to continue.
  2. Prior to Python 3.10, all three of these make unclear-to-the-user modifications to the user's system that aren't contained in python-relevant directories, with mecab_config_debian_root() still doing so even on 3.10. mecab_config_debian_root()'s modifications aren't even undone when pip uninstall is run, as it performs an apt-get installation, which pip has no way to track in .egg-info or .dist-info!

Honestly I think the solution to this is just to remove all three of these rarely-used codepaths and simply have users install mecab with their system package manager if installing the package from source.

For mecab_config_debian_root() I strongly dislike the idea of a pip install operation installing something using my system package manager.

For mecab_config_linux_build() and mecab_config_debian_user(), they're rarely used enough that this problem wasn't noticed until now (when wheels are not yet released for Python 3.10) and both could be replaced by simply asking users to install mecab with their system package manager.

It should be theoretically possible to fix mecab_config_linux_build() and mecab_config_debian_user(), but they fall into the category of codepaths that are prone to bit rot and aren't used often enough that they're worth it IMO.

For the record, these were all added in #8, so that may be helpful reference. It's true that some of these are rarely used.

What we can do is cut out most of these and leave the responsibility for installing mecab up to the user. As part of that, if no working configuration can be found, we can throw an exception instead of returning empty lists.

My concern with setup.py throwing an exception was that maybe there was some informational command, like setup.py info or something, that could be useful even without a working setup. But I can't think of anything like that and haven't found one poking around so I guess it's not worth worrying about.

Test release with these changes is up as 1.1.2a4.

If the automation aspect is removed from fugashi, I think it would be a good idea to update the readme with more detailed install instructions (minimum apt-get update && apt-get install libmecab-dev). It might be gross to change the user's environment, but it's also pretty convenient for many users XD. There might also be a middle ground where we prompt or use some optional install parameters to determine if mecab should be installed.

Fugashi will still install automatically for the vast, vast majority of users, to be clear, at least once the Python 3.10 wheels are pushed to PyPI.

That said, probably still not a terrible idea to have at least some mention of the install failure state and an example of installing mecab, like Nate mentions.

Found an issue with the new change - it breaks the sdist upload. Creating an sdist uses setup.py but doesn't require a working MeCab env. So I'll need to hack a workaround for that.

Hey, I am back working on this again. I should be able to resolve the sdist issues and make a release soon.

One extra issue that came up is getting support for OSX arm64 / M1 processors. I made a test build but I don't have any way of testing it myself; if either of you does, it'd be a big help if you could take a look at #55 and give it a quick test.

Sorry, I only have old Macs :( Thank you for working on this, though!

Changed the title of this issue to reflect what actually ended up happening.

The latest release, 1.1.2, incorporates the changes discussed here, so I'll close this, but if I forgot anything let me know.

Also, for the record, you still can't run python setup.py sdist without a working MeCab installation. But since few people need sdist and it's not super useful without MeCab anyway, I just made sure the CI installs it.