wayfair-incubator / extra-model

Code to run the ExtRA algorithm for unsupervised topic/aspect extraction on English texts.

Eliminate non-pure Python dependencies without wheels

jamescurtin opened this issue

Currently, extra-model depends on pycld2==0.31, cytoolz==0.9.0, and spacy==2.0.18. These packages either directly or indirectly use C extensions that are not shipped as wheels. As a result, gcc is required to install extra-model so that these dependencies can be built from source.
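To make the "no wheel" problem concrete, here is a minimal sketch of how one might classify the files published for a release. It assumes you already have the list of distribution filenames for a pinned version (for example, copied from the package's download page). The function names and the `example_pkg` filenames are hypothetical and only for illustration.

```python
def is_wheel(filename: str) -> bool:
    """A built distribution (wheel) always ends in .whl."""
    return filename.endswith(".whl")


def needs_source_build(filenames) -> bool:
    """Return True when none of the release files is a wheel.

    In that case pip has to fall back to the source distribution,
    which for a C extension means a compiler such as gcc is required.
    """
    return not any(is_wheel(name) for name in filenames)


def wheel_tags(filename: str):
    """Split a wheel filename into (python tag, abi tag, platform tag).

    Wheel filenames follow the pattern
    name-version[-build]-python-abi-platform.whl,
    so the last three dash-separated fields are the compatibility tags.
    """
    stem = filename[: -len(".whl")]
    python_tag, abi_tag, platform_tag = stem.split("-")[-3:]
    return python_tag, abi_tag, platform_tag


# Hypothetical release file lists, for illustration only.
sdist_only = ["example_pkg-1.0.tar.gz"]
with_wheel = [
    "example_pkg-1.0.tar.gz",
    "example_pkg-1.0-cp38-cp38-manylinux1_x86_64.whl",
]

print(needs_source_build(sdist_only))
print(needs_source_build(with_wheel))
print(wheel_tags(with_wheel[1]))
```

Note that a wheel existing is not sufficient on its own: its platform tag also has to match the target image, which is why the tags are worth inspecting.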

Ideally, we would eliminate any dependency on gcc. Images deployed to production would then (1) be smaller, (2) build faster, and (3) be more secure. Additionally, users of extra-model would be less likely to encounter installation errors caused by missing C libraries.

It should be possible to eliminate the dependency on gcc with the following changes:

  1. cytoolz: Neither cytoolz nor toolz is used in the codebase (perhaps an old dependency that was never cleaned up?). We can remove this package from the requirements file.
  2. pycld2: This project hasn't been updated since 2019. If upgrading to cld3 would be acceptable (see the difference between cld2 and cld3), we could use pycld3 as a drop-in replacement. pycld3 provides wheels and is actively maintained.
  3. spacy: Newer releases of spacy eliminate the offending dependencies. There is already a PR (#54) that updates spacy to a compatible version.
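The claim in item 1 (that cytoolz and toolz are never imported) can be checked mechanically. Below is a rough sketch using only the standard library. It assumes the import name matches the package name, which is not true for every package, and it only sees static `import` statements, not dynamic imports. The helper names and the sample source string are hypothetical.

```python
import ast


def imported_modules(source: str) -> set:
    """Collect the top-level module names imported by a Python source string."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                found.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # level == 0 means an absolute import; relative imports are skipped.
            if node.module and node.level == 0:
                found.add(node.module.split(".")[0])
    return found


def unused_dependencies(sources, candidates):
    """Return the candidate module names that no source imports, sorted."""
    used = set()
    for source in sources:
        used |= imported_modules(source)
    return sorted(set(candidates) - used)


# Hypothetical module contents, for illustration only.
sample_sources = ["import spacy\nfrom pycld2 import detect\n"]
print(unused_dependencies(sample_sources, ["cytoolz", "toolz", "pycld2"]))
```

In practice the `sources` list would be built by reading every `.py` file in the repository. A name reported as unused is a candidate for removal from the requirements file, subject to a manual check for indirect use.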

Once these changes are made, we can start using the slim-buster Docker image instead of buster. The slim version is substantially smaller (112MB vs. 875MB) and doesn't contain gcc, which replicates a desirable production environment.

I'll put up a draft PR to demonstrate. Once #54 is merged, I will update the PR to use the slim image.