nansencenter / DAPPER

Data Assimilation with Python: a Package for Experimental Research

Home Page: https://nansencenter.github.io/DAPPER


requirements.txt

patnr opened this issue · comments

I hate dependency management. I never quite learnt the ins and outs (who has?). Recently I started using poetry for my libraries, which is great, but I think DAPPER will stick with anaconda/setup.py for the foreseeable future (Why?). I don't want to use Anaconda's environment.yaml file because sometimes I like to try installing without Anaconda.

However, setup.py:install_requires only lists the direct dependencies, not the sub-dependencies. This means other installs may get different (untested) environments. Since DAPPER is an app, not a library/package, we need to freeze all dependencies.

  • So, whenever
    • we feel like it (i.e. for a new release)
    • or we add a new package to install_requires

we need to

  • Create a new, clean environment
  • Update versions in install_requires if desired
  • Install DAPPER
  • pip freeze > requirements.txt.
    Unfortunately this creates yet another file.
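The freeze step above could even be scripted. A minimal sketch (assuming it is run from inside the new, clean environment where DAPPER was just installed):

```python
import subprocess
import sys

# Run `pip freeze` with the current interpreter's pip, so we capture
# exactly the environment this script runs in.
frozen = subprocess.run(
    [sys.executable, "-m", "pip", "freeze"],
    capture_output=True, text=True, check=True,
).stdout

# Write the fully pinned environment to requirements.txt.
with open("requirements.txt", "w") as f:
    f.write(frozen)
```

Each line of the resulting file is an exact `name==version` pin (editable installs appear as `-e` lines instead).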

Also, we need to update the install instructions in README to use requirements.txt. Unfortunately, we cannot include the Python version in requirements.txt. This is frustrating.

And the version specifications in install_requires need revision. Should we pin them? The requirements.txt file will already use pinned versions, so it might not be necessary. Pinning would also require more manual editing when updating the direct dependencies.
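To make the trade-off concrete, here is a sketch of the alternatives for install_requires (the package names and versions are purely illustrative, not DAPPER's actual dependency list):

```python
# Unpinned: simplest to maintain, but installs drift over time.
install_requires_unpinned = [
    "numpy",
    "scipy",
]

# Exactly pinned: reproducible, but every upgrade is a manual edit,
# and it duplicates what requirements.txt already provides.
install_requires_pinned = [
    "numpy==1.19.4",
    "scipy==1.5.4",
]

# A middle ground: lower bounds only.
install_requires_floor = [
    "numpy>=1.17",
    "scipy>=1.4",
]
```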

@yumengch Any thoughts?

Main reference: https://stackoverflow.com/a/33685899

Edit:
Instead of requirements.txt, requirements-dev.txt, etc., maybe the pinned lists could somehow be included in setup.py as well?

Edit:
This is probably a more useful reference: https://news.ycombinator.com/item?id=11210370
The discussion predates pipenv/poetry, but that's ok here.

I'm not very familiar with deployment so far, but a quick Google search leads me to pipenv.

If I understand it correctly, pipenv can lock the dependencies (including transitive dependencies), create a Python virtual environment automatically (avoiding conda envs), and pin the Python version. The functionality feels quite similar to Poetry's.
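For illustration, pipenv's lockfile (Pipfile.lock) is just JSON mapping every package, transitive ones included, to an exact version. A toy sketch of reading the pins out of such a file (the contents below are made up, not a real lock):

```python
import json

# Made-up excerpt in the shape of a Pipfile.lock "default" section.
# Note that cycler, a sub-dependency of matplotlib, is pinned too.
lock = json.loads("""
{
  "default": {
    "numpy":      {"version": "==1.19.4"},
    "matplotlib": {"version": "==3.3.3"},
    "cycler":     {"version": "==0.10.0"}
  }
}
""")

# Extract name -> exact version for every locked package.
pins = {name: meta["version"].lstrip("=")
        for name, meta in lock["default"].items()}
print(pins)
```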

Anyway, using any of poetry/pipenv/conda requires an additional installation, though I think Poetry and pipenv can be installed via pip without too much effort. I personally feel pipenv/Poetry are very powerful, and many well-known packages are now distributed via conda, such as PyTorch or NCL. I guess people who use DAPPER won't find it a problem to install any of them, but we would have to document how to use them.

Using requirements.txt might be enough for DAPPER, at least for the time being, and it's a bit cleaner since it avoids new packages.

We need to make a compromise in the end (actually, I think we can have all of them if you want), but it is also up to your taste.

A side note: I notice that pip 20.3 is introducing a dependency resolver, but I am not sure whether it will lead to the deterministic dependencies you may want.

Pipenv is just like Poetry, except it doesn't automate PyPI uploads. I also find Poetry to be better overall. I've read a lot around this issue, so although I cannot be very specific, I would favour Poetry over pipenv.

However, I think for DAPPER (which is an app, not a library) Anaconda + pip is the better choice. Mostly because when creating an environment with poetry (or pipenv), and then installing DAPPER, I did not get an interactive backend.

Anyway, pipenv/poetry/anaconda is an additional installation either way, as you say, so there's nothing to be gained there. Note, however, that they should all be installed outside of pip.

Also note that DAPPER should usually be installed in editable (-e) mode, not via a plain pip install (although I'm trying to get the DAPPER name available on PyPI).

It might be preferable to start recommending miniconda instead. Not sure.

pip 20.3 is introducing a dependency resolver

Interesting. I'm not sure if it will have an impact, but it's worth looking into.

Having thought a little more about it, here is my (still somewhat uncertain) thinking.

As an app (in contrast to a library) DAPPER should not worry much about compatibility with other packages (except for its own dependencies 😛) and their versions. Thus, it can (and so should) pin (all) its dependencies, e.g. pip freeze > requirements.txt or poetry/pipenv > [some lockfile].

That being said, there's the complication (on which I could not find much advice) that we do want compatibility with Colab. Now, in Colab you can do !pip install some-package, but this doesn't work for matplotlib [1], and so we must use Colab's mpl version. Therefore, so far, I have pinned mpl. But now I think it's best to merely use >=, so that upgrades to Colab (which are not widely announced) don't necessarily break the installation of DAPPER. This also allows using newer mpl elsewhere, which tends to help with supporting newer Pythons.
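The difference between the two specifiers can be sketched with a tiny version check (a hand-rolled comparison for illustration only; real resolvers use proper specifier parsing, and the Colab versions below are hypothetical):

```python
def as_tuple(version):
    """Parse a simple 'X.Y.Z' version string into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))

def satisfies_floor(installed, minimum):
    """Emulate 'pkg>=minimum': any version at or above the floor is fine."""
    return as_tuple(installed) >= as_tuple(minimum)

# Suppose we had pinned matplotlib==3.2.2 to match Colab: then a silent
# Colab upgrade to 3.3.0 would break DAPPER's installation there.
# With matplotlib>=3.2.2 instead, both versions remain acceptable:
print(satisfies_floor("3.2.2", "3.2.2"))  # the floor itself
print(satisfies_floor("3.3.0", "3.2.2"))  # a later Colab upgrade
print(satisfies_floor("3.1.0", "3.2.2"))  # too old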

Another thing I find useful for Colab is to only pin the top-level dependencies, because many of the sub-dependencies will already be installed and re-installing them -- while less troublesome than for mpl -- can take a long time.

There is also a degree to which DAPPER can be considered a library. For example, besides Colab, it is used by DA-tutorials. Also, the case of just using it for some scripts (like those in examples/) is envisaged by the installation instructions.

All in all, while we should mainly pin, we cannot do so as much as a true lockfile solution would do. I.e. we won't get exact reproducibility, but it will be pretty close, and hopefully a good compromise.

A related aspect is that distribution via PyPI (or more generally, being able to do pip install dapper) requires packaging, which requires setup.py (or pyproject.toml) to "build" the package distribution before uploading.

Usually, setup.py only lists top-level dependencies [2], which are not pinned. However, since we mainly want to pin, but won't be using a lockfile, we might as well pin directly in setup.py.

Upgrades to these top-level dependencies can be carried out by removing their pinning, re-installing DAPPER (pip now does dependency resolution!), and then re-pinning the dependency to the installed version. Testing can be done by the CI, but should also include (manual) testing on Colab.
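The re-pinning step could be scripted by reading the installed version back. A sketch using only the standard library (importlib.metadata requires Python 3.8+):

```python
from importlib.metadata import version

def repin(package):
    """Return an exact 'name==version' pin for an installed package."""
    return f"{package}=={version(package)}"

# For example, re-pin pip itself; any installed package works the same way.
print(repin("pip"))
```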

Footnotes

  1. Installing another version produces errors (verbose or silent), and while it's possible to circumvent by re-starting the kernel this is complicated and slow.

  2. This list in setup.py is then used by pip or poetry to generate the requirements.txt or the lockfile -- a process involving dependency resolution. In the case of apps, you want to install from such a lockfile, as discussed above; in the case of libraries, the lockfile and the near-exact reproducibility it provides mainly serve the package developers, not the users.