pypa / pip

The Python package installer

Home Page: https://pip.pypa.io/

pip needs a dependency resolver

cboylan opened this issue

pip's dependency resolution algorithm is not a complete resolver. The current resolution logic has the following characteristics:

  • Only one top level specification of a requirement (otherwise pip raises a "double requirement" exception)
  • "first found wins" behavior for dependency requirements/constraints, which results in not respecting all constraints. pip 10+ print a warning when this occurs.

NOTE: In cases where the first found dependency is not sufficient, specifying the constraints for the dependency at the top level can be used to make it work:

pip install project "dependency>=1.5,<2.0"

(2019-06-23)

This is being worked on by @pradyunsg, in continuation of his GSoC 2017 project. A substantial amount of code cleanup has been done, and is ongoing, to make it tractable to replace the current resolver in a reasonable manner. This work enabled pip >= 10 to warn when it is going to make an installation that breaks the dependency graph. (The installations are not aborted in such scenarios, for backwards compatibility.)


(2019-11-29)

A status update regarding this is available here.


(2022-12-16)

See the closing note for details.

I added this to 1.5 because I think it's a pretty big wart that should be sorted out.

for now, the "pip solution" is to declare in your requirements file (or as an install argument directly) what specifier you want, and it overrides what happens down in the dependency tree.

i.e. in this case, put "foo>=1.0,<=2.0" in your requirements or add it as an install argument.

also, pip compiles everything it's going to install into a set first, then installs.
it's not installing everything as it's discovered, and its discovery order is not depth first.
the order of discovery isn't the issue, but rather the lack of doing any resolution as it's compiling the set.
Currently, first found wins.

I keep meaning to add a "conflict resolution" section to the Internal logic section of the docs that covers this.
practically speaking, pip's override logic is pretty nice IMO, and is all most people need most of the time.

I believe, based on discussions with @dstufft, that this may be related to the problems people will have if they try to upgrade from old distribute to new setuptools in the same set of software as other things. More specifically: if you depend on a piece of software that depends on distribute with an unbounded high end, and you are currently running on a pre-merge distribute, the dependency discovery will put distribute into the list along with the rest of the things that software A depends on, and then add distribute's dependency, setuptools, to the list later. THEN, when it's installing, distribute will be upgraded, which removes the distribute-provided setuptools code; the next thing in the list is then installed, which is not setuptools, and which fails because setuptools is not installed yet.

@emonty the distribute to setuptools upgrade problem you speak of is described in #1064, but also here http://www.pip-installer.org/en/latest/cookbook.html#importerror-no-module-named-setuptools

but this is not related to pip's conflict resolution shortcoming issues described in this issue.

Just had this issue installing a project - on a system that already had python-dateutil 1.4 installed - that depended upon python-dateutil (no specific version) and django-tastypie 0.9.16. Even though django-tastypie has a python-dateutil dependency of >=1.5 (and not 2.0), and the parent has no specific version dependency on python-dateutil, python-dateutil remained at 1.4. This seems pretty fundamentally against what a package installer should be doing in such a circumstance :)

FWIW I've been experimenting with making a real resolver for pip.

@dracos there is a "solution" right now (short of a new resolver for pip; although that would be nice). Specifically declare what you want the python-dateutil requirement to be in a requirements file, or as a top-level pip install argument, and that will be honored.

http://www.pip-installer.org/en/latest/cookbook.html#requirements-files
http://www.pip-installer.org/en/latest/logic.html#requirement-specifiers

pip install myproject "python-dateutil>=1.5,<2.0"

@qwcode Thanks, I know that and will have to do so - but this means anyone working on the project will have to manually check all dependencies. Say someone upgrades django-tastypie and it now requires a later version of python-dateutil, there's no way to detect this besides manual checking of each dependency upgrade/install. Oh well.

@dracos understood. it's a pain point. but just want others who find this, to at least know, there is some kind of solution.

@dstufft as you work on a new resolver, keep in mind that top-level requirements (i.e. pip install arguments or requirements file entries) will still have to be considered overrides (or dominant). That's a pip feature. Also, to state the obvious, this change will be "backwards incompatible" in the sense that many installations will turn out differently.

I've made a script (that is basically a patch to pip's prepare_files(), but I didn't want to patch pip, I can't control pip everywhere I use it for example) - available at https://github.com/dracos/check-pip-dependencies - that notes multiple dependency requests for the same package and outputs a list of conflicts, that you can then manually resolve using the methods suggested above.

This ticket looks like #174.

The pydata people at continuum analytics have built a packaging toolchain
more suited to scientific packaging (That's good, it can get funky).
Facing the same issues, they've implemented a dependency resolver using a SAT solver,
see here. The paper mentioned there is readable and they've open sourced the wrapper package
around the (open as well) SAT solver engine.

Basically, it's all there.

@y-p the only problem with that is that it relies on C code, rather than pure Python. I presume it would be unacceptable to expect that everyone installing pip has a C compiler easily available on their system; thus you would have to provide a compiled pip for every system Python hopes to run on. That is a major ask (anyone compiling and distributing for Python on ARM?).

It might be "all there", but it's in an unusable form for pip's general audience.

MIT-licensed pure Python: https://github.com/ActiveState/depgraph
(I just know of it, never tested it on the field)

0install used to be based on python (now on OCaml), and also wrote a pure python solver, which can be found in their 1.16 release (also in sat.py); afaik it was py23 compatible.

Also some good notes on sat solving on their page, and an awesome email.

minisat paper; GRASP paper

pkglib has recently released a pure Python dependency solver, FWIW: https://github.com/ahlmss/pkglib/blob/master/pkglib/pkglib/setuptools/dependency.py#L596

Thanks, I'll take a look at it when I take a look at the enthought one as well.

I've been reading some of the references. One thing I don't see in any of them is how version dependencies are managed - specifically, if package A depends on B (version >=2.0) how is that encoded. Is each version of a package treated as an entirely separate object for resolution purposes? Does anyone have any pointers to references on how this is handled?

So for a raw SAT solver yes. Basically a SAT solver lets you solve a boolean equation, ideally efficiently.

So if you have foo>=2.0 you'd look and see that foo has 1.0, 2.0, 2.1, 2.2, and 2.3, and you'd translate that into an equation like foo_2_0 or foo_2_1 or foo_2_2 or foo_2_3 (it's actually more complicated than that, because it also needs an equation to ensure that only one foo_x_y can be true at a time). The SAT solver will then spit back at you which set of variables should be True/False to satisfy the equation.

The most basic of SAT solvers can simply set all variables to False, and then randomly try setting a single variable to True (recording which ones it has already tried) to brute-force the equation. Speed-ups occur when you add other things on top of that, such as backtracking (instead of starting over from the beginning you undo the last choice and try a different choice), simplifying (finding subsets of the problem that are the same and collapsing them), as well as other techniques.

You can also be smarter about how you choose which variables to try to set to True. In the case above we probably don't want to randomly select one because if the only version specifier is foo>=2.0 then we'll end up with a random foo each time, but instead we'd want to try the highest version of foo first, and then go backwards, or even better, try the currently installed version of foo first, and then resort to trying the highest versions.

Basically all the SAT solver papers are just defining better ways at guessing which variables and computing/storing the results to speed up what is essentially a brute force problem.
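
As a purely illustrative sketch of that encoding (the version list is invented for the example, and the "solver" here is the naive brute force described above, not a real SAT engine):

from itertools import product

# Hypothetical index: every available version of "foo".
versions = {"foo": ["1.0", "2.0", "2.1", "2.2", "2.3"]}

def encode(project, predicate):
    # A requirement such as foo>=2.0 becomes the clause
    # (foo_2_0 or foo_2_1 or foo_2_2 or foo_2_3).
    return [(project, v) for v in versions[project] if predicate(v)]

clause = encode("foo", lambda v: tuple(map(int, v.split("."))) >= (2, 0))

def brute_force(clauses, candidates):
    # Try every true/false assignment, rejecting ones that select more than
    # one version of the same project (the "only one foo_x_y" constraint).
    for assignment in product([False, True], repeat=len(candidates)):
        chosen = {c for c, on in zip(candidates, assignment) if on}
        if len({name for name, _ in chosen}) != len(chosen):
            continue
        if all(any(lit in chosen for lit in clause) for clause in clauses):
            return chosen
    return None

candidates = [("foo", v) for v in versions["foo"]]
print(brute_force([clause], candidates))  # one satisfying selection, e.g. {('foo', '2.3')}

A real solver replaces the brute force with the backtracking and learning tricks mentioned above, and the version-preference heuristics decide which satisfying assignment you actually get.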

@pfmoore in addition to dstufft's reply, I would guess that reading my previous link http://0install.net/solver.html would tell you how this is done, it's a great article.

@Ivoz thanks I'd missed that one

https://github.com/nvie/pip-tools project has pip-compile command dealing with resolution of dependencies. At http://nvie.com/posts/better-package-management/ there's this statement:

We’ve created pip-compile to be smart with respect to resolving complex dependency trees

Might be worth finding out what algorithm is used there.

Just for sake of completeness, see the enthought/depsolver project.

tl;dr: Automatic resolution of conflicting dependencies has two solutions which differ slightly in usability and performance. However, it makes sense to look and see if just doing simple dependency resolution is good enough.

Full post:

(I am attempting to summarize much of the discussion on the distutils mailing list with subject: "name of the dependency problem")

The way that I see it, there are three options: simple dependency resolution, backtracking, or SAT solving.

Most Linux package managers do not do backtracking or SAT solving. They simply perform dependency resolution and, in the case of an error, will print a message and halt. I believe they make this design choice due to two factors. First, the occurrence of conflicts is very low, so the added complexity isn't worth the headache. Second, the automated handling of a conflict may often mask the fact that there is actually some underlying problem with the dependency specifications. In other words, when these dependency issues occur it is often because the existing dependency specifications are incomplete or inaccurate. This is a situation that the developers want to be aware of rather than having the issue solved for them in a potentially incorrect manner.

In situations where it is useful to have an automated way to resolve dependencies, backtracking and SAT solving are the two options I am aware of.

Backtracking dependency resolution, basically boils down to saving the state of the dependency resolver whenever you have multiple choices to resolve a dependency. Then if you reach a later point where there is a conflict, you can backtrack to the point where you made a choice and see if another option would resolve the conflict.
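
As a rough sketch of that idea (the toy index and requirement format below are made up for illustration; a real resolver has to fetch and cache this metadata per version), a recursive backtracking resolver can look like this:

# Toy index: project -> newest-first list of (version, dependencies),
# where each dependency is (project, set of acceptable versions).
INDEX = {
    "foo": [("2.0", [("bar", {"1.0", "1.1"}), ("zap", {"1.0"})]), ("1.0", [])],
    "bar": [("1.1", [("zap", {"2.0"})]), ("1.0", [])],
    "zap": [("2.0", []), ("1.0", [])],
}

def resolve(requirements, chosen=None):
    # Pick a version for each requirement; None tells the caller to backtrack.
    chosen = dict(chosen or {})
    if not requirements:
        return chosen
    (project, allowed), rest = requirements[0], requirements[1:]
    if project in chosen:
        return resolve(rest, chosen) if chosen[project] in allowed else None
    for version, deps in INDEX[project]:        # newest first
        if version not in allowed:
            continue
        result = resolve(deps + rest, dict(chosen, **{project: version}))
        if result is not None:                  # this choice worked all the way down
            return result
    return None                                 # no candidate worked: backtrack

print(resolve([("foo", {"1.0", "2.0"})]))
# {'foo': '2.0', 'bar': '1.0', 'zap': '1.0'} -- bar-1.1 was tried first and then undone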

The other option is to turn the dependency resolution problem into a boolean equation and use a SAT solver to find a solution. The SAT solver acts like a black-box that finds a matching set of packages, possibly subject to some sub-constraint (such as installing the minimum number of packages).

Given these two candidate solutions, it is worth understanding their pros and cons.

[Speed]: If the number of conflicting constraints is zero or small, then the algorithms will have similar speed. If the number of conflicting constraints is very large, then a SAT solver will almost certainly be faster than backtracking. I do not honestly know where the cut-off would be, but would imagine ~3-5 conflicting delegations in a package would be where backtracking would start to be noticeably slower than a SAT solver. In the worst case, both the SAT solver and backtracking techniques will be exponential in their input because the problem is NP hard.

[Predictability]: One important issue for many users and developers is to understand why a set of dependencies were chosen. Backtracking dependency resolution algorithms will choose the most recent package version and, in cases where there are options about what to install, will prefer resolving a dependency using the order in which packages are listed when the dependency is specified. Thus the developer has a reasonable measure of control over how dependencies are resolved and it is easier for users to understand why dependencies are chosen.

The SAT solver finds a solution that resolves dependencies using in-built heuristics. This means that developers specify dependencies, but the resolver decides the choices to make to resolve them. In some cases, this may find solutions that can be counter-intuitive. (I will use a near-trivial example to try to make this as clear as possible. Real world examples are likely to be much larger and more complex.) For example, suppose that package A has a dependency that can be fulfilled by B versions 1.0 to 2.0 and a dependency on package C version 1.0 (which B 2.0 also depends on). Suppose that two systems install A and that those systems are identical except that one system already had C 1.0 installed and the other installed it as part of dependency resolution for A. It is possible that those systems would end up with different versions of B installed. This is because the SAT solver made different choices about what was optimal based upon the difference in initial state and already installed packages. This can be confusing to users because both systems end up with A and C installed, but have a different version of B.

So, to summarize, there are reasons to potentially choose either type of automatic dependency conflict resolution. Despite (re?)inventing backtracking dependency resolution when building Stork, I really do not have a strong opinion on what should be used. Both algorithms behave in a similar manner when there are no conflicts (the normal case). If there are many conflicts, SAT solvers are faster, but backtracking makes it easier for the developer to control how dependencies are resolved and makes the resolution process easier for the user to understand. There is no wrong answer here and the distinctions between them are not major.

The value of automatic dependency conflict resolution depends on how often it is needed and how hard dependencies are to resolve other ways. Detection could be performed at upload time for packages hosted on Warehouse, to warn the developer of the situations where their users will see conflicts. (This could be a feature regardless of whether a SAT solver, backtracking, or just simple dependency resolution is adopted.)

It is plausible that simply doing simple dependency resolution and raising an error when a conflict occurs (i.e. not choosing either scheme) could also be the most desirable outcome. Perhaps more data and discussion would be useful before deciding on a course of action?

Maybe I am missing something. Tell me if I am wrong.

Before deciding how to do the dependency resolution, the data structures, and how they are transferred, should be thought about.

How does a dependency resolution algorithm get the needed data?

I looked at the Warehouse code. Things get much more complicated since a dependency has a "kind" attribute.

Related: pypi/warehouse#491 (comment)

How can a package depend on "foo installed with the 'build' kind of dependency"?

I think we should focus on data structures, their relations, and the transfer of them. The algorithm can be very easy once the basics are in place.

@JustinCappos to be clear, can you explain "simple dependency resolution"

short of a SAT solver or backtracking, it seems pip could improve itself significantly by simply trying to combine requirements (and failing if they can't be combined) as it parses them (in the current algorithm) vs just accepting the first one.

What I'm calling simple dependency resolution is to just greedily take the first matching thing that fulfills the requirements. Usually, this involves prioritizing the highest versioned package first.

For example, suppose that you have the following packages w/ dependencies:

foo-2.0 -- depends on bar
foo-1.0
bar-1.1
bar-1.0

If you ask to install foo, the dependency resolver will decide you likely want foo-2.0 (the latest version). When reading the metadata for foo-2.0, the dependency resolver will see the dependency on bar, resolve this with bar-1.1 (the latest), and then select it for installation as well.

However, there may be cases where this will fail to install a package when there are conflicting delegations. Suppose we tweak the example above to look like this:

foo-2.0 -- depends on bar and zap-1.0
foo-1.0
bar-1.1 -- depends on zap-2.0
bar-1.0
zap-2.0
zap-1.0

Simple dependency resolution will go through the same steps to choose foo-2.0 and bar-1.1 as before. However, following this it will choose zap-2.0 (to satisfy bar-1.1) and zap-1.0 (to satisfy foo-2.0). The intent to install different versions of the same package is a conflict and, in the case of what I am calling simple dependency resolution, the system will print an error and terminate.

Note that this need not occur when two packages both specify a version number. It is enough that a single package does so. Suppose that dependency resolution is done depth first (so that bar-1.1's dependencies are resolved before the second dependency of foo-2.0). If that dependency of bar-1.1 simply specifies zap, then zap-2.0 will be chosen by the simple dependency resolver. This will also cause some resolvers to declare an error and exit.
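
For illustration only, a minimal single-pass resolver over that second toy example (the data structures are invented; in reality the per-version metadata has to be downloaded) would hit the conflict like this:

# Justin's second example: newest-first versions, each with {dep: pinned version or None}.
PACKAGES = {
    "foo": [("2.0", {"bar": None, "zap": "1.0"}), ("1.0", {})],
    "bar": [("1.1", {"zap": "2.0"}), ("1.0", {})],
    "zap": [("2.0", {}), ("1.0", {})],
}

def simple_resolve(root):
    # Greedy single pass: always take the newest acceptable version, and
    # error out (rather than backtrack) if two different versions of one
    # project end up selected.
    chosen, queue = {}, [(root, None)]
    while queue:
        project, pin = queue.pop(0)
        picked = next(v for v, _ in PACKAGES[project] if pin in (None, v))
        if project in chosen and chosen[project] != picked:
            raise RuntimeError(f"conflict: {project} {chosen[project]} vs {picked}")
        if project not in chosen:
            chosen[project] = picked
            queue.extend(dict(PACKAGES[project])[picked].items())
    return chosen

print(simple_resolve("foo"))   # raises RuntimeError: conflict: zap 1.0 vs 2.0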

to be clear, I've seen 2 uses of "back-tracking" in the discussions.

the first, is what @JustinCappos mentions, which is back-tracking to try alternative versions when a conflict is discovered.

the second, is simply backtracking to re-walk dependencies, as additional constraints are discovered. For example, if you start with "A>=1", but then later discover the constraint "A<2", you may need to re-walk the dependencies of A, if the new constraint resulted in a new version of A.

the second type of "back-tracking" would be used in "simple dependency resolution", where you're still only reporting conflicts and not attempting the first type of "back-tracking" to fix the conflict.

I'm not sure that these situations are completely disjoint. As I see it, the constraints depend on a specific version of a package because they come from that package's metadata. For example, it is B-1.1 that depends on A>=1, not B, because B-1.0 may have no dependency on A at all.

not saying they're disjoint, just that "backtracking" (in the 2nd sense) can occur in a "simple dependency resolver". In both cases, it's fundamentally the same process, but just for different purposes.

https://github.com/CocoaPods/Molinillo is something (written in Ruby) that perhaps we could look at for inspiration as well.

@JustinCappos if you read again 0install's summary I believe it demonstrates an easy way to solve the "get confusing/random choices out of a SAT solver" problem.

Thanks for sending this. I'd seen the unoptimized version before from the OPIUM paper with the version number constraints, but hadn't seen this tweak. (An aside: it is interesting to me that the rationale behind the tweak mostly seems to be to improve the performance, but that it had a side effect where it made the package selection behavior more predictable to the end user.)

To summarize, the modified algorithm works as follows:

Take the package you want to resolve and find the highest version number you can resolve using the SAT solver. Choose that package. Take the first dependency of that package, assume the package you already selected must be installed, and find the highest version number you can resolve using the SAT solver. Choose that package. (Continue this process recursively in a depth-first manner, solving for all packages.)

What is interesting to me here is that the 0install folks came up with a very similar solution to backtracking dependency resolution (which uses recursion and performs package selection in exactly the same manner). The difference is that the backtracking dependency resolution algorithm determines that packages cannot be resolved by trying candidates itself (and then backtracking), while the 0install solution leverages a SAT solver to find that candidates will not be resolvable (and thus avoids them). As I understand it, both algorithms will come up with the exact same solution in every case.

So, with that in mind, I suppose it is worth looking at other factors. Performance and maintainability spring to mind. It would not surprise me to learn that the SAT solver optimization is faster at pruning than backtracking dependency resolution. I don't know if pip already requires a SAT solver. Also, I don't know what performance is likely / acceptable for either solution.

I have a student, Bing, who is planning to look into the dependency resolution options over the summer. We can look at 0install's implementation as well and provide some numbers to help people compare.

Does anyone still want to advocate for the OPIUM-style SAT solving? From the 0install post, it seems likely to be both slower and less understandable than what they implemented. So I would think that it would not be likely to be selected for inclusion. (If there is little interest in it, we will prioritize results for the other algorithms instead.)

Honestly, I don't much care what the actual solution is as long as it will give us a predictable and "ideal" (where ideal == the latest acceptable version that matches all constraints) set of things to install. A constraint we have is that we won't know the full list of dependencies up front, ideally in the future more and more of them will be available via an API provided by Warehouse, but for a long time the vast bulk are going to require downloading the files from PyPI and inspecting them (possibly even executing some Python) to determine what the dependencies are.

This says to me that whatever system we use needs to be able to handle incrementally adding new constraints into it and ideally it should attempt to minimize "wrong" guesses because each attempt is high cost for us. This sounds a lot more like the backtracking solution that @JustinCappos mentioned than a traditional SAT solver, but this is really an area I don't have a ton of foundational knowledge in so I'm personally figuring out the "right" answers as I go along.

Me too. I think the resolver algorithm should be replaceable. For the first step a simple one is enough. Later, the details could be improved. Optimize later.

For me the big question is: where does the resolving happen? At the client or at the server?

And the next question, which nobody has been able to answer for me up to now: how to handle the "kind" attribute of the dependency. Related db schema of warehouse: https://github.com/pypa/warehouse/blob/54c67bab2b150acad1267d4d0a4b32426adaa62d/warehouse/legacy/tables.py#L471-L485

I think it would be good to keep it simple: remove the "kind" attribute. If a package "foo" has several kinds, create N packages from one git/hg repo: "foo-dev", "foo-with-fancy-feature", ...

Do RPM/DPKG have dependencies that have a "kind" attribute?

Somehow related: the OpenStreetMap database schema evolved a lot during its first years. Guess what was done? It was simplified: fewer tables, fewer attributes. Special cases were migrated to simple and common graph structures.

Resolving will happen client side, and the KIND is one of:

  • requires
  • provides
  • obsoletes
  • requires-dist
  • provides-dist
  • obsoletes-dist
  • requires-external
  • project-url.

A lot of these don't make sense, or only make sense in a legacy sense; in reality you're only really dealing with something that says it requires certain things, something that says it provides a certain package, and something that says it obsoletes something.

Ok so, this issue has gotten long and a bit academic. The tl;dr in my opinion is:

  • erroring and stopping isn't very useful.
  • just unioning can only handle the simplest of cases, and certainly none of the ones that the projects I'm involved with (which care about this) care about.
  • I don't care about the implementation approach. Backtracking, formulate as SAT, random walk, simulated annealing... meh. We're unlikely to get it right the first time, because PyPI is so many orders of magnitude harder than distro packaging resolution that it's simply not funny.
  • we need something inside pip, because this has to kick in anytime conflicts exist with installed requirements or even potentially user pre-stated constraints (there are issues for these; I don't have them offhand, sorry).
  • I've written a backtracking resolver inside pip and am in the final stages of coercing the surrounding refactorings I did to let all the existing tests pass, which will then lead into making the code layout and structure actually be pleasant (it's not too bad right now, but that's only relative to what came before...).

Thoughtful comments and kibitzing on my PR (or an alternative PR that shows a better way) are appreciated and solicited. Even just testing it out to see if it eats your dog would be helpful to me. Thanks!

erroring and stopping isn't very useful.

I think users would prefer an error over what happens now, where conflicts are quietly ignored. With an error, a user could respond and add additional constraints to solve the conflict

just unioning can only handle the most simple of cases

a lot of the cases I've seen in the pip tracker would have been solved with a simple AND of the additional constraints. So that I understand, can you explain the cases you care about?

Resolving << erroring << today, agreed. But since I have a working resolver branch, I think debate is largely academic and uninteresting at this point. Unless someone wants to contend that erroring << resolving << today.

Unioning fails on transitive situations.

Simple cases like the following work with unioning: A depends on B and C >=1.0. B's latest release depends on C >= 1.2. Today you get 1.0 and it's all broken.

Unioning fails on situations like this: A depends on B and C >=1.0, <2.0 (e.g. ~= 1.0). B latest release depends on C>=2.0. B's prior release depends on C with some specifier that permits a C <2.0.

@rbtcollins - I am a little confused by the direction of those less-than signs. It reads like you're saying that resolving is worse than erroring is worse than what we have today, but then everything else sounds like you're saying the opposite of that? Would you mind clarifying just a bit?

Read it as evolution, not less-than. ETIRED; sorry for not being all that clear :)

A depends on B and C >=1.0, <2.0 (e.g. ~= 1.0).
B latest release depends on C>=2.0.
B's prior release depends on C with some specifier that permits a C <2.0

I wouldn't describe this as "Unioning fails".
Unioning, as a thing, is not the problem.
The "failure" is that a simple single-pass approach can result in a conflicting union.
If using backtracking, it requires more passes to find a good union.
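
To make that concrete: with the packaging library, the combined set of constraints on C from rbtcollins's second example can be checked directly. The available-version list and the ">=1.2" specifier for B's prior release below are made up for the example:

from packaging.specifiers import SpecifierSet

from_a = SpecifierSet(">=1.0,<2.0")           # A's requirement on C
from_b_latest = SpecifierSet(">=2.0")         # B latest release's requirement on C
from_b_prior = SpecifierSet(">=1.2")          # hypothetical spec from B's prior release

available = ["1.0", "1.5", "2.0", "2.3"]

# Single pass: combining A's and B-latest's constraints leaves nothing installable.
print(list((from_a & from_b_latest).filter(available)))   # []

# Backtracking to B's prior release yields a combined set that still has candidates.
print(list((from_a & from_b_prior).filter(available)))    # ['1.5']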

I'm not sure that I would say that resolving is better than erroring is better than today.

A different way to look at "erroring" versus "resolving" is to consider what a conflict is. It is likely that a developer was confused when packaging software, possibly because different developers did not synchronize their dependency requirements. They certainly didn't test the current combination of software described by the requirements on the repository. As a result, at best, the requirements as specified are untested, if not outright incorrect.

Now, the question is, what should be done about this issue? Many (most?) Linux distributions choose the "error" approach because they want a developer to be aware of the issue, look at it, and fix it. The alternative "resolve" solution tries to have the package manager decide what to do without user or developer interaction.

I'm not sure whether erroring is better or worse than resolving. To try to answer this question a student of mine is starting to look quantitatively at the problem and the steps that may make sense moving forward. We're very interested in checking out your PR and seeing how your solution will compare with others!

Thanks,
Justin

Interesting resource: https://people.debian.org/~dburrows/model.pdf

Implementing the full backtracking algorithm shouldn't be that much work; this is a solved problem anyway, and people implement much more complex algorithms in Python every day. However, I don't know how much metadata pip can get at this time. If getting the dependencies for a version of package B involves downloading the tarball and running its setup.py, backtracking to find a good solution suddenly becomes really slow.

On another point, if we're discussing doing this right, we probably need "conflict" specifiers as different libraries might provide the same interface (i.e. same top-level package name). This doesn't really add difficulty to the resolution algorithm, but packages would need a way to specify this.

Of course the pip/setuptools/distutils situation is not helping 😢

@JustinCappos w.r.t. distro packaging: apt-using distros (Debian, Ubuntu, etc.) don't error - they resolve. I don't know what metric you're using for 'most' - # of distinct packaging systems, # of product OSs using a given system (so weighted by vendors), or # of users (so weighted by market success) - but I'm pretty sure apt has a massive dominance if you weight by users, and is still very large weighted by vendors.

@remram44 - Daniel's paper is less relevant to pip than it may seem. Certainly the translation to SAT is in common, but distros aggressively trim the space - there are typically at most 5 versions of any one package being chosen amongst (missing, last-release-of-distro, last-release-of-distro-security, new-release, new-release-of-distro) - and only 3 versions for installation runs being done within a release. Whereas we have dozens or hundreds of versions. Not having conflicts is an advantage in many ways, though I do agree we may need them. We probably want to be a bit conservative about that though. My branch implements full backtracking with pseudo-random probing (which IIRC has been shown to support getting out of bad brute-force scans efficiently), and linear best-first probing of a given dep. I'll be super enthusiastic to see folk putting forward different resolution algorithms: the resolver is 100 lines of code; the rest of my branch is refactoring to deal with the reality of having numerous versions which we need to cache on disk while we're choosing (because we can revisit a given version many times - hundreds or thousands easily).

I think users would prefer an error over what happens now, where conflicts are quietly ignored. With an error, a user could respond and add additional constraints to solve the conflict

+1, I can't count the number of times invalid dependencies have been put together because an invalid situation was ignored.

In the cases where a "fail-fast" approach is taken, there should be more detailed messages about the conflicting requirements. Such messages should list the conflicting requirements and possibly suggest workarounds to whatever extent possible. Current messages list the package whose installation failed, and only one of the requirements, e.g. "pkg_resources.ContextualVersionConflict: (requests 2.5.1 (/usr/local/lib/python2.7/dist-packages), Requirement.parse('requests<2.5.0,>=2.2.1'), set(['docker-py']))".

Clearly docker-py is involved, but the other, conflicting packages are not shown. As a developer, my reaction to this message is "Okay, I understand requests 2.5.1 is what is currently installed, and I understand that docker-py doesn't like requests 2.5.1, but I'm installing docker-py. Why won't pip install what it needs?"

I think users would prefer an error over what happens now, where conflicts are quietly ignored. With an error, a user could respond and add additional constraints to solve the conflict

+1, I can't count the number of times invalid dependencies have been put together because an invalid situation was ignored.

I agree. Errors should never pass silently.

We would love a solution to this issue: also cf. #3183

Fellows, may we use the NuGet v3 dependency resolution approach? It's better than nothing, isn't it?
It seems to be viable.

I think this is what the conda guys use: https://github.com/ContinuumIO/pycosat

@Juanlu001 unfortunately it's not pure python

One issue is that any pure-python solution will be slow, except perhaps if we are running on PyPy.

However, it seems to me that the better solution is to allow the same package to have two different versions installed, eliminating conflicts.

allow the same package to have two different versions installed, eliminating conflicts.

Due to how importing in Python works, I believe this to actually be possible by providing a dynamic Loader for sys.meta_path, which checks for the package's specified dependencies in order to load the correct version, and enabling that for all modules/packages installed via pip. I do not know how one would accomplish a more or less semi-permanent modification of sys.meta_path however.
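
For what it's worth, the mechanism being referred to looks roughly like the sketch below. The directory layout and the VERSIONED_LOCATIONS/wanted mappings are entirely hypothetical, and this only shows the sys.meta_path hook itself, not the hard part of deciding, per dependent, which version to hand out:

import importlib.abc
import importlib.util
import sys

# Hypothetical layout: each pinned copy of a package lives in its own
# directory, e.g. /site-multiversion/liba/1.2/liba/__init__.py
VERSIONED_LOCATIONS = {
    ("liba", "1.2"): "/site-multiversion/liba/1.2/liba/__init__.py",
    ("liba", "1.4"): "/site-multiversion/liba/1.4/liba/__init__.py",
}

class VersionedFinder(importlib.abc.MetaPathFinder):
    def __init__(self, wanted):
        self.wanted = wanted                     # e.g. {"liba": "1.4"}

    def find_spec(self, fullname, path=None, target=None):
        version = self.wanted.get(fullname)
        if version is None:
            return None                          # let the normal finders handle it
        location = VERSIONED_LOCATIONS[(fullname, version)]
        return importlib.util.spec_from_file_location(fullname, location)

# Installed ahead of the default finders so it wins for the packages it knows about.
sys.meta_path.insert(0, VersionedFinder({"liba": "1.4"}))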

@drbo the complexity is O(n^e) - so exponential in the size of the graph; the difference between CPython, PyPy, and C performance is only a linear factor. Further, the constant cost of examining a package is high (network transfer of the archive), so yeah - Python performance is insignificant.

However, it seems to me that the better solution is to allow the same package to have two different versions installed, eliminating conflicts.

Not great.

What happens when liba v1.2 emits a dictionary with 3 fields, but libg requires v1.4 of liba which has added an extra field in the dictionary? If these are all allowed to mix in the same runtime, then generally a developer will bump into cryptic errors after doing a normal update of dependencies. Telling them you just won't run two different versions of the same library at the same time eliminates this class of runtime bugs.

also python does not support importing more than one version of the same code

What happens when liba v1.2 emits a dictionary with 3 fields, but libg requires v1.4 of liba which has added an extra field in the dictionary?

The approach here wouldn't be to have each library carry a private version of its dependencies, bringing the kind of madness only the Node.js world could tolerate, but to do dependency resolution over the installed package versions when a program is started. I believe this is the setuptools way; this is why the setuptools-generated entry scripts look like this:

#!python
# EASY-INSTALL-ENTRY-SCRIPT: 'reprounzip==1.0.4','console_scripts','reprounzip'
__requires__ = 'reprounzip==1.0.4'
import sys
from pkg_resources import load_entry_point

if __name__ == '__main__':
    sys.exit(
        load_entry_point('reprounzip==1.0.4', 'console_scripts', 'reprounzip')()
    )

The point of pkg_resources was always to do all kinds of path/packages manipulation but most of that was never widely used.

One approach would be to isolate multiple versions from each other in the same process, but mutable globals keep that from working in Python. It might work in Haskell or Rust, but not in Python.

What pkg_resources does doesn't actually remove the need for a dependency resolver. It just pushes the need from install time to runtime (e.g. if you dynamically adjust sys.path to add the needed version, you have to resolve which one of the installed versions you should add, taking into account all of the version specifiers that may be required). This is a bit of a narrower field, since you're unlikely to have every version of every library installed, but on the flip side it means that we inflict our own dependency resolver on other packaging systems that otherwise have no need for it (like when installing with apt-get) - never mind the fact that pkg_resources is incredibly slow, because it has to issue a ton of stat calls and traverse the entire sys.path looking for installed files, a cost that you have to pay on every entry-point-using script.

A similar issue to @dracos's:

  1. A==1.0 depends on B==1.0 and C (no specific version); B==1.0 depends on C==1.3.

  2. Start in a Python environment that already has package C==1.0 installed.

  3. After installing A, the environment contains: A==1.0, B==1.0, C==1.0.

  4. C won't be updated to 1.3.

Just to add to the conversation, and for reference to anyone else finding this issue, if you specify a pin in requirements and something else in constraints, the requirement will be ignored and only the constraint used. I've shown an example here: https://gist.github.com/odyssey4me/d8acc9888cc206818e21e059f28b3576

#4182 has an interesting sub-problem, specifically related to build dependencies.

Enthought's solver has been moved to enthought/sat-solver and has advanced significantly.

@westurner The main blocker is that requirements cannot be downloaded reliably from pypi.

@xoviat

@westurner The main blocker is that requirements cannot be downloaded reliably from pypi.

In terms of? https://status.python.org/

Would it be helpful to cache the catalog [JSON] dependency metadata?

Would it be helpful to cache the catalog [JSON] dependency metadata?

Now you are onto it. What if pip downloads a package and then discovers that it downloaded the wrong package? This may happen hundreds of times with a dependency solver enabled. The only reliable solution is to be able to fetch metadata without downloading the entire package. But currently the API that can be used to do this doesn't work half the time (possibly exaggerating, but still generally correct).

To be clear, the API "doesn't work" not because it's "down" but because it only works for particular packages.
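
Presumably the API in question is PyPI's JSON endpoint, which exposes a release's requires_dist without downloading the archive; the catch, as noted, is that for many uploads the field is simply null. A minimal fetch looks something like this:

import json
from urllib.request import urlopen

def requires_dist(project, version):
    # Ask PyPI's JSON API for a release's declared dependencies without
    # downloading the archive. For many uploads the field is simply null,
    # which is the unreliability discussed above.
    url = f"https://pypi.org/pypi/{project}/{version}/json"
    with urlopen(url) as response:
        info = json.load(response)["info"]
    return info.get("requires_dist")   # list of PEP 508 strings, or None

print(requires_dist("requests", "2.18.4"))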

One issue is that any pure-python solution will be slow, except perhaps if we are running on PyPy.

We did not investigate substantially, but in our tests, we found our own simplesat to be ~ as fast as conda's solver using picosat. It is definitely too slow for cases where you have 1000s of packages installed, but those are rather rare I think.

The issue is definitely not the SAT solver.

Also some good notes on sat solving on their page, and an awesome email.

@Ivoz The link to the email is broken. Is there some other place I could find it?

Is there no interest in at least adding something that errors in the current situation where pip will happily install broken sets of dependencies? I agree a perfect solution would be lovely, but right now pip will silently create broken installations because it seemingly just grabs the first dependency restriction and uses that.

I'd be happy to contribute a patch that implements correct stop-and-error behaviour (maybe behind a CLI flag if there's backwards-compat issues) so that the current implementation can at least stop making broken installs, as a stepping-stone to having a full solver for the dependency constraint set. Otherwise, we're going to have to find some way to hook a dependency checker into the installation process ourselves for all our internal packages and it's going to get messy.

@andrewgodwin There is #3787 that tracks the same. I do want to work on this but can't because of lack of time right now...

I'd be happy to contribute a patch that implements correct stop-and-error behaviour

Please do. 😄

@andrewgodwin Are you aware of pip check?

I was not - I'll look into it tomorrow, but it sounds like it still needs to be run separately after an install is completed?

Yes it does.

I think that the ideal place to stop-and-error is before the installation occurs and after the requirement set is finalized. I have detection for dependency conflicts already set up at this point in an old experimental fork of 8.0.0.dev0 pip here. The code's not fit to use, but it would probably help serve as a guide if someone wanted to do this in production pip.

The larger project accompanying that fork is here, so that might be helpful reading. (Incidentally, draft results are here.)

Another issue that depends on this being implemented - #53.

Edit: As it turns out, maybe not?

What a fucking joke.

@atrigent Behave respectfully or leave.

From a discussion in IRC, currently the use case of passing an option through to a dependency in extras_require also doesn't work:

If in setup.py we have something like:

install_requires=["package"]
...
extras_require={"coolextra": ["package[coolextra]"]}

When I pip install -e .[coolextra], it just installs the package without the coolextra option.

Just wanted to mention it here so that it's tracked, so whatever solution that fixes this issue will also fix this. Alternately, if extras_require has some way of specifying "if no options are specified" or "if this particular option is not specified" it might be solvable by specifying:

extras_require = {
    "!coolextra": ["package"],
    "coolextra": ["package[coolextra]"],
}

Update: actually apologies, reading through all the linked issues more carefully, it seems like it'd be the same issue as #4391.

I'll be working toward resolving this issue over this summer, under the mentorship of @dstufft and @JustinCappos, as a part of GSoC 2017.

For anyone who's interested:

  • The final proposal for this work can be found here.
  • Regular updates on the work being done would be posted on this blog.

PS: I just realized no one had mentioned this here. I took the liberty of doing it myself - hopefully no one minds.

Not sure if that was mentioned somewhere above, but I could not find it.
The solution for this bug which hopefully will appear somewhere in the future should also handle the following scenario:
Currently, when a certain package is referenced in several places of the dependency tree, the first encountered reference wins. And all [extra]s attached to it also win.
I am sure there should be a mechanism for combining these extras. So what I want to stress here is that while it might be hard to resolve version conflicts between several occurrences of the same package (and sometimes impossible to find a "fits-all" solution), it is trivial to handle extras - just merge them.

Hi @MarSoft!

Combining extras is, indeed, the right behaviour here. Currently, this behaviour change is being tracked in #4391. It's been mentioned in #4653, #4664 and also #3903 was the original tracking issue for combining extras IIRC.

As of right now, I'm rewriting the dependency resolution logic of pip and this is one of the situations that would get fixed as a part of that. :)

Looks like @pradyunsg just updated his blog to let everyone know he's coming into the final stretch. I'm rooting for you @pradyunsg, as I'm sure is everyone else on this thread!

https://pradyunsg.github.io/gsoc-2017/08/14/final-lap/

💪

Woah! Thanks for some extremely motivating words @greysteil! ^>^

@pradyunsg Are you still working on this? Is there an ETA? No rush but I would like to use the resolver API in a PR.

I went AWOL on this.

Are you still working on this?

Yep. Very much so. :)

I would like to use the resolver API in a PR.

Well, I'd say you should wait for a while. There's an open PR that changes the Resolver API (#4636), so that should drive home my point that the "API" hasn't really settled down yet.

If you are going to go with SAT solver (which I really recommend to), you could use https://github.com/openSUSE/libsolv, it is used by package managers of various Linux distributions such as openSUSE/SUSE (zypper), Fedora (dnf), Mageia (dnf).

P.S. sorry, I didn't have time to read full story behind.

Not helpful.

Hi @ignatenkobrain! :)

Thanks for the suggestion but...

  • a SAT solver isn't a viable option. The reason is that dependency information is not completely available statically; it requires the execution of setup.py, which is potentially non-deterministic, and determinism is a requirement for using a SAT solver.
  • pip won't package a C dependency (for a laundry list of reasons that I don't recollect)

it requires the execution of setup.py which is a potentially non-deterministic

I'm highly intrigued. Can you give an example of a non-deterministic setup.py?

@piotr-dobrogost: please consider having discussions about how Python works off this ticket, a lot of people are tracking it for updates on the feature.

I'm highly intrigued. Can you give an example of a non-deterministic setup.py?

Since it's just Python, developers can do whatever they like within it. A common example is having the version number live in some other module and having setup.py import that - I've personally seen this result not in different setup.py output but rather in failure, where it ends up importing more code than intended in certain cases and causes setup.py to error out.

I'm highly intrigued. Can you give an example of a non-deterministic setup.py?

I can see why one might want to say "non-deterministic", but that can be misleading. Non-static is what @pradyunsg means, I think?

Assuming the adjusted question is of interest:
For example, some time ago, I mirrored PyPI locally and ran a few test installations for each of several hundred thousand package versions. For 16% of the distributions, installation in Python2 resulted in different installed dependencies or dependency versions than installation in Python3, in an otherwise identical environment. setup.py can do whatever it wants, and chained dependencies result in a lot of differences.

(I'm not sure where else to continue the conversation, by the way - the distutils mailing list?)

Folks, this is not the place to discuss this. You can downvote me all you want, but the fact is that this issue is for discussing the dependency resolver, not Python behavior.

(I'm not sure where else to continue the conversation, by the way - the distutils mailing list?)

Yes, that would be the perfect place to discuss.

Hey everyone! Hope you've had a great start to 2018!

This issue has been quiet for a while now and I have stuff worth telling. :)

There's a branch of pip that is capable of printing a warning when an incompatible set of packages is going to be installed. I'm not completely sure if the UX is good enough, so I'm seeking input on this behaviour: whether it is indeed an improvement over the status quo, what potential improvements could be made to it, etc. I'd appreciate any feedback on it over on pradyunsg#1.

As for proper resolution, all the preparatory refactoring that was needed on pip's side is basically done (it's what made the validation stuff mentioned above much easier to do). It was much more work than I'd anticipated initially.

I'm working on the resolver separately now and will bring it into pip once it's ready. What has been implemented currently already handles whatever I've thrown at it so far, including interesting situations with cycles and optional dependencies (extras). If you have a situation where pip 9's resolver does the wrong thing or causes issues, I'd really appreciate it if you could file an issue over at zazo, detailing the projects and versions involved, to help make sure that zazo's resolver will Do The Right Thing™ (especially if there's a large number of projects and versions).

Cheers!

edit: Look, 100th comment!