repology / repology-updater

Repology backend service to update repository and package data

Home Page:https://repology.org

Handle ambiguous name/version splits in nix

AMDmi3 opened this issue · comments

The Nix metadata format does not split package names from versions (e.g. "name": "foo-1.0"). On top of that, both name and version may have hyphens in them. So there's no way to split name and version reliably, and though we use some heuristics (take the largest part from the right which starts with a number), there are cases which are processed incorrectly (liblqr-1-VER, python3.6-3to2-VER, polkit-qt-1-qt5-VER etc.). I'm running into a lot of these lately, so this needs to be fixed.
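To illustrate why the split is ambiguous, here is a minimal Python sketch that lists every name/version pair a hyphenated string could stand for. The function name `candidate_splits` and the version `0.4.2` (standing in for VER) are made up for this example; this is not Repology's parser.

```python
def candidate_splits(full_name: str) -> list[tuple[str, str]]:
    """Return every (name, version) pair obtained by cutting at one hyphen."""
    splits = []
    for index, char in enumerate(full_name):
        # Skip hyphens at the very start or end, which would leave an empty part.
        if char == "-" and 0 < index < len(full_name) - 1:
            splits.append((full_name[:index], full_name[index + 1:]))
    return splits


# "0.4.2" is a hypothetical version standing in for VER.
for name, version in candidate_splits("liblqr-1-0.4.2"):
    print(f"name={name!r} version={version!r}")
```

Both resulting pairs have a version part that starts with a digit, so a "starts with a number" rule alone cannot tell which one is intended.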

@ryantm @volth could nix dump format be extended to provide separate name and/or version field(s)?

There is an ongoing migration to supply each package with pname and version

That's great! How much of it made its way into the unstable dump? I'm already seeing some "version"s, but not "pname"s yet. That should be enough to split the name reliably, though; I'm checking it right now.

Nix has a builtin function to split the name, but it seems that it would fail on your examples

It turns out that I'm already using it, it's even mentioned in the source comment that references https://github.com/NixOS/nix/blob/master/src/libexpr/names.cc#L19.

It would be nice if Repology listed packages with unparseable names in its list of problems.
That should result in quick fixes; people love to fix the list of problems.

Unfortunately problems support is currently quite limited, however this information can be dumped to update log (available from https://repology.org/repositories/updates#nix_unstable).

Update: I've tried to use "version" in the parser

There are some inconsistencies:

xzoom-0.3.24, version 0.3
riot-desktop-1.1.0, version empty

However, many more packages were fixed:

-nix_unstable fuse 7z-ng-git-2014-06-08
+nix_unstable fusefs:7z-ng 2014-06-08
-nix_unstable lisp-trivial-utf 8-20111001-darcs
+nix_unstable lisp-trivial-utf-8 20111001-darcs
-nix_unstable wmii-hg 2012-12-09
+nix_unstable wmii hg-2012-12-09

Some packages with version not starting with number are now parsed (like lisp-drakma v2.0.4).
Some packages are now parsed with hg/git as version prefix, not package name suffix, which is good for matching them with other repos.

It would be nice if Repology listed packages with unparseable names in its list of problems.

It doesn't make sense to dump all potentially ambiguous names - these are basically everything with two or more hyphens, and there are too many of them.

I meant, to mark the packages without pname and version as problematic.

Well, as of now that's all of them, so it doesn't make much sense. It does make sense, though, to report the most suspicious ones. I've found that most of the packages I've had to add exceptions for fall under the '-[0-9]+[a-z]' regexp. I'm logging these now.
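A small Python sketch of that check, assuming the pattern is applied with a plain search over the full name. The helper name `is_suspicious` and the version `1.1.1` are made up for illustration; the actual logging code may differ.

```python
import re

# The pattern quoted above: a hyphen, one or more digits, then a lowercase letter.
SUSPICIOUS = re.compile(r"-[0-9]+[a-z]")


def is_suspicious(full_name: str) -> bool:
    """Return True if the name contains a digit-then-letter component after a hyphen."""
    return SUSPICIOUS.search(full_name) is not None


# "1.1.1" is a hypothetical version standing in for VER.
for full_name in ("python3.6-3to2-1.1.1", "abcl-1.5.0"):
    print(full_name, is_suspicious(full_name))
```

The pattern flags `python3.6-3to2-...` because of `-3t`, while an ordinary name such as `abcl-1.5.0` passes.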

I'm deploying update with "version" aware parser and extended logging, so keep an eye on the update log.

commented

Note that pname and version are explicitly present in nix metadata.
For example, my package uhubctl is incorrectly detected as:

pname: uhubctl-unstable
version: 2019-07-31

when it should be:

pname: uhubctl
version: unstable-2019-07-31

However, if you follow link to nix package from https://repology.org/project/uhubctl/versions, it has pname and version specified separately:
https://github.com/NixOS/nixpkgs/blob/master/pkgs/tools/misc/uhubctl/default.nix#L7-L8

Note that pname and version are explicitly present in nix metadata.

Not in the json which Repology uses. @volth ping? It also hasn't been updated since Aug 30th.

% curl --silent https://nixos.org/nixpkgs/packages-unstable.json.gz | gunzip | jq .packages.uhubctl
{
  "name": "uhubctl-unstable-2019-07-31",
  "system": "x86_64-linux",
  "meta": {
    "available": true,
    "description": "Utility to control USB power per-port on smart USB hubs",
    "homepage": "https://github.com/mvp/uhubctl",
    "license": {
      "fullName": "GNU General Public License v2.0 only",
      "shortName": "gpl2",
      "spdxId": "GPL-2.0-only",
      "url": "http://spdx.org/licenses/GPL-2.0-only.html"
    },
    "maintainers": [
      {
        "email": "pavol@rusnak.io",
        "github": "prusnak",
        "githubId": 42201,
        "keys": [
          {
            "fingerprint": "86E6 792F C27B FD47 8860  C110 91F3 B339 B9A0 2A3D",
            "longkeyid": "rsa4096/0x91F3B339B9A02A3D"
          }
        ],
        "name": "Pavol Rusnak"
      }
    ],
    "name": "uhubctl-unstable-2019-07-31",
    "outputsToInstall": [
      "out"
    ],
    "platforms": [
      "aarch64-linux",
      "armv5tel-linux",
      "armv6l-linux",
      "armv7l-linux",
      "mipsel-linux",
      "i686-linux",
      "x86_64-linux",
      "powerpc64le-linux",
      "riscv32-linux",
      "riscv64-linux",
      "x86_64-darwin",
      "i686-darwin",
      "aarch64-darwin",
      "armv7a-darwin"
    ],
    "position": "pkgs/tools/misc/uhubctl/default.nix:23"
  }
}
commented

Perhaps nix can add 2 more fields to json output: pname and version?
That should be still backwards compatible and solve this problem for repology.

It's mentioned in the discussion above, these fields are (or at least should be) present in the json. For instance, abcl:

{
  "name": "abcl-1.5.0",
  "system": "x86_64-linux",
  "meta": {
    "available": true,
    "description": "A JVM-based Common Lisp implementation",
    "homepage": "https://common-lisp.net/project/armedbear/",
    "license": {
      "fullName": "GNU General Public License v3.0 only",
      "shortName": "gpl3",
      "spdxId": "GPL-3.0-only",
      "url": "http://spdx.org/licenses/GPL-3.0-only.html"
    },
    "maintainers": [
      {
        "email": "7c6f434c@mail.ru",
        "github": "7c6f434c",
        "githubId": 1891350,
        "name": "Michael Raskin"
      }
    ],
    "name": "abcl-1.5.0",
    "outputsToInstall": [
      "out"
    ],
    "platforms": [
      "aarch64-linux",
      "armv5tel-linux",
      "armv6l-linux",
      "armv7l-linux",
      "mipsel-linux",
      "i686-linux",
      "x86_64-linux",
      "powerpc64le-linux",
      "riscv32-linux",
      "riscv64-linux"
    ],
    "position": "pkgs/development/compilers/abcl/default.nix:34",
    "version": "1.5.0"
  }
}

Not sure why it's not present for uhubctl.

Also, notably, there are no pnames in the json at all. While it seems like we can split name/version reliably by trimming version from name, it doesn't work in all cases in practice:

https://repology.org/log/1418803

...
winePackages.staging: ERROR: name "wine-4.14-staging" does not end with version "4.14"
wine-staging: ERROR: name "wine-4.14-staging" does not end with version "4.14"
wineWowPackages.staging: ERROR: name "wine-wow-4.14-staging" does not end with version "4.14"
...
xzoom: ERROR: name "xzoom-0.3.24" does not end with version "0.3"
...
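The trimming approach and its failure mode can be sketched in Python as follows. The function name `split_by_version` is hypothetical and the error text only mimics the log lines above; this is not the actual Repology parser.

```python
def split_by_version(full_name: str, version: str) -> tuple[str, str]:
    """Derive the package name by trimming '-<version>' from the end of the name."""
    if not version:
        raise ValueError(f'name "{full_name}" has an empty version')
    suffix = "-" + version
    if not full_name.endswith(suffix):
        raise ValueError(f'name "{full_name}" does not end with version "{version}"')
    pname = full_name[: -len(suffix)]
    if not pname:
        raise ValueError(f'name "{full_name}" has nothing left after trimming the version')
    return pname, version


print(split_by_version("abcl-1.5.0", "1.5.0"))

# Cases taken from the log above: the version is not a suffix of the name.
for full_name, version in (("wine-4.14-staging", "4.14"), ("xzoom-0.3.24", "0.3")):
    try:
        split_by_version(full_name, version)
    except ValueError as error:
        print("ERROR:", error)
```

Raising instead of guessing keeps bad data out of the results, at the cost of dropping those packages until the metadata is fixed.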
commented

Can you use attribute name as package name, and work out version from it?
https://nixos.org/nixos/packages.html?channel=nixpkgs-unstable&query=uhubctl

Note that full json https://nixos.org/nixpkgs/packages-unstable.json.gz has a field package name in its hierarchy, e.g:

    "uhubctl": {
      "name": "uhubctl-unstable-2019-07-31",
      "system": "x86_64-linux",
      "meta": {
        "available": true,
        "description": "Utility to control USB power per-port on smart USB hubs",
        "homepage": "https://github.com/mvp/uhubctl",
        "license": {
          "fullName": "GNU General Public License v2.0 only",
          "shortName": "gpl2",
          "spdxId": "GPL-2.0-only",
          "url": "http://spdx.org/licenses/GPL-2.0-only.html"
        },
        "maintainers": [
          {
            "email": "pavol@rusnak.io",
            "github": "prusnak",
            "githubId": 42201,
            "keys": [
              {
                "fingerprint": "86E6 792F C27B FD47 8860  C110 91F3 B339 B9A0 2A3D",
                "longkeyid": "rsa4096/0x91F3B339B9A02A3D"
              }
            ],
            "name": "Pavol Rusnak"
          }
        ],
        "name": "uhubctl-unstable-2019-07-31",
        "outputsToInstall": [
          "out"
        ],
        "platforms": [
          "aarch64-linux",
          "armv5tel-linux",
          "armv6l-linux",
          "armv7l-linux",
          "mipsel-linux",
          "i686-linux",
          "x86_64-linux",
          "powerpc64le-linux",
          "riscv32-linux",
          "riscv64-linux",
          "x86_64-darwin",
          "i686-darwin",
          "aarch64-darwin",
          "armv7a-darwin"
        ],
        "position": "pkgs/tools/misc/uhubctl/default.nix:23"
      }
    },

This way your script can know that the actual package name is uhubctl, and thus everything after the dash in name must be the version.

You'd still need to extract the version part from name, and that won't work, as the attribute name is different from the name part in name. Besides, attribute names contain even more garbage than package names.

I recently added pname to the majority of nixpkgs derivations, so pname should be reliable.

This is great; I'm waiting for it to be properly exported via the dump.

I wonder, would it be possible for pnames in the dump not to contain addenda such as asciidoc-full, asciidoc-full-with-plugins or arm-trusted-firmware-sun50iw1p1, arm-trusted-firmware-sun50i-h6, ...? These are so numerous that I refuse to merge them any more.

I wonder, would it be possible for pnames in the dump not to contain addenda such as asciidoc-full, asciidoc-full-with-plugins or arm-trusted-firmware-sun50iw1p1, arm-trusted-firmware-sun50i-h6, ...? These are so numerous that I refuse to merge them any more.

We need different names as those are used to distinguish packages by nix-env. Hmm, perhaps we should add something like meta.repologyBasePackage for them.

I believe those predate the introduction of the pname/version split, especially since it is done with this highly unidiomatic invocation. We should move that name modification to the package expression itself so that the package name is always the same when doing .override { enableStandardFeatures = true; }.

I consider our introduction of pname and version attributes to be an abstraction over the low-level package-agnostic Nix derivations. Ideally, there would be no name in Nixpkgs expressions, only in the low-levels of mkDerivation mapping it onto Nix’s derivation primitive. (And I have heard some interest in introducing native package primitive to Nix, which could, in the future, allow us to drop name altogether.)

By the way, in many packages we currently have unstable as a part of version, rather than of pname as suggested in https://nixos.org/nixpkgs/manual/#sec-package-naming. I know I am guilty of promulgating this, and we should probably fix that before switching Repology to pname & version.

Maybe we should discuss these issues in Nixpkgs issue tracker instead.

There is no introduction of a pname/version split ;) Each derivation has a name, even if it does not represent a package (as Repology sees a package). For example, fetchurl, fetchpatch, runCommand and buildEnv are Nix derivations too, and they do have a name, but no pname or version.

I am aware that there is no actual package primitive but we do have packages on conceptual level and I would consider the introduction of pname support to mkDerivation instrumental in the imaginary breaking away of package from derivation. I consider the fact that a package is also a derivation with a name an implementation detail.

We need different names as those are used to distinguish packages by nix-env

I believe that since pname/version were introduced just recently, nix-env only cares about name, so pname could still contain the upstream name, and name may consist of more parts than just pname and version, as volth suggests

name is not always equal to pname+"-"+version, it is just a default value.

pname = "asciidoc";
version = "1.0";
name = "${pname}-full-with-plugins-${version}";

Hmm, perhaps we should add something like meta.repologyBasePackage for them.

If it can't be pname, just basePackage or basename or alike if it's possible. I don't believe there's anything specific to Repology here and I don't want repos to introduce any Repology specific things. Something close to upstream project name has a lot more uses than just Repology.

I believe that since pname/version were introduced just recently, nix-env only cares about name, so pname could still contain the upstream name

We could make nix-env use pname, but currently that attribute is a Nixpkgs-only concept. It would be more systematic to open an RFC to make Nix aware of pname rather than handle it ad hoc in different places.

name may consist of more parts than just pname and version as volth suggests

As I explained above, I am not very fond of this idea since pname is the name of the package, not the project; and the cases where name is different than pname are historical relics.

  • name – Derivation name; an internal detail of Nix language; for packages, it is typically ${pname}-${version}.
  • pname – Package name; Nix does not recognize this, only as a part of ${name} before the first dash that is followed by a number; used for finding the package with nix-env (for performance reasons, you are better off using the attribute path, though)

We could redefine the values as follows:

  • name – Derivation name; an internal detail of Nix language; for packages, it is typically ${pname}-${variant}-${version}.
  • pname – Project name; when variant is not specified this is also a package name.
  • variant – Build configuration name; used to distinguish different package variants in nix-env
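As a rough illustration of the proposed scheme, the derivation name could be composed like this. The function `derivation_name` is a hypothetical helper written in Python for this example; it is not part of Nixpkgs, where this logic would live in Nix expressions.

```python
from typing import Optional


def derivation_name(pname: str, version: str, variant: Optional[str] = None) -> str:
    """Compose ${pname}-${variant}-${version}, omitting the variant when unset."""
    parts = [pname]
    if variant:
        parts.append(variant)
    parts.append(version)
    return "-".join(parts)


# Values taken from the asciidoc example earlier in the thread.
print(derivation_name("asciidoc", "1.0"))
print(derivation_name("asciidoc", "1.0", variant="full-with-plugins"))
```

With the variant kept in its own field, a consumer such as Repology could read pname directly and never has to parse the composed name.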

Alternately, we could consider dropping the variant names from package names, as they are primarily used by the slow legacy nix-env (everything now uses attribute paths). The only other place they figure are Nix store paths but it is not granular enough to describe the expression anyway.

Hmm, perhaps we should add something like meta.repologyBasePackage for them.

If it can't be pname, just basePackage or basename or alike if it's possible. I don't believe there's anything specific to Repology here and I don't want repos to introduce any Repology specific things. Something close to upstream project name has a lot more uses than just Repology.

I agree. Something like canonicalPackage could be actually useful for our auto-update infrastructure.

We could redefine the values as follows:

  • name – Derivation name; an internal detail of Nix language; for packages, it is typically ${pname}-${variant}-${version}.
  • pname – Project name; when variant is not specified this is also a package name.
  • variant – Build configuration name; used to distinguish different package variants in nix-env

For Repology this is the most suitable (as long as pname is published). Am I missing something, or is separate canonicalPackage not needed in this schema because there's variant available?

Yeah, we have two parallel ways for selecting packages:

You can let Nix traverse the whole nixpkgs and try to find a derivation (think JSON document with the package data) with a matching pname portion of the name attribute. This is very slow, but it is still supported for legacy reasons.

Or you can get the derivation directly if you know the attribute path that leads to it (e.g. pkgs.python3Packages.numpy).

Obtaining any attribute from the derivation is easy, but getting the derivation from pname is inefficient.

Updaters will want to know the attribute path or canonical derivation itself (Nix is lazily evaluated) to spare themselves of the slowness of the first method.
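The difference between the two methods can be pictured with a Python analogy over a dictionary keyed by attribute path, shaped like the JSON dump quoted earlier. This is only an analogy: real lookups go through Nix evaluation, the helper names are made up, and the prefix comparison is a deliberate simplification of how the pname portion is matched.

```python
def find_by_name_prefix(packages: dict[str, dict], prefix: str) -> list[str]:
    """Slow path: visit every entry and compare the start of its name."""
    return [
        attr_path
        for attr_path, package in packages.items()
        if package["name"].startswith(prefix + "-")
    ]


def get_by_attr_path(packages: dict[str, dict], attr_path: str) -> dict:
    """Fast path: a single keyed lookup, no traversal."""
    return packages[attr_path]


# Names taken from the JSON excerpts in this thread.
packages = {
    "abcl": {"name": "abcl-1.5.0"},
    "uhubctl": {"name": "uhubctl-unstable-2019-07-31"},
}

print(find_by_name_prefix(packages, "uhubctl"))
print(get_by_attr_path(packages, "uhubctl")["name"])
```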

Actually, what about using the attribute paths (the keys in the JSON file) instead of derivation names for Repology?

Actually, what about using the attribute paths (the keys in the JSON file) instead of derivation names for Repology?

As I've mentioned above, these contain even more garbage than names.

So, I've somehow resolved this issue - I've refactored the parser a bit, and it's now simpler and more straightforward in detecting bad data. The update is no longer blocked on separate pname/version from nix, but I'm ready to switch to them as soon as these are provided. All incorrectly named packages are now dropped and logged. There are about 100 of them, so not too many, and they are all listed in the parse log (https://repology.org/repositories/updates#nix_unstable). The log contains much less noise now too (repology/repology-rules#297).

While here, I'd like to mention that the new mechanism for handling package names in all their variety (#931) is now in place, and it could make use of the attribute name mentioned by @jtojnar, so I wanted to ask about its meaning (and how it differs from pname). In short, Repology now stores a set of names with different purposes for each package:

  1. a string to derive project name from
  2. a string to show to the user as a package name (sometimes more human readable names are available, such as Firefox Browser)
  3. an identifier used to track a package over time, i.e. the most stable name (e.g. one not susceptible to changes like py36-foo → py37-foo, or foo → foo-client, or foo → foo-compat0)
  4. a set of names used to refer the package from the outside (repology/repology-webapp#66), currently:
    • source package name
    • binary package name
    • generic name (useful when neither of above is applicable)

I wonder if attribute name would be useful as 3 or some of 4.

make use of the attribute name mentioned by @jtojnar, so I wanted to ask about its meaning (and how it differs from pname).

Nixpkgs is structured as nested attribute sets, in Nix parlance (very similar to Python dictionaries or JSON objects). The attribute name (or rather attribute path) denotes which attributes (keys) you need to traverse to get to a desired package. Most packages are in the top-level attribute set, but some things like Haskell and Python libraries, part of the GNOME platform packages, Qt libraries… are nested under a common attribute set in the top level (e.g. gnome3.gnome-boxes or python37Packages.setuptools).

The attribute path is used to unambiguously refer to a package in the package set. There are aliases (e.g. python3Packages = python3.pkgs, python3 = python37, bustle = haskellPackages.bustle, openal = openalSoft), so the mapping from attribute paths to packages is not injective.

For technical reasons, we sometimes also have multiple variants of a single project built with different configure flags, for example:

  • poppler, libsForQt5.poppler, poppler_gi and poppler_utils
  • python27Packages.setuptools and python37Packages.setuptools
  • libpulseaudio, pulseaudio and pulseaudioFull

Those will all be listed in the generated JSON file and may not necessarily have a different name attribute (for instance, they are libpulseaudio-12.2, pulseaudio-12.2 and pulseaudio-12.2, respectively for the aforementioned PulseAudio packages).
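To see which attribute paths share a derivation name, the dump entries can be grouped by name. A minimal Python sketch, using only the PulseAudio values mentioned above; the function name `attrs_by_name` is made up for this example.

```python
from collections import defaultdict


def attrs_by_name(packages: dict[str, str]) -> dict[str, list[str]]:
    """Group attribute paths by the derivation name they point to."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for attr_path, name in packages.items():
        grouped[name].append(attr_path)
    return dict(grouped)


# Attribute path -> name, as listed for the PulseAudio packages above.
packages = {
    "libpulseaudio": "libpulseaudio-12.2",
    "pulseaudio": "pulseaudio-12.2",
    "pulseaudioFull": "pulseaudio-12.2",
}

for name, attr_paths in attrs_by_name(packages).items():
    print(name, attr_paths)
```

Any name that maps to more than one attribute path cannot, on its own, identify a single entry in the dump.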

So I agree that attribute name would be potentially useful as 3. But as we do not have a notion of a canonical attribute path, the JSON dump chooses python37Packages.setuptools instead of more proper python3.pkgs.setuptools for reasons unknown to me.

And attribute paths can still change. For example, Nixpkgs did not allow dashes in attribute names in the past, so we used underscores instead. Now that the policy is somewhat relaxed, we are gradually moving packages to paths more in line with the upstream name.

But yeah, the attribute path should still be the most stable identifier.


Regarding pname: Nix does not have a concept of project name or package version. There is just a name attribute in a derivation (something like a concrete realization of a package) that was probably designed as a hint to disambiguate the installation paths (hashes) under /nix/store.

Since the hash of the package expression and the transitive closure of expressions of its dependencies is the primary disambiguator, the name can be basically anything but commonly we use the ${runtime}-${runtime version}-${project name}-${configuration variant}-${project version} (parts omitted when not needed). So the luajitPackages.nvim-client attribute path points to a derivation with a name attribute luajit-2.1.0-beta3-nvim-client-0.2.0-1, that is nvim-client version 0.2.0-1 for luajit version 2.1.0-beta3. We have the same project parametrized with a different runtime under lua53Packages.nvim-client and it has a name lua5.3-nvim-client-0.2.0-1.

Of course, the heuristic for extracting the version from a derivation name (basically equivalent to a regex like (?P<pname>.+?)(?:-(?P<version>[0-9].+))?) is no match for an unstructured attribute like name.


For 4, yeah, the attribute paths would fit the generic name, as that is what we use to refer to packages within nixpkgs.

The closest thing we have to a source package is a drv file, which contains instructions on what other derivations to obtain and how to build them into a store path, but those are instantiated locally from the expression located by an attribute path. And as for binary packages, we just check whether our binary cache already contains a store path for the derivation with that hash, and download it to the store if it does.