hugovk / top-pypi-packages

A regular dump of the most-downloaded packages from PyPI

Home Page: https://hugovk.github.io/top-pypi-packages


Use non-lowercased project names

jayvdb opened this issue

All project names are lower case and don't match the names shown on pypi.org, e.g. pyyaml instead of PyYAML. I suspect that may be the data this project has, in which case the problem is upstream.

That lowercasing is not very helpful: the names of projects can (and do) change over time in all sorts of ways, not just the case.

Applying lowercase can be done after the fact; it is a simple transform, but it is not reversible without post-processing all entries, as suggested in the follow-up comments on #1.

My use-case is that I need to match the list up with openSUSE package names, which must use the PyPI package name exactly, including casing and hyphen-vs-dash. The task is slightly more difficult and slower if I don't have the exact name to begin with.

If it can't be obtained from the source data, it is likely quicker for me to add post-processing to get the real name, rather than try to get exact results from case-insensitive openSUSE package searches.
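As a sketch of that post-processing: PyPI's JSON API accepts a normalized name and reports the canonical display name in the `name` field of the `info` object. The helper names below are illustrations, not part of any existing project.

```python
import json
from urllib.request import urlopen

# PyPI's per-project JSON API endpoint.
PYPI_JSON_URL = "https://pypi.org/pypi/{}/json"


def canonical_name_from_metadata(metadata: dict) -> str:
    """Return the display name PyPI stores for a project."""
    return metadata["info"]["name"]


def fetch_canonical_name(normalized: str) -> str:
    """Look up the real (cased) name for a normalized project name
    via PyPI's JSON API (requires network access)."""
    with urlopen(PYPI_JSON_URL.format(normalized)) as resp:
        return canonical_name_from_metadata(json.load(resp))


# Abridged example of the shape the API returns:
sample = {"info": {"name": "PyYAML"}}
print(canonical_name_from_metadata(sample))  # PyYAML
```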

This repo doesn't alter the names, it dumps the result from pypinfo:

/usr/local/bin/pypinfo --json --indent 0 --limit 5000 --days 30 "" project > top-pypi-packages-30-days.json
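For anyone consuming that dump, reading it might look like the sketch below. The field names (`rows`, `project`, `download_count`) are assumptions based on pypinfo's JSON output and may differ.

```python
import json


def top_projects(data: dict, n: int = 3):
    """Return (project, download_count) pairs for the first n rows.

    Assumes the pypinfo-style layout: {"rows": [{"project": ...,
    "download_count": ...}, ...]}.
    """
    return [(row["project"], row["download_count"]) for row in data["rows"][:n]]


# Inline sample standing in for top-pypi-packages-30-days.json:
sample = json.loads("""
{"rows": [{"project": "urllib3", "download_count": 100},
          {"project": "pyyaml", "download_count": 90}]}
""")
print(top_projects(sample))
```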

Having a quick look in pypinfo, it's not changing the name of projects received from the Google BigQuery client.


pypinfo does have this:

import re

def normalize(name):
    """https://www.python.org/dev/peps/pep-0503/#normalized-names"""
    return re.sub(r'[-_.]+', '-', name).lower()

But that's only used to normalise the input when requesting info about a single project, and that argument is blank in this case.

https://www.python.org/dev/peps/pep-0503/#normalized-names says:

This PEP references the concept of a "normalized" project name. As per PEP 426 the only valid characters in a name are the ASCII alphabet, ASCII numbers, ., -, and _. The name should be lowercased with all runs of the characters ., -, or _ replaced with a single - character. This can be implemented in Python with the re module:

(And then gives the same function.)
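To make the one-way nature of that transform concrete, here are a few names run through it (project names chosen only as examples):

```python
import re


def normalize(name):
    """PEP 503 normalization: lowercase, collapse runs of ., -, _ to -."""
    return re.sub(r"[-_.]+", "-", name).lower()


print(normalize("PyYAML"))            # pyyaml
print(normalize("zope.interface"))    # zope-interface
print(normalize("Flask_SQLAlchemy"))  # flask-sqlalchemy
```

The casing and the original separator characters are discarded, which is why the transform can't be reversed without an external lookup.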


I didn't check whether Google BigQuery can also return the un-normalised name; if so, that'd need a change to pypinfo before it could be added here.

If that's not possible or easy, then I'd be fine adding extra data here. Rather than post-processing, I think a second JSON file would be better.
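A hypothetical shape for such a second file, mapping normalized names back to display names (the structure here is purely an illustration, not anything the repo has committed to):

```python
import json

# Hypothetical mapping from PEP 503 normalized name to display name.
mapping = {
    "pyyaml": "PyYAML",
    "zope-interface": "zope.interface",
}

print(json.dumps(mapping, indent=2))
```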


Or are the openSUSE package names identical to the PyPI names (eg. PyYAML)?

If so, can you normalise PyYAML into pyyaml and then use the data here?

> Or are the openSUSE package names identical to the PyPI names (eg. PyYAML)?

Yes, with a `python-` prefix.

https://build.opensuse.org/package/show/openSUSE:Factory/python-PyYAML

I would prefer to be using this data first, and looking up against openSUSE, rather than the other way around, or building a database of both and cross referencing.
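Under that convention, matching an openSUSE name against this list could be sketched as stripping the prefix and normalizing. This helper is hypothetical, not part of either project:

```python
import re


def normalize(name):
    """PEP 503 normalization."""
    return re.sub(r"[-_.]+", "-", name).lower()


def opensuse_to_normalized(pkg: str) -> str:
    """Map an openSUSE package name like 'python-PyYAML' to the
    normalized PyPI name used in this repo's data."""
    prefix = "python-"
    if pkg.startswith(prefix):
        pkg = pkg[len(prefix):]
    return normalize(pkg)


print(opensuse_to_normalized("python-PyYAML"))  # pyyaml
```

Going the other way (normalized name back to openSUSE name) is the hard direction, since the casing is gone.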

I'll see what is happening inside pypinfo

The schema is at https://bigquery.cloud.google.com/table/the-psf:pypi.downloads20161022?tab=schema, and both url and file.filename have the proper project name; I have got them working with ad hoc queries. So now I just need to propose a PR to pypinfo to use the filename. It might be slightly slower, depending on whether BigQuery supports some more advanced SQL join syntax, and possibly even using https://bigquery.cloud.google.com/table/the-psf:pypi.simple_requests instead.
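To illustrate what extracting the name from file.filename involves, here is a rough Python sketch of the parsing an equivalent BigQuery expression would need. It assumes sdist filenames of the form `<name>-<version>.tar.gz`; real filenames are messier (wheels, names whose hyphens are followed by digits, etc.), so this is only an approximation.

```python
import re


def project_from_sdist_filename(filename: str) -> str:
    """Recover the display name from an sdist filename like
    '<name>-<version>.tar.gz' or '<name>-<version>.zip'.

    Greedy matching keeps everything up to the last '-<digit...>'
    segment as the name, preserving the original casing.
    """
    m = re.match(r"(?P<name>.+)-(?P<version>\d[^-]*)\.(tar\.gz|zip)$", filename)
    if not m:
        raise ValueError(f"unrecognized filename: {filename}")
    return m.group("name")


print(project_from_sdist_filename("PyYAML-5.4.1.tar.gz"))  # PyYAML
```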

Sounds good! One concern is the amount of BigQuery quota used, to ensure two requests can be made each week with the free quota. Hopefully it won't increase the amount used too much, but it would be nice to see the difference.

pypinfo reports how big each query is; you can see it in the JSON here.

Good list! (I need to make a list of things using this data, too.)

Of those, https://github.com/psincraian/pepy and https://github.com/crflynn/pypistats.org are websites which essentially cache BigQuery data.

The latter is especially good and has an API, for which I've written a CLI client:

https://pypistats.org/api/
https://github.com/hugovk/pypistats

The data is limited to 6 months, and neither pepy nor pypistats.org has this specific mapping we're talking about. But maybe they could?

> One concern is the amount of BigQuery quota used, to ensure two requests can be made each week with the free quota.

It shouldn't be extra queries, just slightly slower ones, assuming the SQL engine is halfway decent.

Based on your recommendation, I've created issues in both of those projects to see which, if any, have an interest.

You'll be interested to learn that pepy is growing an API: psincraian/pepy@b3cf4ee

Now that I have the SQL changes needed (see queries at psincraian/pepy#128 (comment)), I've also created an issue at ofek/pypinfo#73 before making the change there.