hugovk / top-pypi-packages

A regular dump of the most-downloaded packages from PyPI

Home Page: https://hugovk.github.io/top-pypi-packages

Is it possible to add repo name to top-pypi-packages.json?

cclauss opened this issue

  {
    "download_count": 282748018,
    "project": "simplejson",
    "repo": "https://github.com/simplejson/simplejson"
  },

If the project's "home page" is on github.com or github.io, we can probably make educated guesses.
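One way such an educated guess could look, as a rough sketch only. The function name `guess_repo` is hypothetical, and the github.io rule assumes the Pages site is published from a repo of the same name, which will not always hold:

```python
from urllib.parse import urlparse


def guess_repo(home_page):
    """Guess a GitHub repo URL from a project's home page, or return None."""
    if not home_page:
        return None
    parsed = urlparse(home_page)
    host = (parsed.hostname or "").lower()
    parts = [p for p in parsed.path.split("/") if p]

    # https://github.com/<owner>/<repo>[/...] -> keep only owner and repo
    if host in ("github.com", "www.github.com") and len(parts) >= 2:
        repo = parts[1]
        if repo.endswith(".git"):
            repo = repo[:-4]
        return f"https://github.com/{parts[0]}/{repo}"

    # https://<owner>.github.io/<repo>[/...] -> assume the repo shares the path name
    if host.endswith(".github.io") and parts:
        owner = host[: -len(".github.io")]
        return f"https://github.com/{owner}/{parts[0]}"

    # Anything else would need a manual mapping
    return None
```

For example, `guess_repo("https://hugovk.github.io/top-pypi-packages")` gives `https://github.com/hugovk/top-pypi-packages`, while a Read the Docs home page gives `None`.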

If not, we can create a YAML file that maps each project to its repo. We would need to keep that file up to date as new projects appear.

It also raises the question of how to support projects hosted outside GitHub, such as on GitLab.

There could be a post-processing step that runs on top-pypi-packages.json.

open top-pypi-packages.json
for each package in top-pypi-packages:
  if no repo for package:
    fetch JSON from PyPI, e.g. https://pypi.python.org/pypi/simplejson/json
    if "github" or "gitlab" or similar in url:
      mangle the link and store this as repo
    elif "github" or "gitlab" or similar in description or long_description:
      extract and mangle the link and store this as repo
save top-pypi-packages.json
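
The steps above could be sketched in Python roughly as follows. This is an illustration under assumptions, not the script used for this repo. The helper names (`fetch_pypi_info`, `find_repo`, `add_repos`) are made up. It reads the `home_page` and `description` fields of PyPI's JSON metadata, and it only recognises github.com and gitlab.com links. The fetcher is passed in as a parameter so the logic can be tried without network access.

```python
import json
import re
import urllib.request

# Matches github.com or gitlab.com links of the form host/owner/repo.
REPO_RE = re.compile(r"https?://(?:www\.)?(github|gitlab)\.com/([\w.-]+)/([\w.-]+)")


def fetch_pypi_info(project):
    """Fetch the "info" section of a project's PyPI JSON metadata."""
    url = f"https://pypi.org/pypi/{project}/json"
    with urllib.request.urlopen(url) as response:
        return json.load(response)["info"]


def find_repo(info):
    """Return the first GitHub/GitLab repo link in the metadata, or None."""
    # Check the home page first, then fall back to the description text.
    for field in ("home_page", "description"):
        match = REPO_RE.search(info.get(field) or "")
        if match:
            host, owner, repo = match.groups()
            repo = repo.rstrip(".")  # drop sentence punctuation after a link
            if repo.endswith(".git"):
                repo = repo[:-4]
            return f"https://{host}.com/{owner}/{repo}"
    return None


def add_repos(packages, fetch=fetch_pypi_info):
    """Fill in a "repo" key for each package dict that lacks one."""
    for package in packages:
        if package.get("repo"):
            continue  # already known, so no request is needed
        repo = find_repo(fetch(package["project"]))
        if repo:
            package["repo"] = repo
    return packages
```

Calling `add_repos` on the list of package entries loaded from top-pypi-packages.json would make one request per package that has no repo yet, so a real run would likely want rate limiting and error handling around the fetch.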

Perhaps the mapping of project -> repo would be better kept in a second JSON file? That way, when projects drop off the bottom of the list and later rejoin it, their repos won't be lost and need re-adding. Manual corrections and additions won't be lost either.
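
A minimal sketch of how such a second file might be maintained, assuming it is a flat JSON object of project name to repo URL. The function name `update_repo_map` and the file name used below are hypothetical:

```python
import json
from pathlib import Path


def update_repo_map(path, found):
    """Merge newly found project -> repo pairs into the JSON file at path."""
    path = Path(path)
    repos = json.loads(path.read_text()) if path.exists() else {}
    for project, repo in found.items():
        # setdefault never overwrites, so existing and hand-edited entries win.
        repos.setdefault(project, repo)
    # Entries for projects that left the top list are simply never deleted.
    path.write_text(json.dumps(repos, indent=2, sort_keys=True) + "\n")
    return repos
```

Running `update_repo_map("repos.json", found)` after each dump only ever adds entries, so a project that leaves the list keeps its mapping and a manual fix survives later runs.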

I've written a couple of scripts to make a separate JSON file of repos, have a look at:

Currently, it finds 3,951 repos for the top 5,000 packages. I'm not planning on automating this, but can run it from time to time to update it.