MartinThoma / algorithms

This repository is for learning and understanding how algorithms work.

Dependency Graph

Fazel94 opened this issue · comments

If you upload all the metadata, or just the dependencies, in some easy-to-use format like XML, JSON, or even a full MySQL database dump, I can build a dependency graph and thus answer the questions from your blog post.
I can implement an adaptation of PageRank or a similar algorithm to find the impact factor of packages; a sketch of that idea follows.
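
A minimal sketch of such a PageRank adaptation, assuming the dependencies are available as (package, dependency) pairs in a CSV file; the file name edges.csv and the use of networkx are my assumptions, not something from this thread:

```python
import csv

import networkx as nx

# Hypothetical input: one "package,dependency" pair per line.
EDGES_FILE = "edges.csv"

graph = nx.DiGraph()
with open(EDGES_FILE, newline="") as f:
    for package, dependency in csv.reader(f):
        # Edges point from a package to what it depends on, so
        # PageRank mass flows towards widely depended-upon packages.
        graph.add_edge(package, dependency)

# Standard PageRank; alpha=0.85 is the usual damping factor.
impact = nx.pagerank(graph, alpha=0.85)

# The ten packages with the highest "impact factor".
for name, score in sorted(impact.items(), key=lambda x: -x[1])[:10]:
    print(f"{name}: {score:.6f}")
```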

@Fazel94 Thank you for offering your help. Could you please tell me which blog article you are referring to and which data I should upload?

Sorry, here is the post I'm talking about:
http://martin-thoma.com/analyzing-pypi-metadata/

I would be glad to mine the PyPI data, but it would be a big help if I could get around scraping PyPI myself.
I mean a formatted database or a serialized version of the metadata (as long as preparing it is not a burden for you), especially the dependency list for each package, so I can build a dependency graph from it and maybe do a little frequent-itemset counting to find out which packages people use together (a sketch of that counting is below).
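
A rough sketch of such pairwise frequent-itemset counting; the deps variable is made-up example data standing in for the real dependency lists:

```python
from collections import Counter
from itertools import combinations

# Made-up example input: the dependency list of each project.
deps = [
    ["requests", "six", "numpy"],
    ["requests", "six"],
    ["numpy", "scipy"],
]

# Count how often each pair of packages occurs together; this is
# frequent-itemset counting restricted to itemsets of size two.
pair_counts = Counter()
for dep_list in deps:
    for pair in combinations(sorted(set(dep_list)), 2):
        pair_counts[pair] += 1

for (a, b), count in pair_counts.most_common(5):
    print(f"{a} + {b}: {count}")
```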

Thank you for your attention.

> especially the dependency list for each package

There is no such thing as a dependency list for each package in the PyPI metadata. You could only download all the packages (completely), look for a requirements.txt, and read that.
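
A minimal sketch of that approach for a single source distribution; the archive name is a made-up example:

```python
import tarfile

# Made-up example: one downloaded source distribution.
SDIST = "requests-2.31.0.tar.gz"

requirements = []
with tarfile.open(SDIST, "r:gz") as archive:
    for member in archive.getmembers():
        # Look for any requirements.txt inside the archive.
        if member.isfile() and member.name.endswith("requirements.txt"):
            text = archive.extractfile(member).read().decode("utf-8")
            for line in text.splitlines():
                line = line.strip()
                if line and not line.startswith("#"):
                    requirements.append(line)

print(requirements)
```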

I can upload the data; however, it is quite a lot. I'm currently running the script again. The scripts beginning with "c" are currently running, and even a 7z-compressed CSV version of the packages table is about 3 MB.

Would that still be of use to you? If you really want to build the dependency graph, you have to download quite a massive amount of data. Estimating with the query

SELECT sum(size)/1000000000 FROM `urls`

it is currently about 3.3 GB. I can give you a better approximation tomorrow.

Where should I upload it?

Currently it is at pyromancer and 16.35 GB.

I've added a script to check for imports in a package (a sketch of that idea follows the lists below).

TODOs are:

  • apply that script to the latest versions of all packages in PyPI
  • analyze the setup.py

Done:

  • download the Python package
  • extract it
  • get the python files
  • insert the gathered data into the database
  • (add a new table to the database for dependencies)
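
A minimal, self-contained sketch of such an import check, assuming an extracted package is just a directory of .py files; scanning with ast is my assumption about the approach, not necessarily what the actual script does:

```python
import ast
import pathlib

# Made-up example: the directory of one extracted package.
PACKAGE_DIR = "some_extracted_package"

imports = set()
for py_file in pathlib.Path(PACKAGE_DIR).rglob("*.py"):
    try:
        tree = ast.parse(py_file.read_text(encoding="utf-8"))
    except (SyntaxError, UnicodeDecodeError):
        continue  # skip files that are not valid Python / UTF-8
    for node in ast.walk(tree):
        # Record the top-level module of every import statement.
        if isinstance(node, ast.Import):
            imports.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imports.add(node.module.split(".")[0])

print(sorted(imports))
```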

OK, I've just put some more work into it:

If you really want to make the dependency graph, you still have to:

  • implement get_setup_packages in package_analysis.py (one possible approach is sketched below)
  • run ./package_analysis for all the latest releases
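
One hedged way to implement get_setup_packages without executing untrusted setup.py code would be to parse it with ast and look for a literal install_requires list; this is my assumption about a reasonable approach, not the repository's actual design:

```python
import ast

def get_setup_packages(setup_py_path):
    """Extract install_requires entries from a setup.py, if statically visible."""
    with open(setup_py_path, encoding="utf-8") as f:
        tree = ast.parse(f.read())
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            for keyword in node.keywords:
                if keyword.arg != "install_requires":
                    continue
                # Only handles literal lists like ["foo>=1.0", "bar"];
                # dynamically built lists stay invisible to this scan.
                if isinstance(keyword.value, (ast.List, ast.Tuple)):
                    return [
                        elt.value
                        for elt in keyword.value.elts
                        if isinstance(elt, ast.Constant)
                        and isinstance(elt.value, str)
                    ]
    return []
```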

This will fill your database with all possible dependencies. Even if you don't implement get_setup_packages, it will probably still add almost all dependencies. However, even with a VERY good internet connection, I expect that this will take several days to run. One could parallelize the download of the packages (a sketch is below), but that would still need many hours.
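
A rough sketch of such a parallelized download with a thread pool; the URL list and directory name are made-up examples:

```python
import concurrent.futures
import os
import urllib.request

# Made-up example input: URLs of the latest release of each package.
urls = [
    "https://example.com/packages/requests-2.31.0.tar.gz",
    "https://example.com/packages/six-1.16.0.tar.gz",
]

def download(url, target_dir="downloads"):
    """Download one package archive and return its local path."""
    os.makedirs(target_dir, exist_ok=True)
    path = os.path.join(target_dir, url.rsplit("/", 1)[-1])
    urllib.request.urlretrieve(url, path)
    return path

# Threads are fine here because the work is I/O-bound; more workers
# mostly just shift the bottleneck to the network connection.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    for path in pool.map(download, urls):
        print("done:", path)
```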

@Fazel94 I've just set the script running over the complete PyPI database. That will take quite a while. And it currently ignores setuptools, which is a major issue (but it was too complicated to make a secure / fast implementation within just a couple of hours; you could add that, if you want).

How would you like to visualize the graph? It has 67582 nodes and a lot more than 4600 edges (I'm still downloading / building the graph... it takes a while). You cannot use Graphviz for that; one workaround is sketched below.
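
One hedged option for a graph this size: keep only a high-degree subgraph and export it to a format a tool like Gephi can open; the file names here are made-up:

```python
import networkx as nx

# Made-up example input: one "package dependency" pair per line.
graph = nx.read_edgelist("dependencies.edgelist", create_using=nx.DiGraph)

# The full graph is far too big to draw directly, so keep only the
# 300 nodes with the highest degree and export that subgraph.
top = sorted(graph.degree, key=lambda nd: -nd[1])[:300]
subgraph = graph.subgraph(node for node, _ in top)

# GEXF files open in Gephi, which copes with large graphs far
# better than Graphviz does.
nx.write_gexf(subgraph, "dependencies_top300.gexf")
```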

(By the way, do we know each other? Are you a student from KIT, too?)

So far, the most imported module is os, followed (not even close) by sys, logging, re ... and org. I guess that last one is an error? I have no idea where it comes from.
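
For reference, a tiny sketch of how such a tally can be computed, assuming per-package import sets like those from the scanning script above; imports_per_package is made-up example data:

```python
from collections import Counter

# Made-up example input: the set of imported modules per package.
imports_per_package = {
    "pkg_a": {"os", "sys", "logging"},
    "pkg_b": {"os", "re"},
    "pkg_c": {"os", "sys"},
}

counts = Counter()
for modules in imports_per_package.values():
    counts.update(modules)

# Most frequently imported modules across all packages.
for module, count in counts.most_common(10):
    print(f"{module}: {count}")
```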