MartinThoma / algorithms

This repository is for learning and understanding how algorithms work.

Dependency Graph

Fazel94 opened this issue · comments

If you upload all the metadata, or just the dependencies, in some easy-to-use format like XML, JSON, or even a full MySQL database dump, I can build a dependency graph and thus answer the questions from your blog post.
I can implement an adaptation of PageRank or a similar algorithm to find the impact factor of packages; a sketch of that idea follows.
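
A minimal sketch of such a PageRank adaptation, assuming the dependencies are available as (package, dependency) pairs in a CSV file; the file name edges.csv and the use of networkx are my assumptions, not something from this thread:

```python
import csv

import networkx as nx

# Hypothetical input: one "package,dependency" pair per line.
EDGES_FILE = "edges.csv"

graph = nx.DiGraph()
with open(EDGES_FILE, newline="") as f:
    for package, dependency in csv.reader(f):
        # Edges point from a package to what it depends on, so
        # PageRank mass flows towards widely depended-upon packages.
        graph.add_edge(package, dependency)

# Standard PageRank; alpha=0.85 is the usual damping factor.
impact = nx.pagerank(graph, alpha=0.85)

# The ten packages with the highest "impact factor".
for name, score in sorted(impact.items(), key=lambda x: -x[1])[:10]:
    print(f"{name}: {score:.6f}")
```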

@Fazel94 Thank you for offering your help. Could you please tell me which blog article you are referring to and which data I should upload?

Sorry, here is the post I'm talking about:
http://martin-thoma.com/analyzing-pypi-metadata/

I would be glad to mine the PyPI data, but it would be a big help if I could get around scraping PyPI myself.
I mean a formatted database or a serialized version of the metadata (as long as preparing it is not a burden for you), especially the dependency list for each package, so I can build a dependency graph from it and maybe do a little frequent-itemset counting to find out which packages people use together (a sketch of that counting is below).
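
A rough sketch of such pairwise frequent-itemset counting; the deps variable is made-up example data standing in for the real dependency lists:

```python
from collections import Counter
from itertools import combinations

# Made-up example input: the dependency list of each project.
deps = [
    ["requests", "six", "numpy"],
    ["requests", "six"],
    ["numpy", "scipy"],
]

# Count how often each pair of packages occurs together; this is
# frequent-itemset counting restricted to itemsets of size two.
pair_counts = Counter()
for dep_list in deps:
    for pair in combinations(sorted(set(dep_list)), 2):
        pair_counts[pair] += 1

for (a, b), count in pair_counts.most_common(5):
    print(f"{a} + {b}: {count}")
```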

Thank you for your attention.

> especially the dependency list for each package

There is no such thing as a dependency list for each package in the PyPI metadata. You could only download all the packages (completely), look for a requirements.txt, and read that.
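
A minimal sketch of that approach for a single source distribution; the archive name is a made-up example:

```python
import tarfile

# Made-up example: one downloaded source distribution.
SDIST = "requests-2.31.0.tar.gz"

requirements = []
with tarfile.open(SDIST, "r:gz") as archive:
    for member in archive.getmembers():
        # Look for any requirements.txt inside the archive.
        if member.isfile() and member.name.endswith("requirements.txt"):
            text = archive.extractfile(member).read().decode("utf-8")
            for line in text.splitlines():
                line = line.strip()
                if line and not line.startswith("#"):
                    requirements.append(line)

print(requirements)
```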

I can upload the data; however, it is quite a lot. I'm currently running the script again. The scripts beginning with "c" are currently running, and even a 7z-compressed CSV version of the packages table is about 3 MB.

Would that still be of use to you? If you really want to build the dependency graph, you have to download quite a massive amount of data. Estimating with the query

SELECT sum(size)/1000000000 FROM `urls`

it is currently about 3.3 GB. I can give you a better approximation tomorrow.

Where should I upload it?

Currently it is at pyromancer and 16.35 GB.

I've added a script to check for imports in a package (a sketch of that idea follows the lists below).

TODOs are:

  • apply that script to the latest versions of all packages in PyPI
  • analyze the setup.py

Done:

  • download the Python package
  • extract it
  • get the python files
  • insert the gathered data into the database
  • (add a new table to the database for dependencies)
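
A minimal, self-contained sketch of such an import check, assuming an extracted package is just a directory of .py files; scanning with ast is my assumption about the approach, not necessarily what the actual script does:

```python
import ast
import pathlib

# Made-up example: the directory of one extracted package.
PACKAGE_DIR = "some_extracted_package"

imports = set()
for py_file in pathlib.Path(PACKAGE_DIR).rglob("*.py"):
    try:
        tree = ast.parse(py_file.read_text(encoding="utf-8"))
    except (SyntaxError, UnicodeDecodeError):
        continue  # skip files that are not valid Python / UTF-8
    for node in ast.walk(tree):
        # Record the top-level module of every import statement.
        if isinstance(node, ast.Import):
            imports.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imports.add(node.module.split(".")[0])

print(sorted(imports))
```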

OK, I've just put some more work into it:

If you really want to make the dependency graph, you still have to:

  • implement get_setup_packages in package_analysis.py (one possible approach is sketched below)
  • run ./package_analysis for all the latest releases
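
One hedged way to implement get_setup_packages without executing untrusted setup.py code would be to parse it with ast and look for a literal install_requires list; this is my assumption about a reasonable approach, not the repository's actual design:

```python
import ast

def get_setup_packages(setup_py_path):
    """Extract install_requires entries from a setup.py, if statically visible."""
    with open(setup_py_path, encoding="utf-8") as f:
        tree = ast.parse(f.read())
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            for keyword in node.keywords:
                if keyword.arg != "install_requires":
                    continue
                # Only handles literal lists like ["foo>=1.0", "bar"];
                # dynamically built lists stay invisible to this scan.
                if isinstance(keyword.value, (ast.List, ast.Tuple)):
                    return [
                        elt.value
                        for elt in keyword.value.elts
                        if isinstance(elt, ast.Constant)
                        and isinstance(elt.value, str)
                    ]
    return []
```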

This will fill your database with all possible dependencies. Even if you don't implement get_setup_packages, it will probably still add almost all dependencies. However, even with a VERY good internet connection, I expect that this will take several days to run. One could parallelize the download of the packages (a sketch is below), but that would still need many hours.
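
A rough sketch of such a parallelized download with a thread pool; the URL list and directory name are made-up examples:

```python
import concurrent.futures
import os
import urllib.request

# Made-up example input: URLs of the latest release of each package.
urls = [
    "https://example.com/packages/requests-2.31.0.tar.gz",
    "https://example.com/packages/six-1.16.0.tar.gz",
]

def download(url, target_dir="downloads"):
    """Download one package archive and return its local path."""
    os.makedirs(target_dir, exist_ok=True)
    path = os.path.join(target_dir, url.rsplit("/", 1)[-1])
    urllib.request.urlretrieve(url, path)
    return path

# Threads are fine here because the work is I/O-bound; more workers
# mostly just shift the bottleneck to the network connection.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    for path in pool.map(download, urls):
        print("done:", path)
```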

@Fazel94 I've just set the script running over the complete PyPI database. That will take quite a while. And it currently ignores setuptools, which is a major issue (but it was too complicated to make a secure / fast implementation within just a couple of hours; you could add that, if you want).

How would you like to visualize the graph? It has 67582 nodes and a lot more than 4600 edges (I'm still downloading / building the graph... it takes a while). You cannot use Graphviz for that; one workaround is sketched below.
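
One hedged option for a graph this size: keep only a high-degree subgraph and export it to a format a tool like Gephi can open; the file names here are made-up:

```python
import networkx as nx

# Made-up example input: one "package dependency" pair per line.
graph = nx.read_edgelist("dependencies.edgelist", create_using=nx.DiGraph)

# The full graph is far too big to draw directly, so keep only the
# 300 nodes with the highest degree and export that subgraph.
top = sorted(graph.degree, key=lambda nd: -nd[1])[:300]
subgraph = graph.subgraph(node for node, _ in top)

# GEXF files open in Gephi, which copes with large graphs far
# better than Graphviz does.
nx.write_gexf(subgraph, "dependencies_top300.gexf")
```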

(By the way, do we know each other? Are you a student from KIT, too?)

So far, the most imported module is os, followed (not even close) by sys, logging, re ... and org. I guess that last one is an error? I have no idea where it comes from.
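
For reference, a tiny sketch of how such a tally can be computed, assuming per-package import sets like those from the scanning script above; imports_per_package is made-up example data:

```python
from collections import Counter

# Made-up example input: the set of imported modules per package.
imports_per_package = {
    "pkg_a": {"os", "sys", "logging"},
    "pkg_b": {"os", "re"},
    "pkg_c": {"os", "sys"},
}

counts = Counter()
for modules in imports_per_package.values():
    counts.update(modules)

# Most frequently imported modules across all packages.
for module, count in counts.most_common(10):
    print(f"{module}: {count}")
```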