PyConCanada2015

My scrapers + data + analysis for PyConCanada2015 Keynote

(sorry github)

Frequency of libraries in `requirements.txt` files in Github Python repositories

This was done by scraping 10k+ Python repositories on Github that contain a requirements.txt file. This file is commonly used to store dependencies of the repository.
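As a rough illustration of the counting step (not the actual scraper from this repo), the sketch below parses the text of a few `requirements.txt` files and counts how many files mention each library. The helper names `parse_requirements` and `library_frequencies` and the sample data are made up for this example, and the parsing is deliberately simplistic.

```python
import re
from collections import Counter

# Leading package name of a requirement line, e.g. "Django>=1.8" -> "Django".
_NAME = re.compile(r"^([A-Za-z0-9][A-Za-z0-9._-]*)")


def parse_requirements(text):
    """Return the set of lowercased package names found in one requirements.txt."""
    names = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        # Skip blanks, pip options ("-r dev.txt", "-e ...") and URL-style requirements.
        if not line or line.startswith("-") or "://" in line:
            continue
        match = _NAME.match(line)
        if match:
            names.add(match.group(1).lower())
    return names


def library_frequencies(files):
    """Count in how many requirements files each library appears."""
    counts = Counter()
    for text in files:
        counts.update(parse_requirements(text))
    return counts


# Hypothetical sample data, standing in for the scraped files.
sample_files = [
    "Django==1.8.4\npsycopg2>=2.6  # postgres driver\n",
    "django\nrequests\n-r dev.txt\n",
    "requests==2.7.0\n",
]
frequencies = library_frequencies(sample_files)
print(frequencies.most_common(3))
```

Lowercasing the names is a design choice here so that `Django` and `django` count as the same library.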

It's clear that either the majority of Python repositories on Github are web-development related, or web developers are the most likely to include a proper requirements.txt file in their repositories.

Relationships between libraries

Using the data in `requirements.txt` files, we can find common co-occurrences of libraries. For example, it's not hard to imagine that whenever django is a requirement, so is psycopg2. In fact, in the dataset I had, 41% of all django apps also included psycopg2. These relationships can be mined using a simple algorithm called the apriori algorithm. Its history goes back to large department stores that were interested in which products were commonly bought together. The naive solution, comparing all possible pairs, results in a quadratic algorithm - and if you have thousands of products, this becomes inefficient quickly. The apriori algorithm intelligently cuts through this massive space by only considering combinations whose members are themselves frequent.
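To make the pruning idea concrete, here is a minimal sketch of apriori restricted to single libraries and pairs. It is an illustration under assumptions, not the code used for the talk: the function name `frequent_pairs`, the `min_count` threshold, and the toy data are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_count):
    """Two-pass apriori, limited to itemsets of size 1 and 2."""
    # Pass 1: count single libraries and keep only the frequent ones.
    item_counts = Counter()
    for items in transactions:
        item_counts.update(set(items))
    frequent_items = {item for item, n in item_counts.items() if n >= min_count}

    # Pass 2: a pair can only be frequent if both of its members are frequent,
    # so candidate pairs are built from the surviving items only.
    pair_counts = Counter()
    for items in transactions:
        kept = sorted(set(items) & frequent_items)
        pair_counts.update(combinations(kept, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}


# Toy data: each set stands in for one requirements.txt file.
toy_transactions = [
    {"django", "psycopg2", "gunicorn"},
    {"django", "psycopg2"},
    {"django", "requests"},
    {"flask", "requests"},
]
toy_pairs = frequent_pairs(toy_transactions, min_count=2)
print(toy_pairs)
```

In the toy data, `gunicorn` and `flask` each appear only once, so no pair containing them is ever counted. That is the saving over comparing all possible pairs.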

Here are the other common libraries paired with django:

Confidence is defined as:

confidence = P(ending_with | starting_with)
           = P(starting_with and ending_with) / P(starting_with)
           = #{requirements.txt files with both} / #{requirements.txt files with starting_with}

starting_with	ending_with	starting_with_occurrences	confidence	occurrences	ending_with_occurrences
django	requests	2714	0.243920412675	662	2463
django	wheel	2714	0.22402358143	608	1649
django	six	2714	0.245394252027	666	1985
django	psycopg2	2714	0.411569638909	1117	1573
django	gunicorn	2714	0.320191599116	869	1531
django	dj-database-url	2714	0.263448784083	715	728
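The confidence definition above translates almost directly into code. This is a small sketch that assumes each `requirements.txt` has already been reduced to a set of library names; the function name `confidence` and the toy sets are made up for illustration.

```python
def confidence(requirement_sets, starting_with, ending_with):
    """Share of files containing `starting_with` that also contain `ending_with`."""
    with_start = [s for s in requirement_sets if starting_with in s]
    if not with_start:
        return 0.0
    with_both = sum(1 for s in with_start if ending_with in s)
    return with_both / float(len(with_start))


# Toy data: each set stands in for one requirements.txt file.
toy_sets = [
    {"django", "psycopg2", "gunicorn"},
    {"django", "psycopg2"},
    {"django", "requests"},
    {"flask", "requests"},
]
print(confidence(toy_sets, "django", "psycopg2"))

# The django / psycopg2 row of the table is the same ratio on the real counts.
table_confidence = 1117 / float(2714)
print(round(table_confidence, 4))
```

Note that confidence is not symmetric: in the table, almost every file with dj-database-url also has django (715 of 728), while only about 26% of django files have dj-database-url.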

Here are the results for other libraries including some metrics to sort on. To read more about these metrics, see this link.

Now, let's recommend libraries based on these relationships

So, if we know a user installed django, we can perhaps recommend that they also install psycopg2 (according to the table above, we would be right 41% of the time). We can turn these co-occurrences into a very simple recommendation algorithm for Python libraries! So I've gone ahead and done that.
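One way such a recommender could look is sketched below: count co-occurrences once, then rank the partners of a given library by confidence. This is only an illustration of the idea; it is not taken from the tool described in the next section, and `build_recommender`, `recommend`, and the toy data are hypothetical names.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_recommender(requirement_sets):
    """Precompute co-occurrence counts and return a recommend(library) function."""
    item_counts = Counter()
    pair_counts = defaultdict(Counter)
    for items in requirement_sets:
        items = set(items)
        item_counts.update(items)
        for first, second in permutations(items, 2):
            pair_counts[first][second] += 1

    def recommend(library, top_n=3):
        """Return up to top_n (other_library, confidence) pairs, best first."""
        total = item_counts.get(library, 0)
        if total == 0:
            return []  # never seen this library, nothing to suggest
        ranked = sorted(pair_counts[library].items(),
                        key=lambda kv: (-kv[1], kv[0]))  # ties broken by name
        return [(other, count / float(total)) for other, count in ranked[:top_n]]

    return recommend


# Toy data: each set stands in for one requirements.txt file.
toy_sets = [
    {"django", "psycopg2", "gunicorn"},
    {"django", "psycopg2"},
    {"django", "requests"},
    {"flask", "requests"},
]
recommend = build_recommender(toy_sets)
print(recommend("django", top_n=2))
```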

`pipp`: one of the `p`s stands for personalized!

Yes, that's right - we can bring you library recommendations right to the command line. Try it out!

pip install pipp

$ pipp install jsmin
Requirement already satisfied (use --upgrade to upgrade): jsmin in /Users/camerondavidson-pilon/.virtualenvs/data/lib/python2.7/site-packages
pipp: Other users who installed jsmin also installed cssmin

Command line too nerdy for you? How about recommendations on PyPI?

Network force-layout of libraries in `requirements.txt` files in Github Python repositories

This is biased, as some libraries have their own requirements. For example, Pandas depends on Numpy, so it would be less common to have both Pandas and Numpy in a requirements.txt file.

The Plural of Anecdote is Data!

I've often heard techies say the plural of anecdote is not data. I see where they are coming from, however I think they are being shortsighted. Whereas one or two occurrences of something are not enough evidence to prove a fact, they are evidence of something interesting. And if you have a tool to quickly confirm or deny further occurrences of this anecdote, then yes, you have data. For example, how often do you see links to stackoverflow questions in code? I have seen it before, and I wondered, how common is this?

Using one of the greatest anecdote validation tools, Search, we can validate this idea:

Great, now let's start scraping. Here are the most common questions linked in Python code:

stackoverflow.com/questions/19622133    1173
stackoverflow.com/questions/279237       887
stackoverflow.com/questions/5658622      320
stackoverflow.com/questions/22019341     134
stackoverflow.com/questions/35817        117
stackoverflow.com/questions/1769332       89
stackoverflow.com/questions/377017        86
stackoverflow.com/questions/1189781       73
stackoverflow.com/questions/4124220       70
stackoverflow.com/questions/701802        66

Let's investigate the first one. It's a very specific question about Windows and ctypes - not a common problem in the first place. If we search for just that url on Github, we see it's all from the same file, windows_support.py. Investigating those repos with the url, we see that 1. not only is this from code inside Python 2.7, but 2. people are including all of Python 2.7 in their Github repos!
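Counts like the ones listed above could be produced with a regular expression over the source text of each file. The sketch below is an assumption about how that might be done, not the scraper from this repo; `SO_QUESTION`, `count_question_links`, and the sample sources are illustrative names and data.

```python
import re
from collections import Counter

# Matches both the long form (/questions/<id>/...) and the short form (/q/<id>).
SO_QUESTION = re.compile(r"stackoverflow\.com/(?:questions|q)/(\d+)")


def count_question_links(sources):
    """Count StackOverflow question links across an iterable of source texts."""
    counts = Counter()
    for text in sources:
        for question_id in SO_QUESTION.findall(text):
            counts["stackoverflow.com/questions/" + question_id] += 1
    return counts


# Hypothetical file contents, standing in for scraped Python files.
sample_sources = [
    "# see http://stackoverflow.com/questions/279237/import-a-module\nimport os\n",
    "# https://stackoverflow.com/q/279237\n"
    "# http://stackoverflow.com/questions/35817/how-to\n",
]
link_counts = count_question_links(sample_sources)
print(link_counts.most_common(2))
```

As the windows_support.py example shows, raw counts like these are inflated by vendored copies of the same file, so deduplicating identical files before counting would be a sensible extra step.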

Most controversial Python StackOverflow answer

StackOverflow has become the most popular forum for developers to ask, answer and importantly promote or demote content. StackOverflow does something even more incredible: they expose all their interaction data (questions, answers, views, votes) through a public query interface. Using this, we can compute, what is the most controversial Python answer?

To do this, we will use the following algorithm: find the answer whose share of upvotes, U/(U+D), is close to 0.5, and that also has lots of votes. The former requirement is a good definition of "controversial", and the latter requirement protects us against answers with trivial counts (ex: 1 upvote and 1 downvote). Think of it as a balancing act: how confident are we that this answer is indeed the most controversial? The following query accomplishes this (based on a similar equation in this post):

declare @VoteStats table (parentid int, id int, U float, D float) 

insert @VoteStats
SELECT 
  a.parentid,
  a.id,
  CAST(SUM(case when (VoteTypeID = 2) then 1. else 0. end) + 1. as float) as U, -- upvotes, plus 1 for smoothing
  CAST(SUM(case when (VoteTypeID = 3) then 1. else 0. end) + 1. as float) as D  -- downvotes, plus 1 for smoothing
FROM Posts q
JOIN PostTags qt 
  ON qt.postid = q.ID
JOIN Tags T 
  ON T.Id = qt.TagId
JOIN Posts a 
  ON q.id = a.parentid
JOIN Votes 
  ON Votes.PostId = a.Id
WHERE TagName  = 'python'
   and a.PostTypeID = 2 -- these are answers
Group BY a.id, a.parentid

set nocount off

SELECT 
 TOP 100
 parentid,
 id,
 U, D,
 ABS(0.5 - U/(U+D) - 3.5*SQRT(U*D / ((U+D) * (U+D) * (U+D+1)))) + 
   ABS(0.5 - U/(U+D) + 3.5*SQRT(U*D / ((U+D) * (U+D) * (U+D+1)))) as Score
FROM @VoteStats 
ORDER BY Score

Running this produces the following table (as of Oct. 24, 2015):

parentid	url	U	D	Score
1641219	http://stackoverflow.com/questions/1641305	100	58	0.267581687129904
366980	http://stackoverflow.com/questions/367082	55	29	0.360985397926758
904928	http://stackoverflow.com/questions/904941	44	40	0.379197639329681
1641219	http://stackoverflow.com/questions/1945699	49	23	0.382002382488145
734368	http://stackoverflow.com/questions/734910	48	30	0.38315203605798
7479442	http://stackoverflow.com/questions/7479473	46	23	0.394405318873308
620367	http://stackoverflow.com/questions/620397	42	24	0.411383595098925
969285	http://stackoverflow.com/questions/969324	49	20	0.420289855072464
1566266	http://stackoverflow.com/questions/1566285	39	24	0.424918292799399

The closer the score is to 0, the more controversial the answer is. Take a look at the answers' comments to see debates about why each answer is controversial.
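For readers who find the SQL expression hard to parse, here is a Python port of the `Score` column. The function name `controversy_score` and the `width` parameter are made up for this sketch; `u` and `d` are the smoothed counts exactly as they appear in the table (raw votes plus 1). My reading, which the query itself does not state, is that `U/(U+D)` and the term under the square root are the mean and variance of a Beta(U, D) distribution.

```python
from math import sqrt


def controversy_score(u, d, width=3.5):
    """Python port of the Score expression; u and d already include the +1."""
    u, d = float(u), float(d)
    n = u + d
    mean = u / n                           # share of upvotes
    sd = sqrt(u * d / (n * n * (n + 1)))   # shrinks as the vote count grows
    return abs(0.5 - mean - width * sd) + abs(0.5 - mean + width * sd)


# First and eighth rows of the table above.
print(controversy_score(100, 58))
print(controversy_score(49, 20))
```

Since |a - b| + |a + b| = 2 * max(|a|, |b|), the score is twice the larger of two quantities: the distance of the upvote share from 0.5, and 3.5 standard deviations. An answer only scores low when its vote split is near even and it has enough votes for the standard deviation to be small.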

2-Spaces vs 4-Spaces

Let's not argue: let's look at the empirical data. I looked at over 23 thousand Python repos and computed the most common indenting practice in each repo. The results were heavily in favor of 4-spaces: 88% of repos used 4-spaces, and only 7% of repos used 2-spaces. What about the remaining 5%? Well, some repos used 8-spaces, and some used 1-space! Examples: https://github.com/aqt01/UnderWaterWorld uses 8-spaces, and https://github.com/sanglech/CSC326 uses 1-space.
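One plausible heuristic for "the most common indenting practice in a repo" is sketched below: for each file, take the most common increase in leading spaces between consecutive code lines, then let the files of a repo vote. This is an assumption about the method rather than the code used here, and the names `indent_width` and `repo_indent_width` are hypothetical.

```python
from collections import Counter


def indent_width(source):
    """Guess one file's indent width: the most common jump in leading spaces."""
    jumps = Counter()
    previous = 0
    for line in source.splitlines():
        stripped = line.lstrip(" ")
        # Ignore blank lines, comment lines and tab-indented lines.
        if not stripped or stripped.startswith("#") or stripped.startswith("\t"):
            continue
        current = len(line) - len(stripped)
        if current > previous:
            jumps[current - previous] += 1
        previous = current
    return jumps.most_common(1)[0][0] if jumps else None


def repo_indent_width(sources):
    """Let every file in a repo vote with its own guessed indent width."""
    votes = Counter(w for w in map(indent_width, sources) if w is not None)
    return votes.most_common(1)[0][0] if votes else None


four = "def f(x):\n    if x:\n        return 1\n    return 0\n"
two = "def g():\n  return 2\n"
print(indent_width(four), indent_width(two), repo_indent_width([four, four, two]))
```

Continuation lines aligned to an opening bracket will add some noise to the jump counts, which is why the sketch takes the most common jump rather than the smallest one.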

What is the most popular testing framework?

Passing through the tens of thousands of repos, I looked for imports of the most popular testing libraries: pytest, unittest, nose and testify. Here were the results:

package	count	percent of total
None	22162	86%
unittest	3032	12%
nose	379	1.5%
pytest	293	1%
testify	4	~0%
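Looking for imports of a fixed set of libraries can be approximated with a regular expression. The sketch below is illustrative only: the names `FRAMEWORKS`, `IMPORT_RE`, and `frameworks_used` are made up, and how a repo importing several frameworks was assigned to a single row of the table is not something this sketch tries to reproduce.

```python
import re

FRAMEWORKS = ("pytest", "unittest", "nose", "testify")

# Matches "import nose" and "from nose.tools import ...", but not "import nosetests".
IMPORT_RE = re.compile(
    r"^\s*(?:import|from)\s+(" + "|".join(FRAMEWORKS) + r")\b",
    re.MULTILINE,
)


def frameworks_used(sources):
    """Return the set of testing frameworks imported anywhere in a repo's files."""
    found = set()
    for text in sources:
        found.update(IMPORT_RE.findall(text))
    return found


# Hypothetical file contents, standing in for one repo.
repo_sources = ["import unittest\n", "    from nose.tools import eq_\n"]
print(sorted(frameworks_used(repo_sources)))
```

A regex is used instead of the `ast` module so that Python 2 and Python 3 sources are treated alike; the cost is that forms such as `import os, unittest` are missed.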

What about using Python for functional programming?

If you are going to use Python for functional programming, or semi-functional programming, you're probably going to be using libraries like `functools`, `itertools`, `toolz` and others. How many Python repos use this style of programming? Data shows about 15% of repos do this.

How often do we disobey flat is better than nested?

from com.sun.org.apache.xerces.internal.impl.io import \
            MalformedByteSequenceException

(from here)

Is this ugly or beautiful? Python says it's ugly - after all, flat is better than nested. How often do we break this? For this, I looked at the maximum import nest in each repo. Here's the breakdown:
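Separately from the breakdown itself, here is a rough sketch of how a "maximum import nest" could be measured. It assumes that the nest of an import is the number of dot-separated components in the module path, which is my interpretation and not a definition given here; `IMPORT_MODULE` and `max_import_nest` are hypothetical names.

```python
import re

# Captures the dotted module path after "import" or "from".
IMPORT_MODULE = re.compile(r"^\s*(?:from|import)\s+([A-Za-z_][\w.]*)", re.MULTILINE)


def max_import_nest(sources):
    """Deepest dotted module path imported anywhere in a repo's files."""
    deepest = 0
    for text in sources:
        for module in IMPORT_MODULE.findall(text):
            deepest = max(deepest, len(module.split(".")))
    return deepest


example = (
    "from com.sun.org.apache.xerces.internal.impl.io import \\\n"
    "            MalformedByteSequenceException\n"
)
print(max_import_nest([example]))
```

Relative imports (`from . import x`) and the later modules in a comma-separated `import a.b, c.d.e` are ignored by this pattern.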

Topic modelling Python source code using LDA

What happens when we apply a topic modelling algorithm, like Latent Dirichlet Allocation, to hundreds of thousands of Python source code files? To be clear: this is not something you usually do! Topic modelling is meant for articles and reviews: human-readable text. Python code, on the other hand, is full of keywords in illogical order and words repeated over and over again, and developers use odd acronyms and abbreviations for all their variables! But, let's try it anyways.
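Before a topic model can see source code, each file has to be turned into a bag of words. The sketch below shows one way that preprocessing could look, using only the standard library. It is an assumption, not the preprocessing used here: the function name `source_to_bag_of_words` and the choices to drop Python keywords and to split `snake_case` and `camelCase` identifiers are mine.

```python
import keyword
import re
from collections import Counter

WORD = re.compile(r"[A-Za-z]+")            # digits and underscores act as separators
CAMEL = re.compile(r"(?<=[a-z])(?=[A-Z])")  # boundary inside camelCase names


def source_to_bag_of_words(source, min_length=2):
    """Turn one source file into word counts suitable for a topic model."""
    words = Counter()
    for token in WORD.findall(source):
        for part in CAMEL.sub(" ", token).split():
            part = part.lower()
            # Keywords carry no topical signal, so drop them.
            if len(part) >= min_length and not keyword.iskeyword(part):
                words[part] += 1
    return words


sample_source = (
    "import unittest\n"
    "class FooTest(unittest.TestCase):\n"
    "    def test_equal(self):\n"
    "        self.assertEqual(foo(), expected_result)\n"
)
bag = source_to_bag_of_words(sample_source)
print(bag.most_common(3))
```

The resulting counts can then be stacked into a document-term matrix and handed to whichever LDA implementation you prefer.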

After training LDA on the repos and libraries I downloaded, I came up with these topics. For example, we can see the topic:

python, version, package, author, setup, description, language, copyright, packages, license

obviously this is the setup.py topic

test, equal, case, tests, foo, unittest, equals, suite, result, expected

this is the testing topic,

grid, color, plot, plt, label, step, data, width, ax, size

the matplotlib plotting topic.

See if you can find others in the output above.
