AndyFou / github-languages

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Github Languages Network Analysis

Language Graph

Description

This is a university project about the analysis of the languages used by the github community and some sub-communities discriminated by location.

Data

500 users since the beginning of Github & ~150000 users from 2012 to 2013 (narrowed down to ~40.000 that stated their location)

Procedure

Collection of locations (as giver by the user), languages (of user's public repos) & other features (nof public_repos, nof followers)

  • modelling languages (connections via users (pairs of languages)) & detect communities using association rules
  • applying the same on some particular locations (UK, USA - SF, Australia, Germany, Brazil, Greece) - ίσως τα τοπ μερη; ή οι μεγαλύτερες/πολυπληθέστερες χώρες στον κόσμο; - ίσως αναλογικά αυτό
  • applying the same only for highly influential users (meaning with many followers and/or many public_repos)

Challenges

  • locations issues: empty loc, not all same structure, some are not real, special characters not recognized by encoding (ex mxico, so paolo/paulo, zrich, montral, malm, florianpolis,dsseldorf)
  • languages issues: access denied on repositories, only public repos (no forks or private)

Ideas

  • Get lang graph for locations

  • Get lang graph for influential users

  • Association Rules for Community Detection

  • Get graph of users: nodes-users edges-common langs

  • Get top countries for developers by a source (eg http://goo.gl/vBF6BE or https://goo.gl/bwxcoZ)

Descriptive Statistics: users per location (top pie / top barchart), langs and bytes

Notes

  • το cluster του web το βγαζει πολυ καλα σε σχεση με αλλα (λογικο απο τη φυση των δεδομενων, το github παιζει πολυ με web devs)
  • results-loc: some places have too many users (SF) and most have too little (powerlaw)

Interesting Sources

Dependencies

About

License:MIT License


Languages

Language:Python 88.0%Language:HTML 12.0%