- Copyright (c) 2017 Takahiro Komamizu
- License: CC-BY-SA-4.0
- Author homepage
- Research paper > http://dx.doi.org/10.18293/SEKE2017-109 (to appear)
- Please cite
Takahiro Komamizu, Yasuhiro Hayase, Toshiyuki Amagasa, Hiroyuki Kitagawa,
“Exploring Identical Users on GitHub and Stack Overflow”,
in Proc. the 29th International Conference on Software Engineering and Knowledge Engineering (SEKE 2017),
pp.584-589, Pittsburgh, USA, July 5-7, 2017
- Dump of SO (PostgreSQL dump)
- Dump of GH (MySQL dump)
- source:
http://ghtorrent.org/downloads/msr14-mysql.gz (no longer available)
- Instead, you can get the dump from GHTorrent (http://ghtorrent.org/downloads.html)
- Use mysql-postgresql-converter to turn it into a PostgreSQL dump
- source:
- User pairs from SO and GH who have the same email (matched through MD5 hashing)
- data/common_users.csv
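
The hash-based matching can be sketched as follows. This is a minimal illustration, not the script used to produce data/common_users.csv: it assumes one side exposes plain email addresses and the other exposes MD5 hex digests, and the function names, the normalization step, and the sample records are all hypothetical.

```python
import hashlib

def md5_hex(email):
    """Return the MD5 hex digest of a trimmed, lowercased email (assumed normalization)."""
    return hashlib.md5(email.strip().lower().encode("utf-8")).hexdigest()

def match_users(gh_users, so_users):
    """gh_users: iterable of (gh_user_id, email).
    so_users: iterable of (so_user_id, email_hash).
    Returns (so_user_id, gh_user_id) pairs whose hashes agree."""
    gh_by_hash = {}
    for gh_id, email in gh_users:
        if email:  # skip users without a public email
            gh_by_hash.setdefault(md5_hex(email), []).append(gh_id)
    pairs = []
    for so_id, email_hash in so_users:
        for gh_id in gh_by_hash.get(email_hash, []):
            pairs.append((so_id, gh_id))
    return pairs

# Made-up records for illustration only.
gh = [(1, "Alice@Example.com "), (2, "bob@example.com")]
so = [(10, md5_hex("alice@example.com")), (11, md5_hex("carol@example.com"))]
common_pairs = match_users(gh, so)
```

Indexing one side by hash keeps the matching linear in the number of users instead of comparing every pair.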
- Both dumps of SO and GH are in the same database (namely, gh_so).
- Rename tables in SO to "so_" + original name.
- This makes it easy to join tables between SO and GH.
- Common users are stored in the database (gh_so).
- Table: gh_so_common_users (so_user_id, gh_user_id)
- Down sampling of negative data
- Making pairs of GH (resp. SO) users in gh_so_common_users with SO (resp. GH) users not in gh_so_users
- Sample as many negative pairs as there are pairs in gh_so_common_users, for GH and for SO respectively.
- Tool:
- learning/negativeDataGen.py
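
A rough sketch of this downsampling, under stated assumptions: it is not the code in learning/negativeDataGen.py, the function names and the fixed seed are hypothetical, and pairs are written as (so_user_id, gh_user_id) to mirror the gh_so_common_users table.

```python
import random

def _draw(rng, so_pool, gh_pool, n):
    """Draw up to n distinct (so_user_id, gh_user_id) pairs from so_pool x gh_pool."""
    total = len(so_pool) * len(gh_pool)
    # Sample flat indices so the full cross product is never materialized.
    picks = rng.sample(range(total), min(n, total))
    return [(so_pool[k // len(gh_pool)], gh_pool[k % len(gh_pool)]) for k in picks]

def sample_negative_pairs(common_pairs, all_so_ids, all_gh_ids, seed=0):
    rng = random.Random(seed)
    known_so = sorted({so for so, _ in common_pairs})
    known_gh = sorted({gh for _, gh in common_pairs})
    other_so = sorted(set(all_so_ids) - set(known_so))
    other_gh = sorted(set(all_gh_ids) - set(known_gh))
    n = len(common_pairs)
    # GH users with a known match, paired with SO users that have none.
    gh_side = _draw(rng, other_so, known_gh, n)
    # SO users with a known match, paired with GH users that have none.
    so_side = _draw(rng, known_so, other_gh, n)
    return gh_side + so_side

# Made-up IDs for illustration only.
common = [(10, 1), (11, 2)]
negatives = sample_negative_pairs(common, [10, 11, 12, 13], [1, 2, 3, 4])
```

Each negative pair contains exactly one user with a known match, so it cannot coincide with a positive pair.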
- Project descriptions as GH users' feature
- Project descriptions are considered to reflect the interests of GH users
- Tool:
- learning/user_project_dec_mapping.sql
- Labeled data
- Label positive pairs and negative pairs with 1 and 0 respectively.
- Tool or data:
- learning/createView.sql
- data/labeled_data.csv
- Similarity on date attributes
- Inverse of date duration
- GH: created_at (timestamp)
- SO: creation_date (date)
- Tool:
- learning/dateSimilarity.py
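
One way to read "inverse of date duration" is sketched below. The exact formula in learning/dateSimilarity.py is not stated here, so the day granularity and the `+ 1` in the denominator (which avoids division by zero for same-day accounts) are assumptions, as is the function name.

```python
from datetime import date, datetime

def date_similarity(gh_created_at, so_creation_date):
    """Inverse of the gap in days between account creation dates (assumed form)."""
    # GH stores a timestamp, SO a date; compare both at day granularity.
    if isinstance(gh_created_at, datetime):
        gh_day = gh_created_at.date()
    else:
        gh_day = gh_created_at
    days = abs((gh_day - so_creation_date).days)
    return 1.0 / (1.0 + days)

same_day = date_similarity(datetime(2012, 5, 1, 13, 30), date(2012, 5, 1))
three_days = date_similarity(datetime(2012, 5, 1, 13, 30), date(2012, 5, 4))
```

With this form the score is 1.0 for accounts created on the same day and decays toward 0 as the gap grows.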
- Similarity on name attribute
- Trigram-based similarity
- GH: name (text)
- SO: display_name (text)
- Tool:
- learning/nameSimilarity.py
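
A trigram-based similarity can be sketched as the Jaccard overlap of character trigrams. This is an illustration only: the padding scheme, the lowercasing, and the use of Jaccard are assumptions, and learning/nameSimilarity.py may define the measure differently.

```python
def trigrams(text):
    """Set of character trigrams of a lowercased, space-padded string."""
    cleaned = text.lower().strip()
    if not cleaned:
        return set()
    padded = "  " + cleaned + " "  # padding lets short names still produce trigrams
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def trigram_similarity(a, b):
    """Jaccard similarity of the two trigram sets, in [0, 1]."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

identical = trigram_similarity("Taka", "taka")
disjoint = trigram_similarity("abc", "xyz")
partial = trigram_similarity("Takahiro Komamizu", "T. Komamizu")
```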
- Similarity on location attributes
- TFIDF-based similarity
- GH: location (text)
- SO: location (text)
- Tool:
- learning/locationSimilarity.py
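
A TF-IDF-based similarity can be sketched in plain Python as the cosine between TF-IDF vectors. The tokenizer, the smoothed IDF formula, and all names below are assumptions made for illustration; they are not taken from learning/locationSimilarity.py.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def build_idf(documents):
    """Smoothed inverse document frequency over a corpus of strings."""
    n = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}

def tfidf_vector(text, idf):
    """Sparse TF-IDF vector as a dict; terms unseen in the corpus are dropped."""
    tf = Counter(tokenize(text))
    return {term: count * idf[term] for term, count in tf.items() if term in idf}

def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up location strings for illustration only.
locations = ["Tsukuba, Japan", "Tokyo, Japan", "Pittsburgh, USA"]
idf = build_idf(locations)
sim_same = cosine(tfidf_vector("Tsukuba, Japan", idf), tfidf_vector("Tsukuba Japan", idf))
sim_partial = cosine(tfidf_vector("Tsukuba, Japan", idf), tfidf_vector("Tokyo, Japan", idf))
sim_none = cosine(tfidf_vector("Tsukuba, Japan", idf), tfidf_vector("Pittsburgh, USA", idf))
```

The same functions would apply to longer texts such as project descriptions, since only the corpus passed to `build_idf` changes.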
- Similarity on descriptive attributes
- TFIDF-based similarity
- GH: project description (text)
- SO:
- aboutme (text)
- comments (text)
- post body (text)
- post title (text)
- post tags (text)
- Tool:
- learning/descVsAboutme.py
- learning/descVsComment.py
- learning/descVsPosts.py
- Similarity matrix construction
- For each pair of a GH user and an SO user, the above similarities are associated as features.
- To learn classification models, the features of all learning pairs are arranged as a matrix.
- Tool:
- learning/similarityMatrixGen.py
- data/s.mtx
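
The matrix construction might look like the sketch below: one row per user pair, one column per similarity. The feature order, the function names, and the guess that data/s.mtx uses the Matrix Market coordinate format are assumptions, not facts taken from learning/similarityMatrixGen.py.

```python
import io

# Hypothetical column order for the similarity features.
FEATURES = ["date", "name", "location", "description"]

def build_matrix(pair_features):
    """pair_features: list of dicts mapping feature name -> similarity.
    Missing features default to 0.0."""
    return [[float(f.get(name, 0.0)) for name in FEATURES] for f in pair_features]

def write_matrix_market(matrix, out):
    """Write a dense list-of-lists as a sparse Matrix Market coordinate file."""
    rows = len(matrix)
    cols = len(matrix[0]) if matrix else 0
    entries = [
        (i + 1, j + 1, value)  # Matrix Market indices are 1-based
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value != 0.0
    ]
    out.write("%%MatrixMarket matrix coordinate real general\n")
    out.write(f"{rows} {cols} {len(entries)}\n")
    for i, j, value in entries:
        out.write(f"{i} {j} {value!r}\n")

# Made-up similarity values for two user pairs.
matrix = build_matrix([{"date": 0.25, "name": 1.0}, {"location": 0.5}])
buffer = io.StringIO()
write_matrix_market(matrix, buffer)
mtx_text = buffer.getvalue()
```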
- Learning classifiers
- Methods: linear regression (lr), k-nearest neighbor (knn), logistic regression (lg), random forest (rf), and gradient boosting decision tree (gndt)
- Tool:
- learning/classifierLearning.py
- models.zip
- models/*.pkl
- Learned models
- models/*.pkl
- Learned model in models/xxx.pkl
- In this project, the models directory is zipped because of the file size limit, so unzip it first.
- Prediction module
- Using a selected model, the module predicts whether randomly selected pairs of users are the same person.
- This process is time-consuming if the number of pairs is large, so by default only 50000 pairs are computed.
- Tool:
- predict/predict.py
- predict/writePredicted.py
- data/predicted.tsv
- The predicted user pairs with their probabilities
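
The sampling and output steps of the prediction can be sketched as follows. This is not the code in predict/predict.py or predict/writePredicted.py: the function names, the stand-in `dummy_score` callable (used in place of a loaded model), and the three-column TSV layout are assumptions.

```python
import csv
import io
import random

def sample_pairs(so_ids, gh_ids, limit=50000, seed=0):
    """Randomly pick up to `limit` distinct (so_user_id, gh_user_id) pairs."""
    rng = random.Random(seed)
    total = len(so_ids) * len(gh_ids)
    # Sample flat indices so the full cross product is never materialized.
    picks = rng.sample(range(total), min(limit, total))
    return [(so_ids[k // len(gh_ids)], gh_ids[k % len(gh_ids)]) for k in picks]

def write_predicted(pairs, score, out):
    """Write one tab-separated line per pair: so_user_id, gh_user_id, probability."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for so_id, gh_id in pairs:
        writer.writerow([so_id, gh_id, f"{score(so_id, gh_id):.4f}"])

def dummy_score(so_id, gh_id):
    """Stand-in for a learned model; returns a fake probability."""
    return 1.0 if so_id == gh_id else 0.0

# Made-up ID ranges for illustration only.
so_ids = list(range(1000))
gh_ids = list(range(1000))
pairs = sample_pairs(so_ids, gh_ids)
buffer = io.StringIO()
write_predicted(pairs, dummy_score, buffer)
```

Capping the sample keeps the run time bounded, since the number of candidate pairs grows with the product of the two user counts.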