Identifying topical experts on Twitter using information from StackOverflow and Quora
- Redis database == v3.2.0 (Compile from Redis.org source, not brew install!)
- Scrapy crawler for crawling Quora
- Link up users from Twitter to their accounts on Quora and StackOverflow
- Discover Twitter accounts using a few seed accounts on Twitter Lists, and do a recursive crawl
- Deciding when to stop crawling is non-trivial and requires investigation
- Identify foreign accounts using i) Jaro-Winkler string similarity on names, ii) location string matching, iii) Goldberg profile image similarity
- Repeat the crawl every 7 days
- Native - tag score on StackOverflow, number of views on Quora, none on Twitter (no native method)
  i) Following Cognos, obtain for each user a topic vector {(t, f)}, where t is a topic (tags on StackOverflow; topics on the user profile page on Quora; inferred from lists on Twitter) and f is its weight: the tag score on StackOverflow, the number of views for the topic on Quora (from the most viewed writers page), or the frequency of occurrence of the topic in the names and descriptions of the lists containing the user on Twitter
  ii) Obtain the topical similarity between the topic vector and the search query using Cover Density Ranking, then multiply by log(f) to arrive at the final rankings
- PageRank - using followee-follower graph of data on Quora, StackOverflow? (BONUS)
- ExpertiseRank (variant of PageRank) - http://www.ladamic.com/papers/expertiserank/ExpertiseRank.pdf, extract User-helped->User graph and run PageRank (BONUS)
- Tier 1 weighting - proportion of time user spends on each OSN (users/id/network_activity for SO, tweet frequency on Twitter, www.quora.com/profile/{username}/activity on Quora)
- Tier 2 weighting - MAUs of each OSN
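The name-matching step in the account-linking bullets above relies on Jaro-Winkler similarity. A minimal sketch of that measure in plain Python follows. The `names_match` helper, its lowercase/strip normalisation, and its 0.9 threshold are illustrative assumptions, not values taken from this project.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only if they are this close in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def names_match(name_a: str, name_b: str, threshold: float = 0.9) -> bool:
    """Hypothetical helper: the threshold is an assumption, tune on real data."""
    a, b = name_a.strip().lower(), name_b.strip().lower()
    return jaro_winkler(a, b) >= threshold
```

A call such as `names_match("Jon Skeet", "jon skeet")` compares a Twitter display name with a StackOverflow display name. In practice the result would be combined with the location and profile-image signals listed above rather than used alone.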
Analysis - How did the inclusion of external OSNs i) change the ranking of experts on Twitter and ii) improve the system overall?
- Using human evaluators - blind testing (BONUS)
- How to automate seeding of accounts in order to automatically discover new topics
- What if we change the analysis to start from Quora and find matching accounts on Twitter? (Much faster and easier, because we start with experts on Quora, and once we find the matching Twitter account we can follow the Cognos methodology to arrive at an expert ranking)
- How do we determine the activity of each user on an OSN? (Activity on Quora, last updated on StackOverflow)
- How to overcome IP based throttle limits imposed by SE
- How to increase overall throughput to process 3 million users in 7 days (about 0.2 s per user)? It currently takes about 10 s per user, or 347 days in total
- How to ensure all topics on Quora are scraped. How to ensure new topics are covered? (Alphabetical site map, not sure how often it is updated)
- How to maximize data extraction from Quora (Fast in, fast out)
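The throughput question above comes down to simple arithmetic. The sketch below reproduces the figures and estimates how many parallel workers would close the gap, assuming per-user work parallelises perfectly and that API rate limits allow it. Both are assumptions, and `throughput_plan` is a hypothetical helper name.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def throughput_plan(users: int, window_days: float, seconds_per_user: float):
    """Return (time budget per user, sequential days, workers needed)."""
    # Time available per user if the whole crawl must fit in the window.
    budget = window_days * SECONDS_PER_DAY / users
    # How long one sequential worker takes at the current speed.
    sequential_days = users * seconds_per_user / SECONDS_PER_DAY
    # Workers needed if work splits evenly (an idealised assumption).
    workers = math.ceil(seconds_per_user / budget)
    return budget, sequential_days, workers


budget, sequential_days, workers = throughput_plan(3_000_000, 7, 10.0)
print(f"{budget:.4f} s per user, {sequential_days:.0f} days sequentially, "
      f"{workers} workers")
```

Under these idealised assumptions, roughly 50 concurrent workers would be needed. IP-based throttling, noted in the list above, may make the real number higher.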
"twitter_screen_name" (HASH)
"twitter_id" -> "18193572"
"twitter_name" -> "Jon Skeet"
"twitter_screen_name" -> "jonskeet"
"twitter_profile_image_url" -> "http://pbs.twimg.com/profile_images/553764312716550144/ViDhuySK_normal.jpeg"
"twitter_verified" -> "False"
"twitter_description" -> "Christian, husband (of @HollyKateSkeet), father, feminist, software engineer (currently at Google), author, @stackoverflow contributor."
"twitter_created_at" -> "Wed Dec 17 17:14:47 +0000 2008"
"twitter_listed_count" -> "1804"
"twitter_location" -> "Reading, UK"
"twitter_last_crawled" -> "1470072821.846624"
--------------if matched with StackExchange-----------------
"so_account_id" -> "11683"
"so_last_crawled" -> "1470072846.222038"
"so_url" -> "http://stackoverflow.com/users/22656"
"so_creation_date" -> "2008-09-26 20:05:05"
"so_display_name" -> "Jon Skeet"
"so_location" -> "Reading, United Kingdom"
"so_profile_image" -> "https://www.gravatar.com/avatar/6d8ebb117e8d83d74ea95fbdd0f87e13?s=128&d=identicon&r=PG"
"so_reputation" -> "883882"
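As an illustration of how a hash like the one above might be written, here is a minimal sketch. It assumes `client` is any object exposing redis-py's `hset(name, mapping=...)` call (for example a `redis.Redis` instance), and that `profile` is a dict of raw Twitter fields. The helper names `build_twitter_hash` and `store_twitter_user` are hypothetical.

```python
import time

# Field names taken from the hash layout above, minus the "twitter_" prefix.
TWITTER_FIELDS = (
    "id", "name", "screen_name", "profile_image_url", "verified",
    "description", "created_at", "listed_count", "location",
)


def build_twitter_hash(profile: dict, now: float = None) -> dict:
    """Flatten a profile dict into string fields prefixed with 'twitter_'."""
    fields = {"twitter_" + f: str(profile.get(f, "")) for f in TWITTER_FIELDS}
    # Stamp the crawl time as a Unix timestamp, as in the layout above.
    stamp = time.time() if now is None else now
    fields["twitter_last_crawled"] = str(stamp)
    return fields


def store_twitter_user(client, profile: dict, now: float = None) -> str:
    """Write the hash under the screen name and return the key used."""
    key = profile["screen_name"]
    client.hset(key, mapping=build_twitter_hash(profile, now))
    return key
```

Storing every value as a string mirrors the layout above, where even `twitter_verified` and `twitter_listed_count` appear as quoted strings.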
"reverseengineering.stackexchange.com:offset"
"count" -> 6
"name" -> "offset"
"site" -> "reverseengineering.stackexchange.com"
"set:stackexchange:matched_experts_set" : REDIS.SET("magento.stackexchange.com:ctasca",...)
"topics:magento.stackexchange.com:user" : REDIS.SET("magento.stackexchange.com:topic1", "stackoverflow.com:topic2")
"magento.stackexchange.com:ctasca"
"so_reputation" -> 31
"so_profile_image" -> "https://i.stack.imgur.com/Q7lUq.jpg?s=128&g=1"
"so_display_name" -> "ctasca"
"so_last_crawled" -> "1470509582.730663"
"so_id" -> 5045
"so_link" -> "http://magento.stackexchange.com/users/5045/ctasca"
--------------if matched with Twitter profile---------------
"twitter_id" -> ""
"twitter_name" -> ""
"twitter_screen_name" -> ""
"twitter_profile_image_url" -> ""
"twitter_verified" -> ""
"twitter_description" -> ""
"twitter_created_at" -> ""
"twitter_listed_count" -> ""
"twitter_location" -> ""
"twitter_last_crawled" -> ""
"topic"
"q_name" -> "Computer Programming"
"q_description" -> "Computer programming (often shortened to programming) is a process that leads from an original formulation of a computing problem to executable programs. It involves activities such as analysis, understanding, and generica..."
"q_num_questions" -> "567000"
"q_num_followers" -> "5000000"
"q_num_edits" -> "567"
"q_last_crawled" -> "1470596056.640988"
"q_experts_last_crawled" -> "1470596056.640988"
"quora:matched_experts_set" : REDIS.SET("quora:expert:writer_name",...)
"quora:topics:writer_name" : REDIS.SET("topic1","topic2"...)
"quora:expert:writer_name"
"q_name" -> "Sanjay Nandan"
"q_short_description" -> ""
"q_profile_image_url" -> "https://qph.ec.quoracdn.net/main-thumb-43516739-100-tnrxfcqmwqvzcrmhydixcdlheqozuznw.jpeg"
"q_num_views" -> "88833"
"q_last_crawled" -> "1470904208.092124"
--------------if matched with Twitter profile---------------
"twitter_id" -> ""
"twitter_name" -> ""
"twitter_screen_name" -> ""
"twitter_profile_image_url" -> ""
"twitter_verified" -> ""
"twitter_description" -> ""
"twitter_created_at" -> ""
"twitter_listed_count" -> ""
"twitter_location" -> ""
"twitter_last_crawled" -> ""
DB - 5 Combined User (used for fast lookup of a user's linked accounts, if any, given a site and username)
"quora:username" -> Hash("twitter_screen_name": 'username')
"stackexchange:username" -> Hash("twitter_screen_name": 'username')
"twitter:username" -> Hash("so_display_name": 'site.stackexchange.com:username', "quora_name": 'username')
"site:topic1" -> Zset(<site:user1 : score>, <site:user2: score>)
"site:topic2" -> Zset(<site:user3 : score>, <site:user4 : score>)
"site:topic3" -> Zset(<site:user1 : score>, <site:user4 : score>)
(Score is expertise measure on respective site - reputation for StackOverflow, views on Quora, lg(listed frequency) on Twitter)
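The topic sorted sets above could be populated and queried roughly as follows. This is a sketch, assuming `client` exposes redis-py's `zadd(name, mapping)` and `zrevrange(name, start, end, withscores=...)` calls. The helper names are hypothetical, and reading "lg" as a base-2 logarithm is an assumption.

```python
import math


def expertise_score(site: str, raw_value: float) -> float:
    """Per-site score: raw value, except Twitter which uses a logarithm."""
    if site == "twitter":
        # Assumption: "lg" is read as log base 2; non-positive counts map to 0.
        return math.log2(raw_value) if raw_value > 0 else 0.0
    return float(raw_value)


def index_expert(client, site: str, topic: str, user: str, raw_value: float) -> str:
    """Add (or update) a user in the sorted set for a site-scoped topic."""
    key = f"{site}:{topic}"
    member = f"{site}:{user}"
    client.zadd(key, {member: expertise_score(site, raw_value)})
    return key


def top_experts(client, site: str, topic: str, n: int = 10):
    """Return up to n (member, score) pairs, highest score first."""
    return client.zrevrange(f"{site}:{topic}", 0, n - 1, withscores=True)
```

Because a sorted set keeps members ordered by score, reading the top experts for a topic is a single range query rather than a scan over all users.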
"quora:404" -> Set("https://www.quora.com/404.url", "https://www.quora.com/301.url")
"quora:topicUrls" -> Set("/sitemap/alphabetical_topics/jp")
"https://www.quora.com/sitemap/alphabetical_topics/n0" -> Hash("q_last_crawled" -> 12345534.223454)
...
What is needed?
- Start from https://www.quora.com/topic/Computer-Programming (example)
- Look for more topics under the related topics section and repeat from step 1
- If a most viewed writers section exists, follow its link, e.g. https://www.quora.com/topic/Computer-Programming/writers
- Scrape the number of views and the username
- Follow the username link to the user profile about page and scrape the user profile
- Get username, description, followers, following, location, links to social network, all time views, last 30-day views, profile_image_url, Knows About section
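The topic-discovery steps above amount to a breadth-first crawl over related topics. The sketch below shows only that traversal logic. `fetch_topic` is a hypothetical callable standing in for the real Scrapy parsing: it takes a topic URL and returns a dict with a `related` list of topic URLs and a `has_writers` flag. The `max_topics` cap is an assumed stop condition.

```python
from collections import deque


def crawl_topics(seed_url: str, fetch_topic, max_topics: int = 100) -> list:
    """Visit topics breadth-first and collect most-viewed-writers URLs."""
    seen = {seed_url}            # topics already queued, to avoid loops
    queue = deque([seed_url])
    writers_urls = []
    visited = 0
    while queue and visited < max_topics:
        url = queue.popleft()
        visited += 1
        page = fetch_topic(url)  # hypothetical: parse one topic page
        if page.get("has_writers"):
            writers_urls.append(url.rstrip("/") + "/writers")
        for related in page.get("related", []):
            if related not in seen:
                seen.add(related)
                queue.append(related)
    return writers_urls
```

Each URL in the returned list would then be scraped for views and usernames, and each username followed to its profile page as described above.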
For every user:
- Go to https://www.quora.com/profile/Ken-Mazaika/activity (example)
- Count number of posts between start and end time to arrive at activity frequency
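The activity-frequency count described above can be sketched as follows, assuming the post timestamps have already been scraped from the activity page into `datetime` objects. The function name and the posts-per-day unit are illustrative choices.

```python
from datetime import datetime, timedelta


def activity_frequency(post_times, start: datetime, end: datetime) -> float:
    """Posts per day within the half-open window [start, end)."""
    if end <= start:
        raise ValueError("end must be after start")
    count = sum(1 for t in post_times if start <= t < end)
    days = (end - start) / timedelta(days=1)
    return count / days
```

For example, three posts inside a 7-day window give a frequency of 3/7 posts per day. Posts outside the window are ignored.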
- Create virtualenv: mkvirtualenv fyp
- Dependencies: pip install -r requirements.txt
- Install framework Python either by symbolic linking from the macOS system-installed version or by following the tutorial here: http://matplotlib.org/faq/virtualenv_faq.html. Name the executable fpython (frameworkpython)
DEVELOPMENT
- "workon fyp" (name of your virtualenv) to get into the virtualenv
- fpython (name_of_pythonfile.py)
- Foreground: fpython list.py
- Background, redirect output to logfile: fpython list.py > path/to/outfile.log 2>&1 &
- Daemon: Circus: circusd --daemon circus.ini (use circusctl to control the daemon)
- Cognos - https://www.mpi-sws.org/~gummadi/papers/twitter_wtf.pdf
- Cover Density Ranking - http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.12.1615&rep=rep1&type=pdf
- StackExchange API - https://api.stackexchange.com/
- Twitter API - https://dev.twitter.com/rest/public/search
- Circus - https://circus.readthedocs.io/en/0.9.2/
- Scrapy - http://scrapy.org/