N.B.: I'm currently going through, organising, and tidying my code as I upoad it. So this repo's still a work-in-progress which I'm updating (reasonably) regularly
Codebase for Phishalytics: the measurement infrastructure system I designed and built to research phishing and malware attacks on Twitter during my PhD studies at Royal Holloway, University of London.
Measurement system I built during my PhD to collect and analyse large-scale datasets; including phishing and malware attacks on Twitter, blacklist characterisation, and phishing detection capabilities of web browsers.
Interacting with Phishalytics is carried out via an SSH connection in a terminal window. The server-side interface uses GNU Screen. The Screenshot above shows Phishalytics during one of our measurement studies. The layout consists of 18 windows; 16 small and 2 large. The two larger windows display a development area and the system monitor (htop command showing CPU and RAM usage, top processes, etc).
The 16 smaller windows in the above screenshot, labelled s1 to s16, show the following:
-
s1: twitter_stream.py - Twitter filter stream (tweets containing URLs). Each character in this window represents the following:
- # the tweet that is about to be processed is a retweet
- . tweet received from Twitter Stream API for processing
- ! for each URL within this single tweet
- + tweet saved to our system's database
-
s2: twitter_stream_sample.py - Twitter sample stream (same characters as above)
-
s3: update_gsb.py - Update our local copy of GSB blacklist
-
s4: update_phishtank_and_openphish.py - Update our local copies of PT and OP blacklists
-
s5: twitter_gsb_lookup_fast.py - Fast Google Safe Browsing tweeted URL lookup system
-
s6: twitter_gsb_lookup.py - Comprehensive Google Safe Browsing tweeted URL lookup system
-
s7: twitter_op_pt_lookup.py - Comprehensive Openphish and Phishtank tweeted URL lookup system
-
s8: twitter_op_pt_lookup_fast.py - Fast Openphish and Phishtank tweeted URL lookup system
-
s9: lookup_gsb_timestamps.py - GSB timestamp lookup system
-
s10: twitter_search_api_lookup.py - Twitter search API lookup system
-
s11: trending_hashtags.py - Retrieve and save current trending hashtags from Twitter API
-
s12: post_twitter_collection_processing.py - Post Twitter collection processing (for metadata such as: lookup redirections chains, num URL hops, landing page URL, calculate Levenshtein distance, determine if trending hashtags used, etc)
-
s13: compare_gsb_updates.py - Calculate, update, and compare GSB sizes
-
s14: Not currently being used for the present study
-
s15: status_monitor.py - Check everything is functioning correctly, check all feeds are live, etc. Send error notification emails to admin
-
s16: trending_hashtags_london.py - Currently trending hashtags on Twitter for London
- Twitter API key (dev site)
- Google Safe Browsing API key (dev site)
- Phishtank API key (dev site)
- Openphish API key (dev site)
- Bitly API key (dev site)
- gglsbl: Python client library for Google Safe Browsing Update API v4
- tweepy: Python client library for Twitter API
Used to run the main measurement study experiments:
Service name | Description | File |
---|---|---|
Twitter Stream | Stream public tweets that contain URLs via Twitter's filter stream API and save into local database | twitter_stream.py |
Twitter Sample Stream | Stream a small random sample of all public tweets via Twitter's sample stream API and save into local database | twitter_stream_sample.py |
Update GSB | Update our local copy of GSB blacklist using Safe Browsing Update API (v4) | update_gsb.py |
Update Phishtank and Openphish | Update our local copies of Openphish and Phishtank blacklists | update_phishtank_and_openphish.py |
Comprehensive GSB Twitter Lookup | Looks up all tweeted URLs in GSB blacklist since measurement experiment began. Gets progresively slower as experiment duration increases | twitter_gsb_lookup.py |
Fast GSB Twitter Lookup | Looks up all tweeted URLs in GSB blacklist from past 24 hours (approx. 1 million) | twitter_gsb_lookup_fast.py |
Comprehensive PT and OP Twitter Lookup | Looks up all tweeted URLs in both Openphish and Phishtank blacklists since measurement experiment began | twitter_op_pt_lookup.py |
Fast PT and OP Twitter Lookup | Looks up all tweeted URLs in Openphish and Phishtank blacklists blacklists from past 24 hours (approx. 1 million) | twitter_op_pt_lookup_fast.py |
GSB Timestamp Lookup | Lookup timestamps for when URLs were added to GSB | lookup_gsb_timestamps.py |
Twitter Search API Lookup | Determine when blacklisted URLs were first tweeted using Twitter's search API | twitter_search_api_lookup.py |
Trending Hashtags | Retrieve and save current global trending hashtags from Twitter's trends/place API. Uses WOEID=1 for global location. | trending_hashtags.py |
Post Twitter Collection Processing | Computes and saves metadata such as: lookup redirections chains, num URL hops, landing page URL, calculate Levenshtein distance, determine if trending hashtags used, etc. | post_twitter_collection_processing.py |
Compare GSB Updates | Calculate size of GSB blacklist on each update and across versions | compare_gsb_updates.py |
Status Monitor | Check everything is functioning correctly, check all feeds are live, etc. Send error notification emails to admin to alert if any problem | status_monitor.py |
Trending Hashtags London | Prints a list of currently trending hashtags in London, UK. Updates every 30 seconds | trending_hashtags_london.py |
Experimental setups, tests, supporting system, etc:
Service name | Description | File |
---|---|---|
ASCII Text | Used at ISG open day stall to showcase my measurement infrastructure. Displays ASCII text of project title and authors in main GNU screen window, whilst experiments ran in other windows. Requires asciimatics | ascii_text.py |
Bityl Click Stats | Leverages the Bitly API to access click stats for Bitly URLs collected via Twitter's Stream API | bitly_click_stats.py |
CertStream | Leverages CertStream-Python (library to see SSL certs as they're issued live) to create a dataset of potentially suspicious SSL domain certificates. For later verification with certstream_blacklist_url_lookup.py | certstream.py |
CertStream Blacklist URL Lookup | Check existing dataset of potentially suspicious SSL domain certificates for blacklist membership | certstream_blacklist_url_lookup.py |
Compare GSB URL Hash Prefixes | Compares URL hash prefixes in GSB blacklist to determine hash collisions and unique URL hashes | compare_gsb_hash_prefixes.py |
Compare T.co to Blacklists | Used to count total number of blocked t.co URLs that also appear in GSB, OP, or PT | compare_t_co_to_blacklists.py |
Count Num Domains | Count (and extract) total domain names in tweeted URLs dataset | count_num_domains.py |
Research papers that feature results obtained with Phishalytics:
Winner of the Best Paper and Best Student Paper awards:
BELL, S., AND KOMISARCZUK, P. "Measuring the Effectiveness of Twitter’s URL Shortener (t.co) at Protecting Users from Phishing and Malware Attacks". In Proceedings of the Australasian Computer Science Week Multiconference (2020). Link to paper.
BELL, S., AND KOMISARCZUK, P. "An Analysis of Phishing Blacklists: Google Safe Browsing, OpenPhish, and PhishTank". In Proceedings of the Australasian Computer Science Week Multiconference (2020). Link to paper.
BELL, S., PATERSON, K., AND CAVALLARO, L. "Catch Me (On Time) If You Can: Understanding the Effectiveness of Twitter URL Blacklists". arXiv preprint arXiv:1912.02520 (2019). Link to paper.