sjbell / phishalytics

Measurement system I built during my PhD to collect and analyse large-scale datasets, including phishing and malware attacks on Twitter, blacklist characterisation, and phishing detection capabilities of web browsers.

Home Page: https://phishalytics.com

N.B.: I'm currently going through, organising, and tidying my code as I upload it, so this repo is still a work in progress that I'm updating (reasonably) regularly.

Phishalytics

Codebase for Phishalytics: the measurement infrastructure system I designed and built to research phishing and malware attacks on Twitter during my PhD studies at Royal Holloway, University of London.

Overview

Measurement system I built during my PhD to collect and analyse large-scale datasets, including phishing and malware attacks on Twitter, blacklist characterisation, and phishing detection capabilities of web browsers.

Design Architecture

[Figure: design architecture for Phishalytics]

Screenshot

[Figure: Phishalytics terminal screenshot]

Interaction with Phishalytics takes place over an SSH connection in a terminal window. The server-side interface uses GNU Screen. The screenshot above shows Phishalytics during one of our measurement studies. The layout consists of 18 windows: 16 small and 2 large. The two larger windows display a development area and the system monitor (the htop command, showing CPU and RAM usage, top processes, etc.).

The 16 smaller windows in the above screenshot, labelled s1 to s16, show the following:

Prerequisites

Dependencies

  • gglsbl: Python client library for Google Safe Browsing Update API v4
  • tweepy: Python client library for Twitter API
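Before starting any of the services, it can help to confirm that both libraries are importable in the current environment. The following is a minimal sketch using only the standard library; the helper name `missing_dependencies` is hypothetical and not part of this repo.

```python
import importlib.util


def missing_dependencies(module_names):
    """Return the names in module_names that cannot be found for import."""
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # The two dependencies listed above.
    not_installed = missing_dependencies(["gglsbl", "tweepy"])
    if not_installed:
        print("Missing dependencies:", ", ".join(not_installed))
    else:
        print("All dependencies found.")
```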

Services

Core Services

Used to run the main measurement study experiments:

| Service name | Description | File |
| --- | --- | --- |
| Twitter Stream | Streams public tweets that contain URLs via Twitter's filter stream API and saves them into the local database | twitter_stream.py |
| Twitter Sample Stream | Streams a small random sample of all public tweets via Twitter's sample stream API and saves them into the local database | twitter_stream_sample.py |
| Update GSB | Updates our local copy of the GSB blacklist using the Safe Browsing Update API (v4) | update_gsb.py |
| Update PhishTank and OpenPhish | Updates our local copies of the OpenPhish and PhishTank blacklists | update_phishtank_and_openphish.py |
| Comprehensive GSB Twitter Lookup | Looks up all URLs tweeted since the measurement experiment began in the GSB blacklist. Gets progressively slower as the experiment duration increases | twitter_gsb_lookup.py |
| Fast GSB Twitter Lookup | Looks up URLs tweeted in the past 24 hours (approx. 1 million) in the GSB blacklist | twitter_gsb_lookup_fast.py |
| Comprehensive PT and OP Twitter Lookup | Looks up all URLs tweeted since the measurement experiment began in both the OpenPhish and PhishTank blacklists | twitter_op_pt_lookup.py |
| Fast PT and OP Twitter Lookup | Looks up URLs tweeted in the past 24 hours (approx. 1 million) in the OpenPhish and PhishTank blacklists | twitter_op_pt_lookup_fast.py |
| GSB Timestamp Lookup | Looks up timestamps for when URLs were added to GSB | lookup_gsb_timestamps.py |
| Twitter Search API Lookup | Determines when blacklisted URLs were first tweeted using Twitter's search API | twitter_search_api_lookup.py |
| Trending Hashtags | Retrieves and saves current global trending hashtags from Twitter's trends/place API. Uses WOEID=1 for the global location | trending_hashtags.py |
| Post Twitter Collection Processing | Computes and saves metadata such as redirection chains, number of URL hops, landing page URL, Levenshtein distance, whether trending hashtags were used, etc. | post_twitter_collection_processing.py |
| Compare GSB Updates | Calculates the size of the GSB blacklist on each update and across versions | compare_gsb_updates.py |
| Status Monitor | Checks that everything is functioning correctly, that all feeds are live, etc. Sends error notification emails to the admin if there is a problem | status_monitor.py |
| Trending Hashtags London | Prints a list of currently trending hashtags in London, UK. Updates every 30 seconds | trending_hashtags_london.py |
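The difference between the comprehensive and fast lookups listed above can be sketched as a time-window filter applied before the blacklist check. This is an illustration only, not the code in the lookup scripts: the function name, the `(url, tweeted_at)` tuple layout, and the use of exact string membership in a Python set are all assumptions (a real GSB lookup matches hashed URL expressions rather than raw strings).

```python
from datetime import datetime, timedelta, timezone


def lookup_tweeted_urls(tweets, blacklist, now, window=None):
    """Return tweeted URLs that appear in the blacklist.

    tweets:    iterable of (url, tweeted_at) pairs (assumed layout)
    blacklist: set of blacklisted URL strings (simplified stand-in)
    window:    None for a comprehensive lookup, or a timedelta to
               restrict the lookup to recently tweeted URLs
    """
    cutoff = None if window is None else now - window
    hits = []
    for url, tweeted_at in tweets:
        # The fast variant skips anything older than the cutoff.
        if cutoff is not None and tweeted_at < cutoff:
            continue
        if url in blacklist:
            hits.append(url)
    return hits


if __name__ == "__main__":
    now = datetime(2020, 1, 2, tzinfo=timezone.utc)
    tweets = [
        ("http://bad.example/a", now - timedelta(days=2)),
        ("http://bad.example/b", now - timedelta(hours=1)),
        ("http://ok.example/", now - timedelta(hours=1)),
    ]
    blacklist = {"http://bad.example/a", "http://bad.example/b"}
    print(lookup_tweeted_urls(tweets, blacklist, now))
    print(lookup_tweeted_urls(tweets, blacklist, now, window=timedelta(hours=24)))
```

The comprehensive form scans every stored tweet on each run, so its cost grows with the length of the experiment, whereas the windowed form is bounded by roughly one day of tweets.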

Other Services

Experimental setups, tests, supporting system, etc:

| Service name | Description | File |
| --- | --- | --- |
| ASCII Text | Used at the ISG open day stall to showcase my measurement infrastructure. Displays ASCII text of the project title and authors in the main GNU Screen window, whilst experiments ran in other windows. Requires asciimatics | ascii_text.py |
| Bitly Click Stats | Uses the Bitly API to access click stats for Bitly URLs collected via Twitter's Stream API | bitly_click_stats.py |
| CertStream | Uses CertStream-Python (a library for seeing SSL certificates as they are issued live) to create a dataset of potentially suspicious SSL domain certificates, for later verification with certstream_blacklist_url_lookup.py | certstream.py |
| CertStream Blacklist URL Lookup | Checks the existing dataset of potentially suspicious SSL domain certificates for blacklist membership | certstream_blacklist_url_lookup.py |
| Compare GSB URL Hash Prefixes | Compares URL hash prefixes in the GSB blacklist to determine hash collisions and unique URL hashes | compare_gsb_hash_prefixes.py |
| Compare T.co to Blacklists | Counts the total number of blocked t.co URLs that also appear in GSB, OP, or PT | compare_t_co_to_blacklists.py |
| Count Num Domains | Counts (and extracts) the total domain names in the tweeted URLs dataset | count_num_domains.py |
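To illustrate what comparing hash prefixes involves: the Safe Browsing Update API (v4) distributes truncated SHA-256 hashes of URL expressions, most commonly 4 bytes long, so distinct URLs can share the same prefix. The sketch below groups URL expressions by prefix to find such collisions. It is a simplified illustration and not the code in compare_gsb_hash_prefixes.py: it skips the URL canonicalisation and host-suffix/path-prefix expansion that the API specifies, and the function names are hypothetical.

```python
import hashlib
from collections import defaultdict


def hash_prefix(url_expression, prefix_len=4):
    """Return the first prefix_len bytes of the SHA-256 digest of a URL expression."""
    return hashlib.sha256(url_expression.encode("utf-8")).digest()[:prefix_len]


def find_prefix_collisions(url_expressions, prefix_len=4):
    """Map each shared prefix to the distinct URL expressions that hash to it."""
    by_prefix = defaultdict(set)
    for expression in url_expressions:
        by_prefix[hash_prefix(expression, prefix_len)].add(expression)
    # Keep only prefixes shared by more than one distinct expression.
    return {
        prefix: sorted(expressions)
        for prefix, expressions in by_prefix.items()
        if len(expressions) > 1
    }
```

With 4-byte prefixes, collisions are rare in small samples; passing a shorter `prefix_len` makes them easy to observe when experimenting.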

Publications

Research papers that feature results obtained with Phishalytics:

Winner of the Best Paper and Best Student Paper awards:
BELL, S., AND KOMISARCZUK, P. "Measuring the Effectiveness of Twitter’s URL Shortener (t.co) at Protecting Users from Phishing and Malware Attacks". In Proceedings of the Australasian Computer Science Week Multiconference (2020).

BELL, S., AND KOMISARCZUK, P. "An Analysis of Phishing Blacklists: Google Safe Browsing, OpenPhish, and PhishTank". In Proceedings of the Australasian Computer Science Week Multiconference (2020).

BELL, S., PATERSON, K., AND CAVALLARO, L. "Catch Me (On Time) If You Can: Understanding the Effectiveness of Twitter URL Blacklists". arXiv preprint arXiv:1912.02520 (2019).

Languages

Language:Python 100.0%