Source code and training data for the academic paper are available here.
- bin - Contains all entry points for the system
- system - All source code for the system, including fetching and classifying data
- analysis - All code to analyze classification performance
- Create a Python virtual environment (optional) to avoid conflicts with local packages:
  python3 -m venv name-of-environment
- Activate the virtual environment so packages are installed in isolation:
  source name-of-environment/bin/activate
- Install the required dependencies for disinfo-infra via pip:
  pip install -r requirements.txt
- Set up a development install via setup.py:
  python setup.py develop
  This lets changes take effect while developing without re-installing on each change.
- Develop!
- To deactivate the virtual environment when you are done:
  deactivate
- Download source
- pip install -r requirements.txt
- python setup.py install
disinfo_net_data_fetch.py
- Continually fetches new domains and raw data for those domains. Implemented domain pipes include Reddit, Twitter, Certstream, and DomainTools.
disinfo_net_train_classifier.py
- Trains the classifier from designated training data.
disinfo_net_classify.py
- Classifies raw data fetched by disinfo_net_data_fetch.py. It extracts features, classifies websites, and inserts the results into a database table named by the user. It can run in "live" mode, constantly classifying new domains as they are fetched by disinfo_net_data_fetch.py, or it can classify an entire database of candidate domains at once.
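As a rough illustration of the "live" mode described above, the loop below is a minimal sketch; the db and classifier objects and their methods (fetch_unclassified_rows, extract_features, classify, insert_classification) are hypothetical stand-ins for the real Orchestrate, Classify, and Postgres interfaces, not the repository's actual API.

```python
import time

POLL_INTERVAL_SECONDS = 30  # hypothetical polling interval


def live_classify(db, classifier):
    """Continuously classify raw rows as the fetcher inserts them (sketch)."""
    while True:
        # Hypothetical helper: pull raw rows that do not yet have a classification
        for row in db.fetch_unclassified_rows():
            features = classifier.extract_features(row)            # assumed interface
            label, probabilities = classifier.classify(features)   # assumed interface
            db.insert_classification(row["domain"], label, probabilities)
        time.sleep(POLL_INTERVAL_SECONDS)
```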
- Orchestrate - Contains a conductor class that handles thread creation for domain pipes and worker threads, a worker thread class that fetches raw data for a domain, and a classification thread class that extracts features from raw data and classifies them.
- Classify - Classifier class with classes and functions for training, extracting features, and classifying candidate domains.
- Features - Classes to both fetch raw data and extract features from that raw data.
- Pipe - Contains an abstract base class for domain pipes that defines the standard interface the system expects when a domain is processed; current implementations include Reddit, Twitter, Certstream, and DomainTools domain pipes (a rough sketch follows this list).
- Postgres - Classes to interact with a Postgres database, including inserting, checking, and retrieving data.
- util - Various utility classes, including classes to unshorten URLs, get TLDs, and determine ownership of IP addresses.
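As a rough sketch of the Pipe interface described above, a domain pipe might look like the following; the class and method names (DomainPipe, get_domains, new_posts) are assumptions for illustration, not the repository's actual API.

```python
from abc import ABC, abstractmethod


class DomainPipe(ABC):
    """Hypothetical stand-in for the abstract base class in Pipe."""

    @abstractmethod
    def get_domains(self):
        """Yield (domain, post_id, platform) tuples observed on this platform."""
        ...


class RedditPipe(DomainPipe):
    """Sketch of a Reddit pipe: scan new posts and yield the domains they link to."""

    def __init__(self, reddit_client, subreddits):
        self.reddit = reddit_client      # assumed pre-configured Reddit API client
        self.subreddits = subreddits

    def get_domains(self):
        for subreddit in self.subreddits:
            for post in self.reddit.new_posts(subreddit):  # assumed client method
                yield post.domain, post.id, "reddit"
```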
Our system works in two parts. The first is a data fetching script, which inserts raw data into a database table structured as follows:
Attribute (Type) {
domain (Text) (Primary Key),
certificate (Text),
whois (Text),
html (Text),
dns (Text),
post_id (Text),
platform (Text),
insertion_time (UTC)
}
Where each attribute is:
- domain - unique domain which the rest of the data is associated with
- certificate - the certificate, in raw string format, of the domain
- whois - the whois response in raw string format
- html - the raw HTML source of the homepage of the domain
- dns - the IP address(es) that the domain was found to map to
- post_id - the unique id of the post on the given platform (for verification purposes)
- platform - the platform on which the domain was posted
- insertion_time - time of insertion into the database
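For concreteness, inserting one raw-data row from Python might look like the sketch below. The table name raw_data, the psycopg2 connection parameters, and the literal values are assumptions; in the actual system the table name is user-supplied and inserts go through the Postgres classes.

```python
import psycopg2

# Assumed connection parameters and table name for illustration only.
conn = psycopg2.connect(dbname="disinfo", user="postgres")

row = {
    "domain": "example.com",
    "certificate": "<raw certificate string>",
    "whois": "<raw whois response>",
    "html": "<raw homepage HTML>",
    "dns": "93.184.216.34",
    "post_id": "abc123",
    "platform": "reddit",
}

with conn, conn.cursor() as cur:
    cur.execute(
        """
        INSERT INTO raw_data
            (domain, certificate, whois, html, dns, post_id, platform, insertion_time)
        VALUES (%(domain)s, %(certificate)s, %(whois)s, %(html)s, %(dns)s,
                %(post_id)s, %(platform)s, NOW())
        ON CONFLICT (domain) DO NOTHING
        """,
        row,
    )
```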
The second part of our system classifies a domain given raw data about that domain, and inserts those classifications into a database table structured as follows:
Attribute (Type) {
domain (Text) (Primary Key),
classification (Text) (one of: unclassified, non_news, disinformation),
probabilities (JSON),
insertion_time (UTC)
}
Where each attribute is:
- domain - unique domain which the rest of the data is associated with
- classification - the actual classification of the domain by the classifier, one of: news, non_news, disinformation
- probabilities - the probabilities of each class mentioned above in a JSON dictionary format
- insertion_time - time of insertion to the database
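A small, hypothetical example of querying this table and reading the probabilities JSON follows; the table name classifications and the connection parameters are assumptions, since the real table name is supplied by the user.

```python
import json
import psycopg2

conn = psycopg2.connect(dbname="disinfo", user="postgres")  # assumed credentials

with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT domain, classification, probabilities FROM classifications "
        "WHERE classification = %s",
        ("disinformation",),
    )
    for domain, classification, probabilities in cur.fetchall():
        # The JSON column may come back as a dict or a string depending on driver setup.
        probs = probabilities if isinstance(probabilities, dict) else json.loads(probabilities)
        print(domain, classification, probs)
```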
Finally, we provide a prepopulated training database containing the raw data for all of our training domains, in the following format:
Attribute (Type) {
domain (Text) (Primary Key),
target (Text) (one of: unclassified, non_news, disinformation),
certificate (Text),
whois (Text),
html (Text),
dns (Text)
}
Where each attribute is the same as our raw data table, with target being the known label for the domain.
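The training entry point presumably reads these rows, extracts features, and fits a model; the sketch below shows the general shape with a scikit-learn classifier. The table name training_data, the connection parameters, the extract_features placeholder, and the choice of RandomForestClassifier are assumptions standing in for the repository's Classify and Features code, not the paper's actual model.

```python
import psycopg2
from sklearn.ensemble import RandomForestClassifier


def extract_features(certificate, whois, html, dns):
    """Hypothetical placeholder for the real feature extraction in Features."""
    return [len(certificate or ""), len(whois or ""), len(html or ""), len(dns or "")]


conn = psycopg2.connect(dbname="disinfo", user="postgres")  # assumed credentials

with conn, conn.cursor() as cur:
    cur.execute("SELECT target, certificate, whois, html, dns FROM training_data")
    rows = cur.fetchall()

X = [extract_features(cert, whois, html, dns) for _, cert, whois, html, dns in rows]
y = [target for target, *_ in rows]

model = RandomForestClassifier()  # stand-in model; the paper's classifier may differ
model.fit(X, y)
```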
Navigate your Chrome browser to chrome://extensions and enable developer mode in the top right.
Click "Load Unpacked" and upload the contents of the src/plugin/ directory.
Navigate to any of the sites listed in src/plugins/classified_sites.txt (for example, needtoknow.news) and you will see the warning message.