levanhieu-git / AutoFR

AutoFR generates filter rules for the web to block ads while considering visual breakage automatically.

Home Page: https://athinagroup.eng.uci.edu/projects/ats-on-the-web/

AutoFR: Automated Filter Rule Generation for Adblocking

We introduce AutoFR, a reinforcement learning (RL) framework to fully automate the process of filter rule creation to block ads and minimize visual breakage optimized per-site. The implementation of the framework is based on the paper, "AutoFR: Automated Filter Rule Generation for Adblocking" (USENIX Security 2023). If you use AutoFR for publication, please cite us.

For more information, see our Project Page.

AutoFR Dataset

The dataset and its detailed description are available. In summary, the dataset contains 1042 zip files, one per-site. Each zip file includes the raw collected data of outgoing HTTP requests, AdGraphs, annotated site snapshots, the action space, filter rules, and more.

The dataset also includes a Top5K_rules.csv file that lists the filter rules contained in each zip file.

Users must sign a consent form (at the bottom of the web page) before accessing the dataset.

How It Works

AutoFR is the first to balance the trade-off between blocking ads and avoiding visual breakage. The user gives AutoFR its inputs (e.g., the website to generate rules for and the breakage tolerance threshold w). AutoFR then runs our RL algorithm, which is based on multi-armed bandits, and generates filter rules that block ads while adhering to the given threshold w.
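The bandit idea can be illustrated with a toy sketch. This is not AutoFR's implementation: the candidate rules, the simulated outcomes, the reward function, and the UCB1 selection rule below are all assumptions made for illustration. The paper defines the actual algorithm and reward.

```python
import math
import random

# Hypothetical simulated outcomes per candidate rule (not real data):
# (fraction of ads blocked, fraction of images/text still visible).
SIMULATED = {
    "||ads.example.com^": (0.9, 0.97),
    "||cdn.example.com^": (0.6, 0.40),
    "||example.com/ads.js": (0.3, 1.00),
}


def clip(x):
    """Keep a simulated fraction inside [0, 1]."""
    return min(1.0, max(0.0, x))


def reward(ads_blocked, content_kept, w):
    """Toy reward: -1 if the rule keeps less content than w, else ads blocked."""
    if content_kept < w:
        return -1.0
    return ads_blocked


def run_bandit(w, rounds=200, seed=0):
    """Pull arms (rules) with UCB1 and keep rules whose mean reward is positive."""
    rng = random.Random(seed)
    arms = list(SIMULATED)
    counts = {arm: 0 for arm in arms}
    means = {arm: 0.0 for arm in arms}
    for t in range(1, rounds + 1):
        untried = [arm for arm in arms if counts[arm] == 0]
        if untried:
            arm = untried[0]  # try every arm once first
        else:
            arm = max(
                arms,
                key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]),
            )
        ads, kept = SIMULATED[arm]
        # Small noise stands in for variation between site visits.
        ads = clip(ads + rng.uniform(-0.02, 0.02))
        kept = clip(kept + rng.uniform(-0.02, 0.02))
        r = reward(ads, kept, w)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # incremental mean
    selected = [arm for arm in arms if means[arm] > 0]
    return selected, means
```

With w = 0.9 the sketch drops the rule that removes too much content. With a lower w such as 0.3 it keeps all three candidates.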

AutoFR Implementation

AutoFR Example Workflow (Fig. 4 of paper): INITIALIZE (a–c, Alg. 1): (a) spawns n=10 docker instances and visits the site until it finishes loading; (b) extracts the outgoing requests from all visits and builds the action space; (c) extracts the raw graph and annotates it to denote visible ads, images, and text, using JS and Selenium. Once all 10 site snapshots are annotated, we run the RL portion of the AutoFR procedure (steps 1–4). Lastly, AutoFR outputs the filter rules at step 5, e.g., ||s.yimg.com/rq/darla/4-10-0/html/r-sf.html.
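Step (b) can be pictured with a short sketch that turns outgoing request URLs into candidate rules at different granularities. The helper names are hypothetical, and the registrable domain is approximated by the last two host labels, which is wrong for suffixes such as co.uk. A public-suffix-aware library such as tldextract (listed under Python Dependencies) would be the safer choice. This is an illustration, not the code AutoFR uses.

```python
from urllib.parse import urlsplit


def candidate_rules(url):
    """Return candidate rules for one request URL, coarsest first."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if not host:
        return []
    # Assumption: the last two labels approximate the registrable domain.
    registrable = ".".join(host.split(".")[-2:])
    rules = [f"||{registrable}^"]
    if host != registrable:
        rules.append(f"||{host}^")
    # The query string is ignored; only the path is used.
    if parts.path and parts.path != "/":
        rules.append(f"||{host}{parts.path}")
    return rules


def build_action_space(urls):
    """Collect unique candidate rules across all requests, keeping first-seen order."""
    seen = {}
    for url in urls:
        for rule in candidate_rules(url):
            seen.setdefault(rule, None)
    return list(seen)
```

For the request behind the example rule above, candidate_rules yields a domain-level rule, a host-level rule, and the full-path rule.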

For more information, see Background Information.

Running AutoFR

Follow the instructions below to run AutoFR. Preview the dependencies.

Setup

  1. We assume you satisfy the hardware and OS dependencies.

  2. Install the core dependencies.

    $ sudo apt-get install git python3 python3-dev python3-pip

    $ pip3 install virtualenv

    1. Install docker using its official instructions.
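  3. Clone the repository and initialize its submodules.

    $ git clone https://github.com/UCI-Networking-Group/AutoFR.git

    Run the following inside the cloned AutoFR directory:

    1. If you are an artifact reviewer, run git checkout artifact-review
    2. Run git submodule update --init --recursive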

  4. Navigate to the project directory using a terminal window.

  5. Create a virtual environment and activate it.

    $ virtualenv --python=python3 [/save-path/autofrenv]

    $ source [/save-path/autofrenv]/bin/activate

  6. Install AutoFR dependencies.

    $ pip3 install -e .

  7. Build the docker container.

    $ docker build -t flg-ad-highlighter-adgraph --build-arg USER_ID=$(id -u) --build-arg GROUP_ID=$(id -g) -f framework-with-ad-highlighter/DockerAdgraphfile .

  8. Create the output directories that AutoFR expects. See Understanding the Output for a description.

    $ mkdir temp_graphs; mkdir -p data/output/

  9. Done. You are now ready to use AutoFR.

Create Filter Rules

  1. Make sure you have followed the setup instructions.
  2. Open up the AutoFR project directory using a terminal window.
  3. Activate your virtual environment.
  4. Choose a site that has ads with AdChoice transparency logos. We use https://cricbuzz.com as an example here.
  5. Choose how many docker instances you can start in parallel. This depends on the number of cores you have on your system. Pass it using the --chunk_threshold argument. Below, we use 6 as an example.
  6. Run AutoFR:

    $ python scripts/autofr_controlled.py --site_url "https://cricbuzz.com" --chunk_threshold 6

  7. Filter rules will be presented at the end.

Explore other possible inputs you can give scripts/autofr_controlled.py by running:

$ python scripts/autofr_controlled.py --help
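If you want to launch the script from Python, for example to loop over several sites, a small helper can assemble the command. This is a sketch: the function name is hypothetical, only the two flags shown above are used, and the default of leaving two cores free is an assumption rather than a project recommendation.

```python
import os
import shlex


def build_autofr_command(site_url, chunk_threshold=None):
    """Build the argument list for scripts/autofr_controlled.py."""
    if chunk_threshold is None:
        # Assumption: leave two cores free for the host; tune for your machine.
        chunk_threshold = max(1, (os.cpu_count() or 1) - 2)
    return [
        "python",
        "scripts/autofr_controlled.py",
        "--site_url",
        site_url,
        "--chunk_threshold",
        str(chunk_threshold),
    ]


def as_shell_string(command):
    """Render the argument list as a copy-pasteable shell command."""
    return " ".join(shlex.quote(part) for part in command)
```

Pass the returned list to subprocess.run(command, check=True) from the project directory with the virtual environment active.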

Understanding the Output

  • Go to data/output to see the raw collected data, such as the outgoing HTTP requests, AdGraphs, and site snapshots.
  • Go to temp_graphs to see the outputted filter rules, the action space, and various other information.
  • The output follows our dataset format. See AutoFR Dataset.

Test the Rules In-the-Wild

Test the rules by applying them to the site that you created them for.

  1. Install an adblocker, like Adblock Plus, into your browser (instructions depend on your browser).
  2. Configure the extension by going to its settings. Turn off all filter lists.
  3. Turn the rules given by AutoFR into per-site rules. For each created rule, append the site it was created for. For instance, if the rule is ||doubleclick.net^ for the site cricbuzz.com, then change it to ||doubleclick.net^$domain=cricbuzz.com
  4. Add in custom rules given by AutoFR (and transform them to be per-site rules). See further instructions.
  5. Refresh the site to see if ads are blocked. Note if there is any visual breakage.
  6. Remember to undo the changes if you use the adblocker personally.
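The per-site transformation in step 3 can be scripted. The sketch below uses hypothetical helper names and makes two assumptions: a leading "www." is stripped from the site host, and a rule that already contains "$" is treated as having an options list, so the domain option is appended after a comma.

```python
from urllib.parse import urlsplit


def site_domain(site):
    """Extract the host from a site URL or bare domain, dropping a leading www."""
    target = site if "://" in site else "//" + site
    host = urlsplit(target).hostname or ""
    return host[4:] if host.startswith("www.") else host


def to_per_site_rule(rule, site):
    """Restrict a filter rule to one site with the domain option."""
    domain = site_domain(site)
    if not domain:
        raise ValueError(f"cannot determine domain from {site!r}")
    # Assumption: a "$" in the rule marks an existing options list.
    separator = "," if "$" in rule else "$"
    return f"{rule}{separator}domain={domain}"
```

For the example in step 3, to_per_site_rule("||doubleclick.net^", "cricbuzz.com") returns ||doubleclick.net^$domain=cricbuzz.com.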

Reuse Site Snapshots

As an example of how site snapshots can be reused, we provide the following instructions on reproducing our results from the paper.

  1. Make sure you have followed the setup instructions.
  2. Open up the AutoFR project directory using a terminal window.
  3. Activate your virtual environment.
  4. Get access to our dataset.
  5. Download the Top5K_rules.csv within our dataset. Open it and choose a zip file to download, keeping track of the site URL as well.
  6. Here we assume you chose AutoFRGEval_www.cricbuzz.com_ad3dce7b.zip. Unzip the file.
  7. $ python scripts/autofr_use_snapshots.py --site_url "https://www.cricbuzz.com/" --snapshot_dir [zip name]/[Snapshots directory]

    • Full example:

    $ python scripts/autofr_use_snapshots.py --site_url "https://www.cricbuzz.com/" --snapshot_dir AutoFRGEval_www.cricbuzz.com_ad3dce7b/AutoFRGControlled_www.cricbuzz.com_AdGraph_Snapshots_82af60e4

  8. It should print out the same filter rules as listed in the CSV file for that particular site.

Requirements and Description

Hardware Dependencies

AutoFR was evaluated using Amazon EC2 instance m5.2xlarge, which has 8 cores, 32 GiB of memory, 35 GiB of storage, and up to 10 Gbps of network bandwidth. We recommend something similar, going as low as 16 GiB of memory with 20 GiB of storage.

Software Dependencies

We list the dependencies that are necessary to run AutoFR. To install them, follow the instructions in the Setup section.

OS Dependencies

AutoFR has been tested on Debian 5.10 and Ubuntu 18.04.6 LTS.

Core Dependencies

  • Python 3.6+
  • git
  • pip3
  • virtualenv (or conda)
  • docker
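A quick preflight check can confirm that the core dependencies are on your PATH before you start the setup. This is a convenience sketch with hypothetical function names, not part of AutoFR. It checks for virtualenv only, so adjust the tool list if you use conda instead.

```python
import shutil
import sys

# Executables from the Core Dependencies list (virtualenv chosen over conda).
REQUIRED_TOOLS = ("git", "pip3", "virtualenv", "docker")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_ok(version_info=sys.version_info, minimum=(3, 6)):
    """Check that the running interpreter meets the Python 3.6+ requirement."""
    return tuple(version_info[:2]) >= minimum
```

Calling missing_tools() with no arguments returns an empty list when everything is installed.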

Python Dependencies

  • tldextract
  • networkx
  • adblockparser
  • pandas
  • numpy
  • selenium
  • pyvirtualdisplay

Prior Work Dependencies

AutoFR integrates browser extensions and an instrumented browser:

  • Ad Highlighter: a browser extension that detects iframe ads based on AdChoice logos.
  • AdGraph: an instrumented Chromium browser that generates a raw graph representation of how the site is loaded.
  • Adblock Plus: a browser extension that blocks HTTP requests using filter rules, and more...

Background Information

To further understand our system, we describe some important terminology below and reference the paper when needed.

  • Filter Rules: AutoFR focuses on filter rules that block HTTP requests to remove ads while minimizing visual breakage (such as missing images and text) for websites. Example of rules with different granularity: ||example.com^ , ||ads.example.com^, ||example.com/ads.js.
  • Site Snapshots: Graph representations of how a site is loaded. The nodes represent HTML elements, JS scripts, and HTTP requests. The edges represent whether HTML elements are connected based on HTML structure, whether JS scripts initiated a request, whether HTML elements initiated a request, whether JS scripts created an HTML element, etc. See Sec 4.1 and Fig. 5.
  • Threshold w: A design parameter that helps AutoFR balance the trade-off between blocking ads and avoiding visual breakage. It ranges from 0 to 1. Higher values mean the user prefers to avoid breakage, even at the cost of not creating any filter rules. See Sec. 3 and 3.3.2.
  • Breakage: In our case, breakage means that after a filter rule blocks some HTTP requests, some legitimate images and text of a website may be missing. For instance, for a news site, breakage would entail missing article titles, descriptions, and images. See Eq. (2) of paper.
  • Detecting Visual Components: AutoFR relies on the detection of ads, images, and text on a website. For ads, we rely on Ad Highlighter (see Prior Work Dependencies). For images and text, we write our own custom JS to walk the HTML DOM and identify elements with tags or CSS background-url. To determine visibility, we look at whether the element's width and height are > 2px and its opacity > 0.1. See Sec. 4.3.
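The visibility test described in the last bullet is simple enough to state as code. AutoFR implements it in custom JS; the Python function below only restates the stated thresholds, and its name and parameters are hypothetical.

```python
def is_visible(width_px, height_px, opacity):
    """Visible if width and height both exceed 2px and opacity exceeds 0.1."""
    return width_px > 2 and height_px > 2 and opacity > 0.1
```

For example, a 300x250 element at full opacity counts as visible, while a 1x1 tracking pixel or a nearly transparent element does not.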

Disclaimer

The web changes naturally. AutoFR is only as good as its components. Thus, if a site does not serve ads that Ad Highlighter can detect, or if it uses obfuscation techniques, then AutoFR may not be able to generate rules for the given site. See Sec. 5.3.4 and 4.3. Other factors may also play a role, such as w being too high to generate rules for. Over time, AutoFR will improve as we maintain it, but we cannot guarantee that it will work on every website.

Citation

If you create a publication using AutoFR, please cite the corresponding paper as follows:

@inproceedings{le2023autofr,
  title     = {{AutoFR: Automated Filter Rule Generation for Adblocking}},
  author    = {Le, Hieu and Elmalaki, Salma and Markopoulou, Athina and Shafiq, Zubair},
  booktitle = {Proceedings of the 32nd USENIX Security Symposium (USENIX Security)},
  year      = {2023},
  month     = aug,
  address   = {Anaheim, CA}
}

We also encourage you to provide us (athinagroupreleases@gmail.com) with a link to your publication. We use this information in reports to our funding agencies.

Contact Us

Feel free to contact the authors, specifically Hieu Le, if you have any questions.

Acknowledgements

To integrate AdGraph successfully, we thank its authors (Umar Iqbal), who graciously provided the necessary code to help parse AdGraphs. We include it in adgraphapi with slight modifications.

About

License: GNU General Public License v2.0


Languages

  • C++ 64.1%
  • Python 32.7%
  • JavaScript 2.4%
  • Shell 0.4%
  • Dockerfile 0.2%
  • Makefile 0.0%