CorentinB / Architeuthis

MITM HTTP(S) proxy with integrated load-balancing, rate-limiting and error handling. Built for automated web scraping.

Architeuthis 🦑

HTTP(S) proxy with integrated load-balancing, rate-limiting and error handling. Built for automated web scraping.

  • Strictly obeys configured rate-limiting for each IP & Host
  • Seamless exponential backoff retries on timeout or error HTTP codes
  • Requires no additional configuration for integration into existing programs
  • Configurable per-host behavior
  • Monitoring with InfluxDB
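The exponential backoff mentioned in the list above can be sketched as follows. This is only an illustration, not the proxy's actual code: it assumes the `wait`, `multiplier` and `retries` settings from the sample configuration mean "initial delay", "growth factor" and "number of retries", and the function name `backoff_schedule` is hypothetical.

```python
from typing import List


def backoff_schedule(wait: float, multiplier: float, retries: int) -> List[float]:
    """Return the delay (in seconds) before each retry attempt.

    Assumption: the first retry waits `wait` seconds and every following
    retry waits `multiplier` times longer than the previous one.
    """
    return [wait * multiplier ** attempt for attempt in range(retries)]


if __name__ == "__main__":
    # Values taken from the sample configuration: wait=4s, multiplier=2.5, retries=3
    for attempt, delay in enumerate(backoff_schedule(4.0, 2.5, 3), start=1):
        print(f"retry {attempt}: wait {delay:g}s")
```

Under these assumptions the delays grow quickly, so a large `retries` value combined with a large `multiplier` can keep a single request pending for a long time.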

[Image: Grafana dashboard]

Typical use case

[Image: typical use case diagram]

Usage

git clone https://github.com/simon987/Architeuthis
cd Architeuthis
vim config.json # Configure settings here

docker-compose up

You can add proxies using the /add_proxy API:

# Quote the URL so the shell does not interpret the "&"
curl "http://<Architeuthis IP>:5050/add_proxy?url=<url>&name=<name>"

Or automatically using ProxyBroker:

python3 import_from_broker.py http://<Architeuthis IP>:5050
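The /add_proxy call shown with curl above can also be scripted. The sketch below uses only the Python standard library; the helper names `build_add_proxy_url` and `add_proxy` are hypothetical, and it assumes the endpoint accepts a plain GET with URL-encoded `url` and `name` parameters, as the curl example suggests.

```python
import urllib.parse
import urllib.request


def build_add_proxy_url(architeuthis: str, proxy_url: str, name: str) -> str:
    """Build the /add_proxy URL, percent-encoding the query parameters."""
    query = urllib.parse.urlencode({"url": proxy_url, "name": name})
    return f"{architeuthis.rstrip('/')}/add_proxy?{query}"


def add_proxy(architeuthis: str, proxy_url: str, name: str, timeout: float = 10.0) -> int:
    """Register one upstream proxy and return the HTTP status code of the reply."""
    target = build_add_proxy_url(architeuthis, proxy_url, name)
    with urllib.request.urlopen(target, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Only prints the URL; call add_proxy(...) against a running instance.
    print(build_add_proxy_url("http://localhost:5050", "http://1.2.3.4:8080", "p0"))
```

Encoding the parameters matters because the proxy URL itself contains `:` and `/` characters.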

Example usage with wget

export http_proxy="http://localhost:5050"
# --no-check-certificate is necessary for https mitm
# You don't need to specify user-agent if it's already in your config.json
wget -m -np -c --no-check-certificate -R 'index.html*' http://ca.releases.ubuntu.com/

With "every": "500ms" and a single proxy, you should see log output similar to:

...
level=trace msg=Sleeping wait=414.324437ms
level=trace msg="Routing request" conns=0 proxy=p0 url="http://ca.releases.ubuntu.com/12.04/SHA1SUMS.gpg"
level=trace msg=Sleeping wait=435.166127ms
level=trace msg="Routing request" conns=0 proxy=p0 url="http://ca.releases.ubuntu.com/12.04/SHA256SUMS"
level=trace msg=Sleeping wait=438.657784ms
level=trace msg="Routing request" conns=0 proxy=p0 url="http://ca.releases.ubuntu.com/12.04/SHA256SUMS.gpg"
level=trace msg=Sleeping wait=457.06543ms
level=trace msg="Routing request" conns=0 proxy=p0 url="http://ca.releases.ubuntu.com/12.04/ubuntu-12.04.5-alternate-amd64.iso"
level=trace msg=Sleeping wait=433.394361ms
...
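The wget setup above can be mirrored from a Python client. The following is a minimal sketch using the standard library, assuming Architeuthis listens on `localhost:5050` as in the example; `build_proxy_opener` is a hypothetical helper name. Certificate verification is disabled for the same reason wget needs `--no-check-certificate`.

```python
import ssl
import urllib.request


def build_proxy_opener(proxy_addr: str = "http://localhost:5050") -> urllib.request.OpenerDirector:
    """Return an opener that sends HTTP and HTTPS requests through the proxy."""
    # The MITM proxy re-signs HTTPS traffic, so verification has to be turned off.
    # check_hostname must be disabled before verify_mode can be set to CERT_NONE.
    insecure = ssl.create_default_context()
    insecure.check_hostname = False
    insecure.verify_mode = ssl.CERT_NONE
    return urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy_addr, "https": proxy_addr}),
        urllib.request.HTTPSHandler(context=insecure),
    )


if __name__ == "__main__":
    opener = build_proxy_opener()
    # Use a generous timeout: the proxy may sleep and retry before answering.
    with opener.open("http://ca.releases.ubuntu.com/", timeout=120) as response:
        print(response.status)
```

Only disable verification for traffic that is deliberately routed through the proxy.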

Hot config reload

# Note: this resets the current rate limiters. If there are many active
# connections, the reload might cause a small request spike that briefly
# exceeds the configured rate limits.
./reload.sh
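Why a reload can cause a spike is easier to see with a toy limiter. The sketch below assumes a token-bucket style limiter driven by the `every` and `burst` settings; that is an assumption about the implementation, and the `TokenBucket` class is purely illustrative. A fresh bucket starts full, so replacing a drained limiter with a new one allows up to `burst` requests immediately.

```python
class TokenBucket:
    """Toy token bucket: one token is refilled every `every` seconds, up to `burst`."""

    def __init__(self, every: float, burst: int, now: float = 0.0) -> None:
        self.every = every
        self.burst = float(burst)
        self.tokens = float(burst)  # a new limiter starts with a full bucket
        self.last = now

    def reserve(self, now: float) -> float:
        """Reserve one request and return how long the caller must sleep first."""
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed / self.every)
        self.last = now
        self.tokens -= 1.0
        if self.tokens >= 0.0:
            return 0.0
        # Negative balance: wait until the borrowed tokens are refilled.
        return -self.tokens * self.every


if __name__ == "__main__":
    bucket = TokenBucket(every=0.5, burst=2)
    print([bucket.reserve(0.0) for _ in range(4)])
    # Simulate a config reload: the replacement limiter is full again.
    bucket = TokenBucket(every=0.5, burst=2)
    print(bucket.reserve(0.0))
```

The clock is passed in explicitly so the behavior can be checked without sleeping.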

Rules

Conditions

Left operand     | Description              | Allowed operators | Right operand
body             | Contents of the response | =, !=             | String w/ wildcard
body             | Contents of the response | <, >              | float
status           | HTTP response code       | =, !=             | String w/ wildcard
status           | HTTP response code       | <, >              | float
response_time    | HTTP response time       | <, >              | duration (e.g. 20s)
header:<header>  | Response header          | =, !=             | String w/ wildcard
header:<header>  | Response header          | <, >              | float

Note that response_time can never be higher than the configured timeout value.

Examples:

[
  {"condition":  "header:X-Test>10", "action":  "..."},
  {"condition":  "body=*Try again in a few minutes*", "action":  "..."},
  {"condition":  "response_time>10s", "action":  "..."},
  {"condition":  "status>500", "action":  "..."},
  {"condition":  "status=404", "action":  "..."},
  {"condition":  "status=40*", "action":  "..."}
]
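To make the condition syntax concrete, here is a small evaluator for the examples above. It is a sketch only and not how Architeuthis itself parses rules: wildcards are approximated with `fnmatch`, durations support a single unit, and the response is a plain dictionary with hypothetical keys (`status`, `body`, `headers`, `response_time` in seconds).

```python
import fnmatch
import re

# <left operand><operator><right operand>; "!=" must be tried before "=".
_CONDITION = re.compile(
    r"^(body|status|response_time|header:[^=!<>]+)(!=|=|<|>)(.*)$", re.DOTALL
)
_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_duration(text: str) -> float:
    """Convert a simple duration such as '20s' or '500ms' to seconds."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(ms|s|m|h)", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return float(match.group(1)) * _UNITS[match.group(2)]


def matches(condition: str, response: dict) -> bool:
    """Return True if `response` satisfies a single condition string."""
    parsed = _CONDITION.match(condition)
    if parsed is None:
        raise ValueError(f"invalid condition: {condition!r}")
    left, operator, right = parsed.groups()

    if left.startswith("header:"):
        value = response["headers"].get(left[len("header:"):], "")
    else:
        value = response[left]

    if operator in ("=", "!="):
        # String comparison with '*' wildcards (approximated by fnmatch).
        equal = fnmatch.fnmatchcase(str(value), right)
        return equal if operator == "=" else not equal

    # Numeric comparison; response_time is compared against a duration.
    threshold = parse_duration(right) if left == "response_time" else float(right)
    try:
        number = float(value)
    except ValueError:
        return False  # e.g. a missing or non-numeric header
    return number < threshold if operator == "<" else number > threshold
```

For example, `matches("status=40*", {"status": 404, "body": "", "headers": {}, "response_time": 0.2})` evaluates the wildcard against the string form of the status code.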

Actions

Action        | Description
should_retry  | Override default retry behavior for HTTP errors (by default it retries on 403, 408, 429, 444, 499, >500)
force_retry   | Always retry (up to retries_hard times)
dont_retry    | Immediately stop retrying

In the event of a temporary network error, should_retry is ignored: the request is always retried unless dont_retry is set.

Note that having too many rules for one host might negatively impact performance, especially body conditions on large responses.
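The interaction between the default behavior, the three actions and network errors can be summarized in a short decision function. This is one reading of the table above, not the proxy's code: it assumes `should_retry` means "retry even if the status is not in the default list", and it ignores the retry budgets. The function names are hypothetical.

```python
from typing import Optional

# Status codes retried by default, as listed above (plus anything above 500).
DEFAULT_RETRY_CODES = {403, 408, 429, 444, 499}


def default_should_retry(status: int) -> bool:
    """Default behavior when no rule matches."""
    return status in DEFAULT_RETRY_CODES or status > 500


def should_retry(status: Optional[int], action: Optional[str], network_error: bool = False) -> bool:
    """Decide whether to retry, given the action of the matching rule (if any)."""
    if action == "dont_retry":
        return False  # always wins, even for network errors
    if action == "force_retry":
        return True
    if network_error:
        return True  # should_retry is ignored for temporary network errors
    if action == "should_retry":
        return True  # assumption: overrides the default for this response
    return default_should_retry(status)
```

With the sample rule `status=403` mapped to `dont_retry`, a 403 that would be retried by default is returned to the client immediately.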

Sample configuration

{
  "addr": "localhost:5050",
  "timeout": "15s",
  "wait": "4s",
  "multiplier": 2.5,
  "retries": 3,
  "hosts": [
    {
      "host": "*",
      "every": "500ms",
      "burst": 25,
      "headers": {
        "User-Agent": "Some user agent for all requests",
        "X-Test": "Will be overwritten"
      }
    },
    {
      "host": "*.reddit.com",
      "every": "2s",
      "burst": 2,
      "headers": {
        "X-Test": "Will overwrite default"
      }
    },
    {
      "host": ".s3.amazonaws.com",
      "every": "2s",
      "burst": 30,
      "rules": [
        {"condition": "status=403", "action": "dont_retry"}
      ]
    }
  ]
}
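The header values in the sample ("Will be overwritten", "Will overwrite default") suggest that headers from the `*` entry are applied first and then overridden by more specific hosts. The sketch below illustrates that merge order; it assumes glob-style host matching via `fnmatch` and that entries are applied in file order, and `effective_headers` is a hypothetical helper. It does not try to interpret patterns without a wildcard, such as `.s3.amazonaws.com`.

```python
import fnmatch
from typing import Dict, List


def effective_headers(hosts: List[dict], hostname: str) -> Dict[str, str]:
    """Merge the headers of every host entry whose pattern matches `hostname`.

    Later entries override earlier ones, so the catch-all "*" entry should
    come first in the list.
    """
    merged: Dict[str, str] = {}
    for entry in hosts:
        if fnmatch.fnmatchcase(hostname, entry["host"]):
            merged.update(entry.get("headers", {}))
    return merged


if __name__ == "__main__":
    hosts = [
        {"host": "*", "headers": {"User-Agent": "Some user agent for all requests",
                                  "X-Test": "Will be overwritten"}},
        {"host": "*.reddit.com", "headers": {"X-Test": "Will overwrite default"}},
    ]
    print(effective_headers(hosts, "old.reddit.com"))
```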

About

License: GNU General Public License v3.0

Languages: Go 86.6%, Python 4.1%, Groovy 3.4%, HTML 3.1%, Dockerfile 2.7%, Shell 0.2%