bhanuc / cloudproxy

Hide your scrapers IP behind the cloud. Provision proxy servers across different cloud providers to improve your scraping success.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

CodeCov Coverage Codacy Quality Docker Cloud Build Status Contributors Forks Stargazers Issues MIT License

CloudProxy

cloudproxy

About The Project

The purpose of CloudProxy is to hide your scrapers IP behind the cloud. It allows you to spin up a pool of proxies using popular cloud providers with just an API token. No configuration needed.

CloudProxy exposes an API with the IPs and credentials of the provisioned proxies.

Providers supported:

Planned:

  • Google Cloud
  • Azure
  • Scaleway
  • Vultr

Inspired by

This project was inspired by Scrapoxy, though that project no longer seems actively maintained.

The primary advantage of CloudProxy over Scrapoxy is that CloudProxy only requires an API token from a cloud provider. CloudProxy automatically deploys and configures the proxy on the cloud instances without the user needing to preconfigure or copy an image.

Getting Started

To get a local copy up and running follow these simple steps.

Prerequisites

All you need is:

  • Docker

Installation

Environment variables:

Required

USERNAME - set the username for the forward proxy.

PASSWORD - set the password for the forward proxy.

Optional

AGE_LIMIT - set the age limit for your forward proxies in seconds. Once the age limit is reached, the proxy is replaced. A value of 0 disables the feature. Default value: 0.

See individual provider pages for environment variables required in above providers supported section.

Docker (recommended)

For example:

docker run -e USERNAME='CHANGE_THIS_USERNAME' \
    -e PASSWORD='CHANGE_THIS_PASSWORD' \
    -e DIGITALOCEAN_ENABLED=True \
    -e DIGITALOCEAN_ACCESS_TOKEN='YOUR SECRET ACCESS KEY' \
    -it -p 8000:8000 laffin/cloudproxy:latest

It is recommended to use a Docker image tagged to a version e.g. laffin/cloudproxy:0.3.0-beta, see releases for latest version.

Usage

CloudProxy exposes an API on localhost:8000. Your application can use the below API to retrieve the IPs with auth for the proxy servers deployed. Then your application can use those IPs to proxy.

The logic to cycle through IPs for proxying will need to be in your application, for example:

import random
import requests as requests


# Returns a random proxy from CloudProxy
def random_proxy():
    ips = requests.get("http://localhost:8000").json()
    return random.choice(ips['ips'])


proxies = {"http": random_proxy(), "https": random_proxy()}
my_request = requests.get("https://api.ipify.org", proxies=proxies)

CloudProxy UI

cloudproxy-ui

You can manage CloudProxy via an API and UI. You can access the UI at http://localhost/ui.

You can scale up and down your proxies and remove them for each provider via the UI.

CloudProxy API

List available proxy servers

Request

GET /

curl -X 'GET' 'http://localhost:8000/' -H 'accept: application/json'

Response

{"ips":["http://username:password:192.168.0.1:8899", "http://username:password:192.168.0.2:8899"]}

List random proxy server

Request

GET /random

curl -X 'GET' 'http://localhost:8000/random' -H 'accept: application/json'

Response

["http://username:password:192.168.0.1:8899"]

Remove proxy server

Request

DELETE /destroy

curl -X 'DELETE' 'http://localhost:8000/destroy?ip_address=192.1.1.1' -H 'accept: application/json'

Response

["Proxy to be destroyed"]

Get provider

Request

GET /provider/digitalocean

curl -X 'GET' 'http://localhost:8000/providers/digitalocean' -H 'accept: application/json'

Response

  {
    "ips": [
      "192.1.1.2",
      "192.1.1.3"
    ],
    "scaling": {
      "min_scaling": 2,
      "max_scaling": 2
    }
  }

Update provider

Request

PATCH /provider/digitalocean

curl -X 'PATCH' 'http://localhost:8000/providers/digitalocean?min_scaling=5&max_scaling=5' -H 'accept: application/json'

Response

  {
    "ips": [
      "192.1.1.2",
      "192.1.1.3"
    ],
    "scaling": {
      "min_scaling": 5,
      "max_scaling": 5
    }
  }

CloudProxy runs on a schedule of every 30 seconds, it will check if the minimum scaling has been met, if not then it will deploy the required number of proxies. The new proxy info will appear in IPs once they are deployed and ready to be used.

Roadmap

The project is at early alpha with limited features. In the future more providers will be supported, autoscaling will be implemented and a rich API to allow for blacklisting and recycling of proxies.

See the open issues for a list of proposed features (and known issues).

Contributing

Contributions are what make the open source community such an amazing place to be learn, inspire, and create. Any contributions you make are greatly appreciated.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Your Name - @christianlaffin - christian.laffin@gmail.com

Project Link: https://github.com/claffin/cloudproxy

Acknowledgements

About

Hide your scrapers IP behind the cloud. Provision proxy servers across different cloud providers to improve your scraping success.

License:MIT License


Languages

Language:Python 73.4%Language:Vue 20.6%Language:Shell 2.6%Language:HTML 1.7%Language:JavaScript 1.0%Language:Dockerfile 0.7%