dragospopa420 / scrapy-tor-proxy-rotator

Rotate TOR IPs in Scrapy

Scrapy Tor Proxy Rotation

This module allows Scrapy to rotate Tor IPs.

Install

Simple install, via pip:

pip install scrapy-tor-proxy-rotation

Config Tor

To configure Tor, first install it:

sudo apt-get install tor

Stop the service before changing its configuration:

sudo service tor stop

Open the configuration file, /etc/tor/torrc, as root. For example, using nano:

sudo nano /etc/tor/torrc

Add the lines below and save the file:

ControlPort 9051
CookieAuthentication 0

Restart Tor:

sudo service tor start
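
With ControlPort 9051 open and cookie authentication disabled, a client can ask Tor for a new circuit by sending SIGNAL NEWNYM over the control port. The sketch below uses only the Python standard library to show that exchange. It is an illustration, not this module's own code, and the function names and the password parameter are hypothetical.

```python
import socket


def control_commands(password=""):
    """Build the control-port commands that authenticate and request a new circuit.

    `password` is a hypothetical parameter; with the torrc shown above
    (no password, CookieAuthentication 0) an empty string is enough.
    """
    quoted = password.replace("\\", "\\\\").replace('"', '\\"')
    return [
        f'AUTHENTICATE "{quoted}"\r\n'.encode("utf-8"),
        b"SIGNAL NEWNYM\r\n",
    ]


def request_new_identity(host="127.0.0.1", port=9051, password="", timeout=10.0):
    """Send the commands to Tor and return its reply lines (each should start with 250)."""
    replies = []
    with socket.create_connection((host, port), timeout=timeout) as conn, \
            conn.makefile("rb") as reader:
        for command in control_commands(password):
            conn.sendall(command)
            reply = reader.readline().decode("ascii", "replace").strip()
            if not reply.startswith("250"):
                raise RuntimeError(f"Tor rejected {command!r}: {reply}")
            replies.append(reply)
        conn.sendall(b"QUIT\r\n")
    return replies
```

Calling request_new_identity() against a running Tor should return two "250 OK" replies. A new circuit does not guarantee a different exit IP, so check the resulting address if that matters.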

You can check your machine's IP and compare it with the IP seen through Tor as follows:

  • To see your IP:
    curl http://icanhazip.com/
  • To see the IP used through Tor:
    torify curl http://icanhazip.com/

Scrapy needs an intermediary to reach Tor, in this case Privoxy.

Tor Default Proxy Server: 127.0.0.1:9050

Install and Config Privoxy:

  • Install:
    sudo apt install privoxy
  • Stop the service:
    sudo service privoxy stop
  • Open the config file:
    sudo nano /etc/privoxy/config
  • Add the following line:
    forward-socks5t / 127.0.0.1:9050 .
  • Start the service:
    sudo service privoxy start
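
Before testing with curl, it can help to confirm that each service is actually listening. The sketch below tries a TCP connection to the ports used in this README: 9050 (Tor SOCKS), 9051 (Tor control port) and 8118 (the Privoxy port used in the test command). The helper names are hypothetical, and the check only shows that something accepts connections on the port, not that it is configured correctly.

```python
import socket

# Ports taken from this README; adjust them if your torrc or Privoxy config differs.
SERVICES = {
    "Tor SOCKS proxy": 9050,
    "Tor control port": 9051,
    "Privoxy HTTP proxy": 8118,
}


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host="127.0.0.1"):
    """Map each service name to whether its port accepts connections."""
    return {name: is_listening(host, port) for name, port in SERVICES.items()}
```

Printing check_services() after starting both services should show True for all three entries.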
    

Test:

torify curl http://icanhazip.com/
curl -x 127.0.0.1:8118 http://icanhazip.com/
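
The same check can be done from Python, which is closer to how Scrapy will use the proxy. This is a minimal sketch using urllib from the standard library. It assumes Privoxy listens on 127.0.0.1:8118 as in the curl command above, and the function names are hypothetical.

```python
import urllib.request

PRIVOXY_URL = "http://127.0.0.1:8118"  # same address as in the curl test
IP_ECHO_URL = "http://icanhazip.com/"


def make_opener(proxy=None):
    """Build an opener that uses `proxy` for HTTP, or no proxy at all when None."""
    proxies = {"http": proxy} if proxy else {}
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxies))


def fetch_ip(proxy=None, url=IP_ECHO_URL, timeout=15.0):
    """Return the public IP reported by the echo service, optionally via a proxy."""
    with make_opener(proxy).open(url, timeout=timeout) as response:
        return response.read().decode("ascii", "replace").strip()


def tor_is_in_use(timeout=15.0):
    """Compare the direct IP with the one seen through Privoxy and Tor."""
    return fetch_ip(None, timeout=timeout) != fetch_ip(PRIVOXY_URL, timeout=timeout)
```

If tor_is_in_use() returns False, requests through Privoxy are leaving with your own IP, so the forward-socks5t line is probably missing or Privoxy was not restarted.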

Use

With Tor and Privoxy configured, you can integrate Tor with Scrapy.

  • Configure the middleware in your settings file (settings.py):

    DOWNLOADER_MIDDLEWARES = {
        # ... (your other middlewares)
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
        'tor_ip_rotator.middlewares.TorProxyMiddleware': 100
    }
  • Add the settings below to your spider's custom_settings, or to settings.py if you want them to apply to every spider in the project:

    TOR_IPROTATOR_ENABLED = True
    TOR_IPROTATOR_CHANGE_AFTER = # number of requests made from the same IP address
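
For the per-spider case, the two settings can be collected in a dict and assigned to the spider's custom_settings attribute. The helper below is a hypothetical sketch, and the value 20 is an arbitrary illustration, not a recommendation from this project.

```python
def tor_rotation_settings(change_after):
    """Return the rotation settings as a dict usable as a spider's custom_settings."""
    if not isinstance(change_after, int) or change_after < 1:
        raise ValueError("change_after must be a positive integer")
    return {
        "TOR_IPROTATOR_ENABLED": True,
        # Number of requests made from one IP before asking for another.
        "TOR_IPROTATOR_CHANGE_AFTER": change_after,
    }


# Arbitrary example value; pick one that suits the site you are crawling.
EXAMPLE_CUSTOM_SETTINGS = tor_rotation_settings(change_after=20)
```

In a spider class this would read custom_settings = tor_rotation_settings(change_after=20). The middleware entries still need to be enabled as shown in the first step.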

By default, an IP can be reused after 10 other uses. This value can be changed with the TOR_IPROTATOR_ALLOW_REUSE_IP_AFTER setting, as below:

TOR_IPROTATOR_ALLOW_REUSE_IP_AFTER = #

A large value can make it slower to obtain a new IP, or prevent a usable one from being found at all. If the value is 0, no record of used IPs is kept.
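
One way to picture the reuse rule is a fixed-size window of recently used IPs: an address becomes usable again once enough other addresses have pushed it out of the window. The class below is a hypothetical sketch of that idea and is not taken from this module's source, which may work differently.

```python
from collections import deque


class RecentIPs:
    """Remember the last `window` IPs; with window=0 nothing is remembered."""

    def __init__(self, window=10):
        # deque(maxlen=N) silently drops the oldest entry when a new one is added.
        self._recent = deque(maxlen=window) if window > 0 else None

    def is_allowed(self, ip):
        """An IP is allowed if no history is kept or it is outside the window."""
        return self._recent is None or ip not in self._recent

    def record(self, ip):
        """Note that `ip` has just been used."""
        if self._recent is not None:
            self._recent.append(ip)


# Example with a window of 2, using documentation-range addresses.
history = RecentIPs(window=2)
history.record("192.0.2.1")
history.record("192.0.2.2")
history.record("192.0.2.3")  # pushes 192.0.2.1 out of the window
```

After these three calls, history.is_allowed("192.0.2.1") is True again while the other two addresses are still blocked. The sketch also shows why a large window is costly: the more addresses are remembered, the more often a fresh circuit lands on an IP that has to be rejected.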

About

License: MIT License