This module allows Scrapy to rotate Tor IPs.
Install it with pip:
pip install scrapy-tor-proxy-rotation
Next, configure Tor. First, install it:
sudo apt-get install tor
Stop the service before editing its configuration:
sudo service tor stop
Open the configuration file, /etc/tor/torrc, as root, for example with nano:
sudo nano /etc/tor/torrc
Add the lines below and save:
ControlPort 9051
CookieAuthentication 0
Restart Tor:
sudo service tor start
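To illustrate what the ControlPort setting enables, here is a minimal sketch of asking Tor for a new identity over the control port with a raw socket. It assumes the configuration above (port 9051, no authentication); the function names are hypothetical and this is not necessarily how the middleware itself talks to Tor.

```python
import socket


def build_newnym_commands():
    """Control-port commands for an unauthenticated NEWNYM request."""
    # With no password and CookieAuthentication 0, AUTHENTICATE takes no argument.
    return [b"AUTHENTICATE\r\n", b"SIGNAL NEWNYM\r\n", b"QUIT\r\n"]


def request_new_identity(host="127.0.0.1", port=9051, timeout=10.0):
    """Send the commands; return True only if every reply starts with 250."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        with conn.makefile("rb") as reader:
            for command in build_newnym_commands():
                conn.sendall(command)
                reply = reader.readline()
                if not reply.startswith(b"250"):
                    return False
    return True
```

Calling request_new_identity() asks Tor to use fresh circuits for new connections. Tor rate-limits this signal, and a new circuit is not guaranteed to exit through a different IP, so the result is worth checking before relying on it.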
You can check your machine's IP and compare it with the IP seen through Tor as follows:
- To see your IP:
curl http://icanhazip.com/
- To see the IP of Tor:
torify curl http://icanhazip.com/
Scrapy needs an HTTP proxy as an intermediary to reach Tor; in this case, Privoxy.
Tor Default Proxy Server: 127.0.0.1:9050
- Install:
sudo apt install privoxy
- Stop the service:
sudo service privoxy stop
- Open the config file:
sudo nano /etc/privoxy/config
- Add the following line:
forward-socks5t / 127.0.0.1:9050 .
- Start the service:
sudo service privoxy start
Test:
torify curl http://icanhazip.com/
curl -x 127.0.0.1:8118 http://icanhazip.com/
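The same check can be sketched in Python using only the standard library. This assumes Privoxy is listening on 127.0.0.1:8118, as in the curl command above; the function names are hypothetical.

```python
import urllib.request


def build_proxy_opener(proxy="http://127.0.0.1:8118"):
    """Return an opener that sends HTTP and HTTPS requests through Privoxy."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)


def fetch_ip(opener, url="http://icanhazip.com/", timeout=30):
    """Return the public IP address reported by the lookup service."""
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode("ascii").strip()


def main():
    # An empty ProxyHandler bypasses any proxy, including environment settings.
    direct = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    print("direct:", fetch_ip(direct))
    print("via Privoxy/Tor:", fetch_ip(build_proxy_opener()))
```

Running main() should print two different addresses if traffic through Privoxy is really leaving via Tor.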
With this configuration in place, you can integrate Tor with Scrapy.
- Configure the middleware in your settings file (settings.py):
DOWNLOADER_MIDDLEWARES = {
    ...,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'tor_ip_rotator.middlewares.TorProxyMiddleware': 100
}
- Add these to custom_settings in your spider, or to settings.py if you want them to apply to every spider in the project:
TOR_IPROTATOR_ENABLED = True
TOR_IPROTATOR_CHANGE_AFTER = # number of requests made on the same IP address
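For the per-spider case, the settings might be collected into a custom_settings dictionary like the sketch below. The value 10 for TOR_IPROTATOR_CHANGE_AFTER is a hypothetical choice for illustration, not a documented default.

```python
# Sketch of a spider-level configuration; in a real project this dictionary
# would be the custom_settings class attribute of your spider.
custom_settings = {
    "DOWNLOADER_MIDDLEWARES": {
        "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 110,
        "tor_ip_rotator.middlewares.TorProxyMiddleware": 100,
    },
    "TOR_IPROTATOR_ENABLED": True,
    # Assumed value: rotate after 10 requests made on the same IP address.
    "TOR_IPROTATOR_CHANGE_AFTER": 10,
}
```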
By default, an IP can be reused after 10 other uses. This value can be changed with the TOR_IPROTATOR_ALLOW_REUSE_IP_AFTER setting, as below:
TOR_IPROTATOR_ALLOW_REUSE_IP_AFTER = #
A large value can make it slower to obtain a new usable IP. If the value is 0, no record of used IPs is kept.
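One way to picture the reuse rule is a fixed-size record of recently used IPs. The sketch below is only an illustration of that idea under assumed semantics; the RecentIPs class is hypothetical and is not taken from the module's source.

```python
from collections import deque


class RecentIPs:
    """Remember the last N used IPs; an IP is allowed again once it drops out."""

    def __init__(self, allow_reuse_after=10):
        # A deque with maxlen=0 discards every append, so nothing is recorded.
        self._recent = deque(maxlen=allow_reuse_after)

    def is_allowed(self, ip):
        """Return True if the IP is not among the recently used ones."""
        return ip not in self._recent

    def mark_used(self, ip):
        """Record the IP, evicting the oldest entry when the record is full."""
        self._recent.append(ip)
```

With a larger record, more candidate IPs are rejected as recently used, so more new identities may have to be requested before an acceptable one turns up, which is one plausible reason a large value slows rotation down.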