Scraping utilities. Proxy and user-agent retrievers and rotators.
Basically, it scrapes two web pages:
- developers.whatismybrowser for user-agent retrieval.
- free-proxy-list for proxy retrieval.
It obtains two sets: one of proxies (as ip:port) and another of user-agents.
These two sets are useful for scraping pages "safely" or, better said, for avoiding getting blocked.
The first time the program scrapes the sites, it stores the results in two files so that it does not have to rescrape them on every run. If you want to scrape the pages again, initialize the ProxyRotator and UserAgentRotator objects with the argument reload=True.
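The caching behaviour described above can be sketched as follows. This is only an illustration of the idea, not the code in utils.py: the helper name load_or_scrape, the JSON file format, and the scrape callable are assumptions made for the example.

```python
import json
from pathlib import Path


def load_or_scrape(cache_path, scrape, reload=False):
    """Return cached items, calling scrape() only when needed.

    cache_path and scrape are hypothetical names used for this sketch.
    """
    path = Path(cache_path)
    if path.exists() and not reload:
        # Reuse the stored results instead of hitting the site again.
        return set(json.loads(path.read_text()))
    items = set(scrape())
    # Sets are not JSON-serializable, so store them as a sorted list.
    path.write_text(json.dumps(sorted(items)))
    return items
```

With reload=False the scrape callable runs only when the file is missing; with reload=True it always runs and the file is overwritten.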
In utils.py two classes can be found: UserAgentRotator and ProxyRotator.
ProxyRotator only serves proxies that are listed as elite proxy and that support https.
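As a rough illustration of that filter, assume each scraped table row has already been parsed into a dictionary. The field names (ip, port, anonymity, https) and the sample rows are assumptions made for this sketch and may not match how utils.py represents the rows.

```python
def filter_proxies(rows):
    """Keep ip:port strings for elite proxies that support https."""
    return {
        f"{row['ip']}:{row['port']}"
        for row in rows
        if row["anonymity"] == "elite proxy" and row["https"] == "yes"
    }


# Made-up rows, only to show the filter in action.
rows = [
    {"ip": "103.216.82.153", "port": "6666", "anonymity": "elite proxy", "https": "yes"},
    {"ip": "10.0.0.1", "port": "8080", "anonymity": "anonymous", "https": "yes"},
    {"ip": "10.0.0.2", "port": "3128", "anonymity": "elite proxy", "https": "no"},
]
print(filter_proxies(rows))  # {'103.216.82.153:6666'}
```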
First import and initialize the class:
from utils import ProxyRotator
proxy_rotator = ProxyRotator()
After this, each call to proxy_server returns the ip:port of a random proxy:
proxy_rotator.proxy_server()
The output looks like {'https': '103.216.82.153:6666'}. The full list of proxies can be seen with proxy_rotator.proxies
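One way to picture what a call like proxy_server does is the sketch below. It assumes the proxies are kept as a set of ip:port strings. The function name pick_proxy is hypothetical, and the real method may select proxies differently.

```python
import random


def pick_proxy(proxies):
    """Return a random proxy in the dictionary form that requests expects."""
    # random.choice needs a sequence, so turn the set into a list first.
    address = random.choice(sorted(proxies))
    return {"https": address}
```

The dictionary shape matters because it can be passed directly as the proxies argument of requests.get.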
UserAgentRotator works in the same way as the proxy rotator. However, its initialization takes two additional options: max_pages and name_regex.
from utils import UserAgentRotator
user_rotator = UserAgentRotator(name_regex=r'(Linux|Computer|Chrome|Web Browser)', max_pages = 2)
name_regex chooses which kinds of user-agents are retrieved. The default is r'(Linux|Computer|Chrome|Web Browser)'. Take a look at the web page and change it to match the names of the user-agents that you want to retrieve. max_pages controls how many pages are scraped for each type of user-agent. Beware that a high number here implies a long scraping time.
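To see how a pattern like the default name_regex narrows the selection, here is a small sketch. The category names in the list are made up for the example, and the use of re.search is an assumption; utils.py may match names differently.

```python
import re


def select_categories(names, name_regex=r"(Linux|Computer|Chrome|Web Browser)"):
    """Keep only the category names that the pattern matches somewhere."""
    pattern = re.compile(name_regex)
    return [name for name in names if pattern.search(name)]


# Hypothetical category names, not taken from the real page.
names = ["Chrome", "Firefox", "Linux", "Android Phone"]
print(select_categories(names))  # ['Chrome', 'Linux']
```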
You can retrieve a user-agent randomly with:
user_rotator.user_agent_server()
Producing the following output: 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.30 (KHTML, like Gecko) Ubuntu/10.10 Chromium/12.0.742.112 Chrome/12.0.742.112 Safari/534.30'
from requests import get
# The URL needs an explicit scheme; without it requests raises MissingSchema.
req = get("https://www.google.com", proxies=proxy_rotator.proxy_server(), headers={'User-Agent': user_rotator.user_agent_server()})
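Free proxies fail often, so a single request like the one above may not succeed. A minimal sketch of retrying with a fresh proxy and user-agent on each attempt could look like this. The helper fetch_with_rotation and its parameters are hypothetical and not part of utils.py; the fetch argument is any callable that accepts proxies and headers keyword arguments, such as requests.get.

```python
def fetch_with_rotation(url, fetch, next_proxy, next_user_agent, max_attempts=3):
    """Try the request up to max_attempts times, rotating identity each time."""
    last_error = None
    for _ in range(max_attempts):
        # Ask for a new proxy and user-agent on every attempt.
        proxies = next_proxy()
        headers = {"User-Agent": next_user_agent()}
        try:
            return fetch(url, proxies=proxies, headers=headers)
        except Exception as error:  # deliberately broad in this sketch
            last_error = error
    raise RuntimeError(f"all {max_attempts} attempts failed") from last_error
```

With the objects created earlier, a call could be fetch_with_rotation("https://www.google.com", get, proxy_rotator.proxy_server, user_rotator.user_agent_server). Passing the fetch function as an argument keeps the helper independent of any particular HTTP library.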