Reading a list of users from a file and crawling their basic info
deenadayalans opened this issue · comments
I am trying to get the basic info of a user (name, alias, bio, number of followers, following, posts) by providing an input file with a list of Instagram profiles, and I implemented it using threads. But I randomly get a connection refused error even for valid profiles, possibly because of a limit set by Instagram.
Is there any way to overcome this and improve the code below?
import queue
import sys
import time
from threading import Thread

# init_chromedriver, extract_information, Settings, Datasaver and the
# exception classes come from the crawler's own modules (imports omitted).
concurrent = 4

def get_data_and_write_json():
    while True:
        username = q.get()
        print(username)
        browser = init_chromedriver(chrome_options, capabilities)
        information = []
        invalid_users_list = []
        connection_refused_user_list = []
        try:
            information = extract_information(browser, username, Settings.limit_amount)
        except NoInstaProfilePageFound:
            time.sleep(5)
            print("No profile is found with the user " + username)
            invalid_users_list.append(username)
            Datasaver.save_invalid_profiles_list(username)
            #sys.exit(1)
        except TooManyRequestsError:
            time.sleep(5)
            print("Connection is refused for user " + username)
            connection_refused_user_list.append(username)
            Datasaver.save_connection_refused_users_list(username)
        if username not in invalid_users_list:
            Datasaver.save_profile_json(username, information)
        browser.delete_all_cookies()
        browser.quit()
        # Mark the item done only after the browser is closed, so q.join()
        # does not return while a Chrome process is still running.
        q.task_done()

q = queue.Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=get_data_and_write_json)
    t.daemon = True
    t.start()
try:
    filename = get_input_file_name()
    with open(filename) as inp:
        for ip in inp:
            # Take the last non-empty path segment of the profile URL.
            # strip() also handles a final line without a trailing newline.
            username = ip.strip().rstrip("/").split("/")[-1]
            if username:
                q.put(username)
    q.join()
except KeyboardInterrupt:
    # Each worker owns its own browser, so there is no browser to quit here;
    # the daemon threads end together with the process.
    sys.exit(1)
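The step in the reading loop that turns an input line into a username can be pulled into a small function and checked on its own. This is only a sketch: `username_from_line` is a hypothetical helper, not part of the crawler, and it assumes each line holds either a profile URL or a bare username.

```python
def username_from_line(line):
    """Return the username from a profile URL or a bare username.

    Hypothetical helper; assumes one profile per input line.
    """
    # Drop surrounding whitespace and any query string such as "?hl=en".
    cleaned = line.strip().split("?", 1)[0]
    # Remove trailing slashes, then keep the last path segment.
    return cleaned.rstrip("/").split("/")[-1]


if __name__ == "__main__":
    for raw in ["https://www.instagram.com/some_user/\n", "other_user"]:
        print(username_from_line(raw))
```

A blank line yields an empty string, so the caller can skip it before putting anything on the queue.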
Which error did you get?
I get Too Many Requests errors (HTTP status code 429), and the crawler skips to the next profile. The same error occurs for a few valid profiles in a row, then it works normally again, and so on. Is it something to do with the above code, or is it because of a limit set on the server side?
"429 Too Many Requests" is the answer. You've been crawling too much from your IP address on Instagram (which is officially not allowed), so they block you.
possible solutions:
- crawl less
- use a different internet connection / server
- use proxy
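For the "crawl less" option, one common approach is to wait longer after each 429 instead of a fixed 5 seconds. The following is a rough sketch, not the crawler's own behaviour: `fetch_with_backoff` is a hypothetical wrapper, the `TooManyRequestsError` class here is a stand-in for the crawler's exception, and the delay values are arbitrary assumptions.

```python
import random
import time


class TooManyRequestsError(Exception):
    """Stand-in for the crawler's own 429 exception (assumption)."""


def fetch_with_backoff(fetch, username, max_attempts=5, base_delay=5.0,
                       max_delay=300.0, sleep=time.sleep):
    """Call fetch(username), retrying with exponential backoff on 429.

    The delay doubles after every failed attempt (5 s, 10 s, 20 s, ...),
    capped at max_delay. The last failure is re-raised to the caller.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(username)
        except TooManyRequestsError:
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

In the worker, a call like `fetch_with_backoff(lambda u: extract_information(browser, u, Settings.limit_amount), username)` could replace the direct call. Lowering the number of concurrent threads reduces the request rate as well.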
proxy:
replace
browser = webdriver.Chrome('./assets/chromedriver', chrome_options=chrome_options)
with
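from selenium.webdriver.common.proxy import Proxy, ProxyType

proxy_address = "123.456.789.000"  # placeholder, use your proxy's address
proxy_port = 8080
prox = Proxy()
# join() only accepts strings, so convert the integer port first
proxy = ":".join([proxy_address, str(proxy_port)])
prox.proxy_type = ProxyType.MANUAL
prox.http_proxy = proxy
# chromedriver may also require 'socksVersion' when 'socksProxy' is set;
# drop this line if the proxy is a plain HTTP proxy
prox.socks_proxy = proxy
prox.ssl_proxy = proxy
prox.add_to_capabilities(capabilities)
# pass the capabilities, otherwise the proxy settings are never applied
browser = webdriver.Chrome('./assets/chromedriver', chrome_options=chrome_options, desired_capabilities=capabilities)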
I was getting the exception below:
`Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "crawl_profile.py", line 64, in get_data_and_write_json
browser = init_chromedriver(chrome_options, capabilities)
File "/home/dell/work_deena/instagram-profilecrawl/util/chromedriver.py", line 16, in init_chromedriver
chrome_options=chrome_options)
File "/home/dell/.local/lib/python3.6/site-packages/selenium/webdriver/chrome/webdriver.py", line 81, in __init__
desired_capabilities=desired_capabilities)
File "/home/dell/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
self.start_session(capabilities, browser_profile)
File "/home/dell/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
response = self.execute(Command.NEW_SESSION, parameters)
File "/home/dell/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "/home/dell/.local/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.InvalidArgumentException: Message: invalid argument: cannot parse capability: proxy
from invalid argument: Specifying 'socksProxy' requires an integer for 'socksVersion'
(Driver info: chromedriver=2.45.615279 (12b89733300bd268cff3b78fc76cb8f3a7cc44e5),platform=Linux 4.15.0-45-generic x86_64)
`
I then switched to chromedriver version 2.42, based on a comment in this issue (https://github.com/timgrossmann/InstaPy/issues/3259), and tried again, but now I get the exception below:
Exception in thread Thread-3:
Traceback (most recent call last):
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "crawl_profile.py", line 69, in get_data_and_write_json
information = extract_information(browser, username, Settings.limit_amount)
File "/home/dell/work_deena/instagram-profilecrawl/util/extractor.py", line 226, in extract_information
web_adress_navigator(browser, user_link)
File "/home/dell/work_deena/instagram-profilecrawl/util/util.py", line 48, in web_adress_navigator
WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.ID, "viewport")))
File "/home/dell/.local/lib/python3.6/site-packages/selenium/webdriver/support/wait.py", line 80, in until
raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
Specifying 'socksProxy' requires an integer for 'socksVersion'
use proxy_port = 8080 instead of proxy_port = "8080"
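The error message itself suggests another thing to check: when `socksProxy` is present in the proxy capability, chromedriver expects an integer `socksVersion` next to it. A minimal sketch of a proxy capability that satisfies this, assuming the key names from the W3C WebDriver proxy object; `build_proxy_capability` is a hypothetical helper, the address is a documentation placeholder, and how the dict reaches the driver depends on the Selenium version and is not shown.

```python
def build_proxy_capability(host, port, socks_version=None):
    """Build a WebDriver 'proxy' capability dict (hypothetical helper).

    Without socks_version the proxy is treated as a plain HTTP(S) proxy and
    no socksProxy key is emitted. With socks_version (4 or 5) only the SOCKS
    keys are set, and socksVersion is always an integer.
    """
    address = "{}:{}".format(host, int(port))
    if socks_version is None:
        return {"proxyType": "manual",
                "httpProxy": address,
                "sslProxy": address}
    return {"proxyType": "manual",
            "socksProxy": address,
            "socksVersion": int(socks_version)}


if __name__ == "__main__":
    capabilities = {"browserName": "chrome"}
    capabilities["proxy"] = build_proxy_capability("203.0.113.10", 8080)
    print(capabilities)
```

For an HTTP proxy, leaving `socksProxy` out entirely avoids the `socksVersion` requirement.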
Tim, can I use the above proxy setting on Mac?