Reading a list of users from a file and crawling their basic info

Question

deenadayalans opened this issue 5 years ago · comments

I am trying to get the basic info of each user (name, alias, bio, number of followers, following, posts) from an input file containing a list of Instagram profiles, and I implemented it using threads. However, I randomly get a connection refused error, even for valid profiles, possibly because of a limit set by Instagram.

Is there any way to overcome this and improve the code below?

import queue
import sys
import time
from threading import Thread

# Project-specific names (init_chromedriver, extract_information, Settings,
# Datasaver, NoInstaProfilePageFound, TooManyRequestsError, chrome_options,
# capabilities, get_input_file_name) come from the crawler and are not shown.

concurrent = 4

def get_data_and_write_json():
    while True:
        username = q.get()
        print(username)
        browser = init_chromedriver(chrome_options, capabilities)
        information = []
        profile_found = True
        try:
            information = extract_information(browser, username, Settings.limit_amount)
        except NoInstaProfilePageFound:
            time.sleep(5)
            print("No profile is found with the user " + username)
            profile_found = False
            Datasaver.save_invalid_profiles_list(username)
        except TooManyRequestsError:
            time.sleep(5)
            print("Connection is refused for user " + username)
            Datasaver.save_connection_refused_users_list(username)
        finally:
            # Always close the browser, even if an unexpected error occurs.
            browser.delete_all_cookies()
            browser.quit()

        if profile_found:
            Datasaver.save_profile_json(username, information)

        # Mark the item done only after the browser is closed, so q.join()
        # does not return while a worker still holds a browser.
        q.task_done()

q = queue.Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=get_data_and_write_json)
    t.daemon = True
    t.start()

try:
    filename = get_input_file_name()
    with open(filename) as inp:
        for ip in inp:
            # Accept a bare username or a profile URL, with or without a trailing slash.
            username = ip.strip().rstrip("/").split("/")[-1]
            if username:
                q.put(username)
        q.join()

except KeyboardInterrupt:
    # The browsers belong to the daemon worker threads; none exists in this scope.
    sys.exit(1)

Timo Dörsching · Answer 1 · Thu Feb 21 2019 16:10:04 GMT+0800 (China Standard Time)

Which error did you get?

deenadayalans · Answer 2 · Thu Feb 21 2019 20:53:35 GMT+0800 (China Standard Time)

I get a Too Many Requests error (HTTP error code 429), and the crawler skips that profile and moves on to the next one. The same error occurs for a few valid profiles in a row, then it works normally again, and so on. Is this caused by the code above or by a limit set on the server side?

Timo Dörsching · Answer 3 · Fri Feb 22 2019 01:50:37 GMT+0800 (China Standard Time)

"429 Too Many Requests" is the answer. You've been crawling Instagram too much from your IP address (which is officially not allowed), so they block you.
Possible solutions:

crawl less
use a different internet connection / server
use a proxy
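
The "crawl less" option could be approached by retrying a refused profile with a growing pause instead of skipping it right away. The following is only a minimal sketch under assumptions: `fetch_with_backoff` is a hypothetical helper, the `TooManyRequestsError` class here is a stand-in for the crawler's own exception, and the delay values are illustrative, not limits published by Instagram.

```python
import random
import time


class TooManyRequestsError(Exception):
    """Stand-in for the crawler's own exception of the same name (assumption)."""


def fetch_with_backoff(fetch, username, max_attempts=4, base_delay=5.0, sleep=time.sleep):
    """Call fetch(username), waiting longer after each 429-style failure.

    Returns the fetched data, or None if every attempt was refused.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(username)
        except TooManyRequestsError:
            if attempt == max_attempts - 1:
                break  # out of attempts, give up on this profile
            # base_delay, then 2x, 4x, ... plus up to 1 s of jitter so that
            # several threads do not retry at the same moment.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            sleep(delay)
    return None
```

In the worker from the question, this might be called as `fetch_with_backoff(lambda u: extract_information(browser, u, Settings.limit_amount), username)`. Lowering the number of concurrent threads would likely help as well.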

proxy:

replace

browser = webdriver.Chrome('./assets/chromedriver', chrome_options=chrome_options)

with

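from selenium import webdriver
from selenium.webdriver.common.proxy import Proxy, ProxyType

proxy_address = "123.456.789.000"  # placeholder, use your proxy's address
proxy_port = 8080

prox = Proxy()
# join() needs strings, so convert the integer port first
proxy = ":".join([proxy_address, str(proxy_port)])
prox.proxy_type = ProxyType.MANUAL
prox.http_proxy = proxy
prox.socks_proxy = proxy
prox.ssl_proxy = proxy
prox.add_to_capabilities(capabilities)
# pass the capabilities so the proxy settings reach the browser
browser = webdriver.Chrome('./assets/chromedriver', chrome_options=chrome_options, desired_capabilities=capabilities)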

deenadayalans · Answer 4 · Fri Feb 22 2019 13:32:17 GMT+0800 (China Standard Time)

I got the exception below:


`Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "crawl_profile.py", line 64, in get_data_and_write_json
    browser = init_chromedriver(chrome_options, capabilities)
  File "/home/dell/work_deena/instagram-profilecrawl/util/chromedriver.py", line 16, in init_chromedriver
    chrome_options=chrome_options)
  File "/home/dell/.local/lib/python3.6/site-packages/selenium/webdriver/chrome/webdriver.py", line 81, in __init__
    desired_capabilities=desired_capabilities)
  File "/home/dell/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
    self.start_session(capabilities, browser_profile)
  File "/home/dell/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/home/dell/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/home/dell/.local/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.InvalidArgumentException: Message: invalid argument: cannot parse capability: proxy
from invalid argument: Specifying 'socksProxy' requires an integer for 'socksVersion'
  (Driver info: chromedriver=2.45.615279 (12b89733300bd268cff3b78fc76cb8f3a7cc44e5),platform=Linux 4.15.0-45-generic x86_64)

`

I then switched chromedriver to version 2.42, based on a comment in this issue (https://github.com/timgrossmann/InstaPy/issues/3259), and tried again, but now I get the exception below:


Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "crawl_profile.py", line 69, in get_data_and_write_json
    information = extract_information(browser,  username, Settings.limit_amount)
  File "/home/dell/work_deena/instagram-profilecrawl/util/extractor.py", line 226, in extract_information
    web_adress_navigator(browser, user_link)
  File "/home/dell/work_deena/instagram-profilecrawl/util/util.py", line 48, in web_adress_navigator
    WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.ID, "viewport")))
  File "/home/dell/.local/lib/python3.6/site-packages/selenium/webdriver/support/wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:

Timo Dörsching · Answer 5 · Wed Feb 27 2019 19:36:37 GMT+0800 (China Standard Time)

Specifying 'socksProxy' requires an integer for 'socksVersion'

Use proxy_port = 8080 instead of proxy_port = "8080".
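
Since the error message names 'socksVersion' rather than the port, another thing that may be worth trying is to leave out the SOCKS entry when the proxy is a plain HTTP(S) proxy, or to supply an integer SOCKS version when it really is a SOCKS proxy. The sketch below only builds the "proxy" capability dictionary by hand. `build_proxy_capability` is a hypothetical helper, the address is a documentation placeholder, and the key names follow the W3C WebDriver proxy object; whether a given chromedriver version accepts exactly this form has not been verified here.

```python
def build_proxy_capability(host, port, socks_version=None):
    """Return a 'proxy' capability dict for a manually configured proxy.

    Describes an HTTP(S) proxy by default. Pass socks_version (4 or 5)
    to describe a SOCKS proxy instead.
    """
    address = "{}:{}".format(host, int(port))
    if socks_version is None:
        # No socksProxy key, so no socksVersion is needed.
        return {"proxyType": "manual", "httpProxy": address, "sslProxy": address}
    # socksProxy must be accompanied by an integer socksVersion.
    return {"proxyType": "manual", "socksProxy": address, "socksVersion": int(socks_version)}


# Hypothetical usage: attach the proxy to a capabilities dict before
# handing it to the driver.
capabilities = {"browserName": "chrome"}
capabilities["proxy"] = build_proxy_capability("203.0.113.10", 8080)
```

With the `Proxy` object from the earlier answer, the equivalent change would presumably be to drop the `prox.socks_proxy = proxy` line for an HTTP proxy.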

Emmanuel · Answer 6 · Thu Aug 08 2019 01:54:46 GMT+0800 (China Standard Time)

Tim, can I use the above proxy setting on Mac?