bglynch / scrapy-upwork


scrapy-upwork

About

This is a single-spider Scrapy project.

Given a username, password, and secret answer, this spider logs in to upwork.com and scrapes data from the /find-jobs/ page.

The spider handles the zero-resistance login flow as well as the reCAPTCHA and secret-answer pages. The secret-answer page is bypassed automatically, while the reCAPTCHA page requires user input: if a reCAPTCHA page appears, the spider pauses, and once the user has solved the reCAPTCHA in the browser, pressing Enter in the terminal resumes the spider.
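The pause-and-resume step described above could be sketched along these lines. This is illustrative, not the project's actual code: the function names and the `g-recaptcha` page marker are assumptions.

```python
def looks_like_recaptcha(page_source: str) -> bool:
    """Heuristically detect a reCAPTCHA challenge in the rendered HTML.

    The marker strings are assumptions; a real check would target whatever
    element Upwork actually renders for the challenge.
    """
    markers = ("g-recaptcha", "recaptcha/api.js")
    return any(marker in page_source for marker in markers)


def wait_for_manual_solve(prompt_fn=input) -> None:
    """Block the spider until the user confirms the CAPTCHA is solved.

    `prompt_fn` is injectable so the pause can be exercised without a TTY.
    """
    prompt_fn("reCAPTCHA detected - solve it in the browser, then press Enter... ")
```

After each Selenium page load, the spider could check `looks_like_recaptcha(driver.page_source)` and call `wait_for_manual_solve()` before continuing.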

If the spider stalls on a page, the error is caught via Selenium timeouts. When Selenium raises a timeout error, a screenshot of the page is taken and its HTML is downloaded; both are saved to the scrapy-upwork/output/errors/ directory for debugging.
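The screenshot-plus-HTML dump might look roughly like this. It is a sketch: the helper name and the `_FakeDriver` stub are invented for illustration, and only the output directory comes from the README. `save_screenshot` and `page_source` are standard Selenium WebDriver members.

```python
import os
import tempfile
import time


def save_error_artifacts(driver, out_dir="output/errors"):
    """Dump a screenshot and the page HTML for post-mortem debugging."""
    os.makedirs(out_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    png_path = os.path.join(out_dir, f"error-{stamp}.png")
    html_path = os.path.join(out_dir, f"error-{stamp}.html")
    driver.save_screenshot(png_path)  # Selenium WebDriver API
    with open(html_path, "w", encoding="utf-8") as fh:
        fh.write(driver.page_source)  # rendered HTML at failure time
    return png_path, html_path


class _FakeDriver:
    """Stand-in for a Selenium WebDriver, so the sketch runs without a browser."""

    page_source = "<html><body>stalled page</body></html>"

    def save_screenshot(self, path):
        with open(path, "wb") as fh:
            fh.write(b"\x89PNG stub")
        return True


# Demo against a temp directory instead of output/errors/.
png, html = save_error_artifacts(_FakeDriver(), out_dir=tempfile.mkdtemp())
```

In the spider itself, this helper would sit in an `except TimeoutException:` block around the Selenium waits.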

For notes on the approach taken to building this spider, see doc/approach.md.

Stack

  • Python 3.9
  • scrapy
  • selenium
  • pydantic
  • pycountry

Running locally

terminal
git clone https://github.com/bglynch/scrapy-upwork.git
cd scrapy-upwork/

# create virtual env and install dependencies
virtualenv -p python3.9 venv
source venv/bin/activate
pip install -r requirements.txt

# rename secret file
mv argyle_challenge/secret.sample.py argyle_challenge/secret.py

Edit secret.py to use your own login credentials.

scrapy-upwork/argyle_challenge/secret.py
username = 'your-username'
password = 'your-password'
secret_answer = 'your-secret-answer'

Run the spider.

terminal
cd argyle_challenge/
scrapy crawl upwork

If the run succeeds, the scraped data is exported to scrapy-upwork/output/data/jobs.json.
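The export is a plain JSON feed, so it can be read back with the standard library. The field names in the sample below are purely illustrative; the README does not document the actual item schema.

```python
import json
import tempfile
from pathlib import Path


def load_jobs(path="output/data/jobs.json"):
    """Load the exported job items (a list of dicts) from the JSON feed."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Demo with a temp file standing in for output/data/jobs.json;
# "title" is a hypothetical field, not the project's real schema.
demo_path = Path(tempfile.mkdtemp()) / "jobs.json"
demo_path.write_text(json.dumps([{"title": "Example job"}]), encoding="utf-8")
jobs = load_jobs(demo_path)
```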
