Issue with Capybara Gem - Scraping Blocked on Indeed Site

Question

ror-web-expert opened this issue 8 months ago

Problem:
I'm facing an issue with my Rails application that involves scraping data from different sites using the Capybara gem. Everything works fine for most sites, but I'm encountering a problem specifically with Indeed.

Description:
When I attempt to scrape data from Indeed with the headless option set to true, I get blocked. However, when I set the headless option to false, the scraping works fine. Upon inspecting the screenshot generated by @session.save_screenshot, it clearly indicates that I've been blocked.

Steps to Reproduce:

1. Set headless: true in browser options.
2. Attempt to scrape data from Indeed.
3. Observe the blocking issue.
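For reference, the steps above might correspond to a setup along these lines. This is only a sketch: the issue does not say which driver is in use, so Selenium with Chrome is an assumption, and the helper names (chrome_arguments, build_session), the driver names, and the window-size flag are all hypothetical. It only shows where the headless switch lives; it does nothing about the blocking itself.

```ruby
# Hypothetical helper: returns the Chrome command-line flags for a given mode.
# "--headless=new" is Chrome's current headless flag; the window size is an
# arbitrary illustrative value.
def chrome_arguments(headless:)
  args = ["--window-size=1280,800"]
  args.unshift("--headless=new") if headless
  args
end

# Hypothetical helper: registers a Capybara driver for the requested mode and
# returns a standalone session. Assumes the capybara and selenium-webdriver
# gems are installed; they are required lazily so that chrome_arguments can be
# used on its own.
def build_session(headless:)
  require "capybara"
  require "selenium-webdriver"

  driver_name = headless ? :repro_chrome_headless : :repro_chrome_headed

  Capybara.register_driver(driver_name) do |app|
    options = Selenium::WebDriver::Chrome::Options.new
    chrome_arguments(headless: headless).each { |arg| options.add_argument(arg) }
    Capybara::Selenium::Driver.new(app, browser: :chrome, options: options)
  end

  Capybara::Session.new(driver_name)
end
```

With this in place, the comparison described in the issue would be to call build_session(headless: true) and build_session(headless: false), visit the same page with each session, and call save_screenshot on both to compare what was rendered.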

Expected Behavior:
Scraping should work seamlessly with headless mode enabled, just as it does for other sites.

Environment:

Rails Version: 7
Capybara Version: 3.39.2
Nokogiri Version: 1.15.4-x86_64-linux

Additional Information:

Adding a proxy service did not resolve the issue.
The problem seems specific to the interaction between Indeed and Capybara with headless mode.

Workaround:

Setting headless: false resolves the blocking issue, but this is not an ideal solution.

Request for Assistance:
I'm seeking guidance on potential solutions or workarounds to enable headless scraping for Indeed without being blocked.
Any insights or recommendations would be greatly appreciated.

Thank you for your assistance!

Thomas Walpole · Answer 1 · Sat Jan 27 2024 05:34:27 GMT+0800 (China Standard Time)

I fail to see how this is an issue with Capybara. Capybara is a tool for testing web apps, not a scraping tool that actively hides itself from sites. The fact that you're using it to abuse the terms of service of Indeed is not something we can help you with.