urllib.error.HTTPError: HTTP Error 404: Not Found
Macodemia opened this issue
Hello,
I have cloned the project and created a config file using the corresponding Python script.
Scraping wg-gesucht and immowelt works perfectly fine.
However, when scraping kleinanzeigen, I receive the error:
urllib.error.HTTPError: HTTP Error 404: Not Found
Below is the full stack trace:
/Users/jack/.local/share/virtualenvs/flathunter-Uqle8Kpe/bin/python /Users/jack/Development/flathunter/flathunt.py
[2024/02/29 21:04:21|config.py |INFO ]: Using config path /Users/jack/Development/flathunter/config.yaml
[2024/02/29 21:04:21|chrome_wrapper.py |INFO ]: Initializing Chrome WebDriver for crawler...
Traceback (most recent call last):
  File "/Users/jack/Development/flathunter/flathunt.py", line 99, in <module>
    main()
  File "/Users/jack/Development/flathunter/flathunt.py", line 95, in main
    launch_flat_hunt(config, heartbeat)
  File "/Users/jack/Development/flathunter/flathunt.py", line 35, in launch_flat_hunt
    hunter.hunt_flats()
  File "/Users/jack/Development/flathunter/flathunter/hunter.py", line 56, in hunt_flats
    for expose in processor_chain.process(self.crawl_for_exposes(max_pages)):
  File "/Users/jack/Development/flathunter/flathunter/hunter.py", line 35, in crawl_for_exposes
    return chain(*[try_crawl(searcher, url, max_pages)
  File "/Users/jack/Development/flathunter/flathunter/hunter.py", line 35, in <listcomp>
    return chain(*[try_crawl(searcher, url, max_pages)
  File "/Users/jack/Development/flathunter/flathunter/hunter.py", line 27, in try_crawl
    return searcher.crawl(url, max_pages)
  File "/Users/jack/Development/flathunter/flathunter/abstract_crawler.py", line 151, in crawl
    return self.get_results(url, max_pages)
  File "/Users/jack/Development/flathunter/flathunter/abstract_crawler.py", line 139, in get_results
    soup = self.get_page(search_url)
  File "/Users/jack/Development/flathunter/flathunter/crawler/kleinanzeigen.py", line 56, in get_page
    return self.get_soup_from_url(search_url, driver=self.get_driver())
  File "/Users/jack/Development/flathunter/flathunter/crawler/kleinanzeigen.py", line 44, in get_driver
    self.driver = get_chrome_driver(driver_arguments)
  File "/Users/jack/Development/flathunter/flathunter/chrome_wrapper.py", line 69, in get_chrome_driver
    driver = uc.Chrome(version_main=chrome_version, options=chrome_options)  # pylint: disable=no-member
  File "/Users/jack/.local/share/virtualenvs/flathunter-Uqle8Kpe/lib/python3.10/site-packages/undetected_chromedriver/__init__.py", line 258, in __init__
    self.patcher.auto()
  File "/Users/jack/.local/share/virtualenvs/flathunter-Uqle8Kpe/lib/python3.10/site-packages/undetected_chromedriver/patcher.py", line 178, in auto
    self.unzip_package(self.fetch_package())
  File "/Users/jack/.local/share/virtualenvs/flathunter-Uqle8Kpe/lib/python3.10/site-packages/undetected_chromedriver/patcher.py", line 287, in fetch_package
    return urlretrieve(download_url)[0]
  File "/Users/jack/anaconda3/lib/python3.10/urllib/request.py", line 241, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/Users/jack/anaconda3/lib/python3.10/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
  File "/Users/jack/anaconda3/lib/python3.10/urllib/request.py", line 525, in open
    response = meth(req, response)
  File "/Users/jack/anaconda3/lib/python3.10/urllib/request.py
I found the related issues #538 and #439, where the problem seems to be related to the ChromeDriver version.
Since in my case flathunter works for wg-gesucht and immowelt,
I assume that this issue is different and may be specific to kleinanzeigen.
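For what it's worth, the 404 is raised from `patcher.fetch_package`, i.e. while undetected-chromedriver tries to download a chromedriver matching the installed browser. To check by hand which release endpoint knows about Chrome 121, here is a minimal sketch; the exact endpoints are assumptions based on how different undetected-chromedriver versions have historically located drivers, and `chromedriver_release_urls` is a hypothetical helper:

```python
def chromedriver_release_urls(major: int) -> dict[str, str]:
    """Build the two LATEST_RELEASE lookup URLs that different
    undetected-chromedriver versions have used (assumed, may differ
    slightly from what the installed library actually requests)."""
    return {
        # Legacy storage bucket: it stopped receiving releases after
        # Chrome 114, so asking it about Chrome 121 yields HTTP 404.
        "legacy": f"https://chromedriver.storage.googleapis.com/LATEST_RELEASE_{major}",
        # Newer "Chrome for Testing" endpoint used by current releases.
        "chrome_for_testing": f"https://googlechromelabs.github.io/chrome-for-testing/LATEST_RELEASE_{major}",
    }

for name, url in chromedriver_release_urls(121).items():
    print(name, url)
```

Requesting the legacy URL for major version 121 (e.g. with curl) returns 404, which matches the trace, while the Chrome-for-Testing URL resolves; so if the installed undetected-chromedriver is old enough to still use the legacy endpoint, upgrading the library may fix this.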
chromedriver --version
ChromeDriver 121.0.6167.184 (057a8ae7deb3374d0f1b04b36304d236f0136188-refs/branch-heads/6167@{#1818})
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --version
Google Chrome 121.0.6167.184
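For reference, the major version that presumably ends up in `version_main` can be pulled out of that `--version` output roughly like this (a sketch only; how `chrome_wrapper.py` actually parses it is an assumption, and `parse_chrome_major` is a hypothetical helper):

```python
import re

def parse_chrome_major(version_output: str) -> int:
    """Extract the major version (e.g. 121) from a `--version` string."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"unrecognised version string: {version_output!r}")
    return int(match.group(1))

# Both the browser and the driver above report major version 121,
# so they are in sync on this machine.
print(parse_chrome_major("Google Chrome 121.0.6167.184"))  # -> 121
print(parse_chrome_major("ChromeDriver 121.0.6167.184"))   # -> 121
```

Since both report 121, a browser/driver mismatch on the machine itself seems unlikely, which points back at the driver download step as the failing part.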
Here is my config file:
# Enable verbose mode (print DEBUG log messages)
# verbose: true

# Should the bot endlessly loop through the URLs?
# Between each loop it waits for <sleeping_time> seconds.
# Note that Ebay will (temporarily) block your IP if you
# poll too often - don't lower this below 600 seconds if you
# are crawling Ebay.
loop:
  active: yes
  sleeping_time: 600

# Location of the database to store already-seen offerings.
# Defaults to the current directory.
#database_location: /path/to/database

# List the URLs containing your filter properties below.
# Currently supported services: www.immobilienscout24.de,
# www.immowelt.de, www.wg-gesucht.de, www.kleinanzeigen.de, meinestadt.de and vrm-immo.de.
# List the URLs in the following format:
# urls:
#   - https://www.immobilienscout24.de/Suche/...
#   - https://www.wg-gesucht.de/...
urls:
  - https://www.kleinanzeigen.de/s-wohnung-mieten/schoeneberg/c203l3443
  #- https://www.wg-gesucht.de/wohnungen-in-Muenchen.90.2.1.0.html
  #- https://www.immowelt.de/suche/berlin/wohnungen/mieten?d=true&pma=1200&rmi=2&sd=DESC&sf=TIMESTAMP&sp=1

# Define filters to exclude flats that don't meet your criteria.
# Supported filters include 'max_rooms', 'min_rooms', 'max_size', 'min_size',
# 'max_price', 'min_price', and 'excluded_titles'.
#
# 'excluded_titles' takes a list of regex patterns that match against
# the title of the flat. Any matching titles will be excluded.
# More on Python regex here: https://docs.python.org/3/library/re.html
#
# Example:
# filters:
#   excluded_titles:
#     - "wg"
#     - "zwischenmiete"
#   min_price: 700
#   max_price: 1000
#   min_size: 50
#   max_size: 80
#   max_price_per_square: 1000
filters:

# There are often city districts in the address which
# Google Maps does not like. Use this blacklist to remove
# districts from the search.
#
# blacklist:
#   - Innenstadt

# If an expose includes an address, the bot is capable of
# displaying the distance and time to travel (duration) to
# some configured other addresses, for specific kinds of
# travel.
#
# Available kinds of travel ('gm_id') can be found in the
# Google Maps API documentation, but basically there are:
#   - "bicycling"
#   - "transit" (public transport)
#   - "driving"
#   - "walking"
#
# The example configuration below includes a place for
# "John", located at the main train station of Munich.
# Two kinds of travel (bicycle and transit) are requested,
# each with a different label. Furthermore, a place for
# "Jane" is included, located at the given destination and
# with the same kinds of travel.
# durations:
#   - name: John
#     destination: Hauptbahnhof, München
#     modes:
#       - gm_id: transit
#         title: "Öff."
#       - gm_id: bicycling
#         title: "Rad"
#   - name: Jane
#     destination: Karlsplatz, München
#     modes:
#       - gm_id: transit
#         title: "Öff."
#       - gm_id: driving
#         title: "Auto"

# Multiline message (yes, the | is supposed to be there),
# to format the message received from the Telegram bot.
#
# Available placeholders:
#   - {title}: The title of the expose
#   - {rooms}: Number of rooms
#   - {price}: Price for the flat
#   - {durations}: Durations calculated by GMaps, see above
#   - {url}: URL to the expose
message: |
  {title}
  Zimmer: {rooms}
  Größe: {size}
  Preis: {price}
  Ort: {address}
  {url}

# Calculating durations requires access to the Google Maps API.
# Below you can configure the URL to access the API, with placeholders.
# The URL should most probably just be kept like that.
# To use the Google Maps API, an API key is required. You can obtain one
# free of charge from the Google App Console (just google for it).
# Additionally, to enable the API calls in the code, set the 'enable' key to True.
#
# google_maps_api:
#   key: YOUR_API_KEY
#   url: https://maps.googleapis.com/maps/api/distancematrix/json?origins={origin}&destinations={dest}&mode={mode}&sensor=true&key={key}&arrival_time={arrival}
#   enable: False

# If you are planning to scrape immoscout24.de, the bot will need
# to circumvent the site's captcha protection by using a captcha
# solving service. Register at either Imagetyperz or 2Captcha
# (the former is preferred), deposit some funds, uncomment the
# corresponding lines below and insert your API key/token.
# Use driver_arguments to provide options for Chrome WebDriver.
# captcha:
#   imagetyperz:
#     token: alskdjaskldjfklj
#   2captcha:
#     api_key: alskdjaskldjfklj
#   driver_arguments:
#     - "--headless"
captcha:

# You can select whether to be notified by telegram, apprise or by mattermost
# or Slack webhooks. For all notifiers selected here a configuration must be
# provided below.
# notifiers:
#   - telegram
#   - apprise
#   - mattermost
#   - slack
notifiers:
  - telegram

# Sending messages using Telegram requires a Telegram bot to be configured.
# Telegram.org offers good documentation on how to create a bot.
# Once you read it, this will make sense. Still: bot_token should hold the
# access token of your bot and receiver_ids should list the client ids
# of receivers. Note that those receivers are required to already have
# started a conversation with your bot.
#
# telegram:
#   bot_token: 160165XXXXXXX....
#   notify_with_images: true
#   receiver_ids:
#     - 12345....
#     - 67890....
telegram:
  bot_token: 6896489191:AAGvdqFTdJWUDHhT6qOzWSSZhrJ23WZkopg
  receiver_ids:
    - '16861054'

# Sending messages via Mattermost requires a webhook URL provided by a
# Mattermost server. You can find a description of how to set up a webhook in
# the official Mattermost documentation:
# https://docs.mattermost.com/developer/webhooks-incoming.html
# mattermost:
#   webhook_url: https://mattermost.example.com/signup_user_complete/?id=abcdef12356
mattermost:

# Sending messages using Apprise requires an Apprise URL.
# Apprise allows you to send notifications to a wide variety of services.
# You can find a description of how to set up an Apprise URL in the official
# documentation: https://github.com/caronc/apprise
# Signal notifications are documented here: https://github.com/caronc/apprise/wiki/Notify_signal
#
# apprise:
#   - gotifys://...
#   - mailto://..
#   - signal://localhost:9922/{FromPhoneNo}
apprise:

# Sending messages to a Slack channel requires a webhook URL. You can find
# a guide on how to set up a Slack webhook in the official documentation:
# https://api.slack.com/messaging/webhooks
#
# slack:
#   webhook_url: https://hooks.slack.com/services/T00000000/B00000000/XXXXXX...
slack:

# If you are running the web interface, you can configure Login with Telegram support.
# Follow the instructions here to register your domain with the Telegram bot:
# https://core.telegram.org/widgets/login
#
# website:
#   bot_name: bot_name_xxx
#   domain: flathunter.example.com
#   session_key: SomeSecretValue
#   listen:
#     host: 127.0.0.1
#     port: 8080

# If you are deploying to Google Cloud,
# uncomment this and set it to your project id. More info in the README.
# google_cloud_project_id: my-flathunters-project-id

# For websites like idealista.it, there are anti-crawler measures that can be
# circumvented using proxies.
# use_proxy_list: True

# If you are having bot detection issues with immobilienscout24,
# you can set the cookie that you get from your logged-in account.
# Go to the immobilienscout24.de website, log in, and then in the developer tools
# (F12) go to the "Network" tab, then "Cookies", and copy the value of the
# "reese84" cookie.
immoscout_cookie: ''
I'd appreciate any help with this!
Please let me know if any further information is required.