urllib.error.HTTPError: HTTP Error 404: Not Found
Macodemia opened this issue
Hello,
I have cloned the project and created a config file using the corresponding Python script.
Scraping wg-gesucht and immowelt works perfectly fine.
However, when scraping kleinanzeigen, I receive the error:
urllib.error.HTTPError: HTTP Error 404: Not Found
Below is the full stack trace:
/Users/jack/.local/share/virtualenvs/flathunter-Uqle8Kpe/bin/python /Users/jack/Development/flathunter/flathunt.py
[2024/02/29 21:04:21|config.py |INFO ]: Using config path /Users/jack/Development/flathunter/config.yaml
[2024/02/29 21:04:21|chrome_wrapper.py |INFO ]: Initializing Chrome WebDriver for crawler...
Traceback (most recent call last):
  File "/Users/jack/Development/flathunter/flathunt.py", line 99, in <module>
    main()
  File "/Users/jack/Development/flathunter/flathunt.py", line 95, in main
    launch_flat_hunt(config, heartbeat)
  File "/Users/jack/Development/flathunter/flathunt.py", line 35, in launch_flat_hunt
    hunter.hunt_flats()
  File "/Users/jack/Development/flathunter/flathunter/hunter.py", line 56, in hunt_flats
    for expose in processor_chain.process(self.crawl_for_exposes(max_pages)):
  File "/Users/jack/Development/flathunter/flathunter/hunter.py", line 35, in crawl_for_exposes
    return chain(*[try_crawl(searcher, url, max_pages)
  File "/Users/jack/Development/flathunter/flathunter/hunter.py", line 35, in <listcomp>
    return chain(*[try_crawl(searcher, url, max_pages)
  File "/Users/jack/Development/flathunter/flathunter/hunter.py", line 27, in try_crawl
    return searcher.crawl(url, max_pages)
  File "/Users/jack/Development/flathunter/flathunter/abstract_crawler.py", line 151, in crawl
    return self.get_results(url, max_pages)
  File "/Users/jack/Development/flathunter/flathunter/abstract_crawler.py", line 139, in get_results
    soup = self.get_page(search_url)
  File "/Users/jack/Development/flathunter/flathunter/crawler/kleinanzeigen.py", line 56, in get_page
    return self.get_soup_from_url(search_url, driver=self.get_driver())
  File "/Users/jack/Development/flathunter/flathunter/crawler/kleinanzeigen.py", line 44, in get_driver
    self.driver = get_chrome_driver(driver_arguments)
  File "/Users/jack/Development/flathunter/flathunter/chrome_wrapper.py", line 69, in get_chrome_driver
    driver = uc.Chrome(version_main=chrome_version, options=chrome_options)  # pylint: disable=no-member
  File "/Users/jack/.local/share/virtualenvs/flathunter-Uqle8Kpe/lib/python3.10/site-packages/undetected_chromedriver/__init__.py", line 258, in __init__
    self.patcher.auto()
  File "/Users/jack/.local/share/virtualenvs/flathunter-Uqle8Kpe/lib/python3.10/site-packages/undetected_chromedriver/patcher.py", line 178, in auto
    self.unzip_package(self.fetch_package())
  File "/Users/jack/.local/share/virtualenvs/flathunter-Uqle8Kpe/lib/python3.10/site-packages/undetected_chromedriver/patcher.py", line 287, in fetch_package
    return urlretrieve(download_url)[0]
  File "/Users/jack/anaconda3/lib/python3.10/urllib/request.py", line 241, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/Users/jack/anaconda3/lib/python3.10/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
  File "/Users/jack/anaconda3/lib/python3.10/urllib/request.py", line 525, in open
    response = meth(req, response)
  File "/Users/jack/anaconda3/lib/python3.10/urllib/request.py
I found the related issues #538 and #439, where the problem seems to be related to the ChromeDriver version.
Since in my case flathunter works for wg-gesucht and immowelt,
I assume that this issue is different and may be specific to kleinanzeigen.
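For what it's worth, the 404 is raised from `patcher.fetch_package`, i.e. while undetected-chromedriver tries to download a chromedriver matching the installed browser. To check by hand which release endpoint knows about Chrome 121, here is a minimal sketch; the exact endpoints are assumptions based on how different undetected-chromedriver versions have historically located drivers, and `chromedriver_release_urls` is a hypothetical helper:

```python
def chromedriver_release_urls(major: int) -> dict[str, str]:
    """Build the two LATEST_RELEASE lookup URLs that different
    undetected-chromedriver versions have used (assumed, may differ
    slightly from what the installed library actually requests)."""
    return {
        # Legacy storage bucket: it stopped receiving releases after
        # Chrome 114, so asking it about Chrome 121 yields HTTP 404.
        "legacy": f"https://chromedriver.storage.googleapis.com/LATEST_RELEASE_{major}",
        # Newer "Chrome for Testing" endpoint used by current releases.
        "chrome_for_testing": f"https://googlechromelabs.github.io/chrome-for-testing/LATEST_RELEASE_{major}",
    }

for name, url in chromedriver_release_urls(121).items():
    print(name, url)
```

Requesting the legacy URL for major version 121 (e.g. with curl) returns 404, which matches the trace, while the Chrome-for-Testing URL resolves; so if the installed undetected-chromedriver is old enough to still use the legacy endpoint, upgrading the library may fix this.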
chromedriver --version
ChromeDriver 121.0.6167.184 (057a8ae7deb3374d0f1b04b36304d236f0136188-refs/branch-heads/6167@{#1818})
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --version
Google Chrome 121.0.6167.184
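For reference, the major version that presumably ends up in `version_main` can be pulled out of that `--version` output roughly like this (a sketch only; how `chrome_wrapper.py` actually parses it is an assumption, and `parse_chrome_major` is a hypothetical helper):

```python
import re

def parse_chrome_major(version_output: str) -> int:
    """Extract the major version (e.g. 121) from a `--version` string."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"unrecognised version string: {version_output!r}")
    return int(match.group(1))

# Both the browser and the driver above report major version 121,
# so they are in sync on this machine.
print(parse_chrome_major("Google Chrome 121.0.6167.184"))  # -> 121
print(parse_chrome_major("ChromeDriver 121.0.6167.184"))   # -> 121
```

Since both report 121, a browser/driver mismatch on the machine itself seems unlikely, which points back at the driver download step as the failing part.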
Here is my config file:
# Enable verbose mode (print DEBUG log messages)
# verbose: true

# Should the bot endlessly loop through the URLs?
# Between each loop it waits for <sleeping_time> seconds.
# Note that Ebay will (temporarily) block your IP if you
# poll too often - don't lower this below 600 seconds if you
# are crawling Ebay.
loop:
  active: yes
  sleeping_time: 600

# Location of the database to store already-seen offerings.
# Defaults to the current directory.
#database_location: /path/to/database

# List the URLs containing your filter properties below.
# Currently supported services: www.immobilienscout24.de,
# www.immowelt.de, www.wg-gesucht.de, www.kleinanzeigen.de, meinestadt.de and vrm-immo.de.
# List the URLs in the following format:
# urls:
#   - https://www.immobilienscout24.de/Suche/...
#   - https://www.wg-gesucht.de/...
urls:
  - https://www.kleinanzeigen.de/s-wohnung-mieten/schoeneberg/c203l3443
  #- https://www.wg-gesucht.de/wohnungen-in-Muenchen.90.2.1.0.html
  #- https://www.immowelt.de/suche/berlin/wohnungen/mieten?d=true&pma=1200&rmi=2&sd=DESC&sf=TIMESTAMP&sp=1

# Define filters to exclude flats that don't meet your criteria.
# Supported filters include 'max_rooms', 'min_rooms', 'max_size', 'min_size',
# 'max_price', 'min_price', and 'excluded_titles'.
#
# 'excluded_titles' takes a list of regex patterns that match against
# the title of the flat. Any matching titles will be excluded.
# More on Python regex here: https://docs.python.org/3/library/re.html
#
# Example:
# filters:
#   excluded_titles:
#     - "wg"
#     - "zwischenmiete"
#   min_price: 700
#   max_price: 1000
#   min_size: 50
#   max_size: 80
#   max_price_per_square: 1000
filters:

# There are often city districts in the address which
# Google Maps does not like. Use this blacklist to remove
# districts from the search.
#
# blacklist:
#   - Innenstadt

# If an expose includes an address, the bot is capable of
# displaying the distance and time to travel (duration) to
# some configured other addresses, for specific kinds of
# travel.
#
# Available kinds of travel ('gm_id') can be found in the
# Google Maps API documentation, but basically there are:
#   - "bicycling"
#   - "transit" (public transport)
#   - "driving"
#   - "walking"
#
# The example configuration below includes a place for
# "John", located at the main train station of Munich.
# Two kinds of travel (bicycle and transit) are requested,
# each with a different label. Furthermore, a place for
# "Jane" is included, located at the given destination and
# with the same kinds of travel.
# durations:
#   - name: John
#     destination: Hauptbahnhof, München
#     modes:
#       - gm_id: transit
#         title: "Öff."
#       - gm_id: bicycling
#         title: "Rad"
#   - name: Jane
#     destination: Karlsplatz, München
#     modes:
#       - gm_id: transit
#         title: "Öff."
#       - gm_id: driving
#         title: "Auto"

# Multiline message (yes, the | is supposed to be there),
# to format the message received from the Telegram bot.
#
# Available placeholders:
#   - {title}: The title of the expose
#   - {rooms}: Number of rooms
#   - {price}: Price for the flat
#   - {durations}: Durations calculated by GMaps, see above
#   - {url}: URL to the expose
message: |
  {title}
  Zimmer: {rooms}
  Größe: {size}
  Preis: {price}
  Ort: {address}
  {url}

# Calculating durations requires access to the Google Maps API.
# Below you can configure the URL to access the API, with placeholders.
# The URL should most probably just be kept like that.
# To use the Google Maps API, an API key is required. You can obtain one
# free of charge from the Google App Console (just google for it).
# Additionally, to enable the API calls in the code, set the 'enable' key to True.
#
# google_maps_api:
#   key: YOUR_API_KEY
#   url: https://maps.googleapis.com/maps/api/distancematrix/json?origins={origin}&destinations={dest}&mode={mode}&sensor=true&key={key}&arrival_time={arrival}
#   enable: False

# If you are planning to scrape immoscout24.de, the bot will need
# to circumvent the site's captcha protection by using a captcha
# solving service. Register at either Imagetyperz or 2Captcha
# (the former is preferred), deposit some funds, uncomment the
# corresponding lines below and insert your API key/token.
# Use driver_arguments to provide options for Chrome WebDriver.
# captcha:
#   imagetyperz:
#     token: alskdjaskldjfklj
#   2captcha:
#     api_key: alskdjaskldjfklj
#   driver_arguments:
#     - "--headless"
captcha:

# You can select whether to be notified by telegram, apprise or by mattermost
# or Slack webhooks. For all notifiers selected here a configuration must be
# provided below.
# notifiers:
#   - telegram
#   - apprise
#   - mattermost
#   - slack
notifiers:
  - telegram

# Sending messages using Telegram requires a Telegram bot to be configured.
# Telegram.org offers good documentation on how to create a bot.
# Once you read it, this will make sense. Still: bot_token should hold the
# access token of your bot and receiver_ids should list the client ids
# of receivers. Note that those receivers are required to already have
# started a conversation with your bot.
#
# telegram:
#   bot_token: 160165XXXXXXX....
#   notify_with_images: true
#   receiver_ids:
#     - 12345....
#     - 67890....
telegram:
  bot_token: 6896489191:AAGvdqFTdJWUDHhT6qOzWSSZhrJ23WZkopg
  receiver_ids:
    - '16861054'

# Sending messages via Mattermost requires a webhook URL provided by a
# Mattermost server. You can find a description of how to set up a webhook in
# the official Mattermost documentation:
# https://docs.mattermost.com/developer/webhooks-incoming.html
# mattermost:
#   webhook_url: https://mattermost.example.com/signup_user_complete/?id=abcdef12356
mattermost:

# Sending messages using Apprise requires an Apprise URL.
# Apprise allows you to send notifications to a wide variety of services.
# You can find a description of how to set up an Apprise URL in the official
# documentation: https://github.com/caronc/apprise
# Signal notifications are documented here: https://github.com/caronc/apprise/wiki/Notify_signal
#
# apprise:
#   - gotifys://...
#   - mailto://..
#   - signal://localhost:9922/{FromPhoneNo}
apprise:

# Sending messages to a Slack channel requires a webhook URL. You can find
# a guide on how to set up a Slack webhook in the official documentation:
# https://api.slack.com/messaging/webhooks
#
# slack:
#   webhook_url: https://hooks.slack.com/services/T00000000/B00000000/XXXXXX...
slack:

# If you are running the web interface, you can configure Login with Telegram support.
# Follow the instructions here to register your domain with the Telegram bot:
# https://core.telegram.org/widgets/login
#
# website:
#   bot_name: bot_name_xxx
#   domain: flathunter.example.com
#   session_key: SomeSecretValue
#   listen:
#     host: 127.0.0.1
#     port: 8080

# If you are deploying to Google Cloud,
# uncomment this and set it to your project id. More info in the README.
# google_cloud_project_id: my-flathunters-project-id

# For websites like idealista.it, there are anti-crawler measures that can be
# circumvented using proxies.
# use_proxy_list: True

# If you are having bot detection issues with immobilienscout24,
# you can set the cookie that you get from your logged-in account.
# Go to the immobilienscout24.de website, log in, and then in the developer tools
# (F12) go to the "Network" tab, then "Cookies", and copy the value of the
# "reese84" cookie.
immoscout_cookie: ''
I'd appreciate any help with this!
Please let me know if any further information is required.