justalphie / immo-eliza-scraping-immozila


Immozilla


📖 Description

This Python project uses web scraping to compile a dataset of real estate properties in Belgium. We scraped Immoweb to gather information on more than 10,000 houses and apartments for sale across the country.

The outcome of this project provides us with the following headers in our files:

  • property_id
  • locality_name
  • postal_code
  • street_name
  • house_number
  • latitude
  • longitude
  • property_type (house or apartment)
  • property_subtype (bungalow, chalet, mansion, ...)
  • price
  • type_of_sale (note: exclude life sales)
  • number_of_rooms
  • living_area (living area in m²)
  • kitchen_type
  • fully_equipped_kitchen (0/1)
  • furnished (0/1)
  • open_fire (0/1)
  • terrace
  • terrace_area (area in m² or null if no terrace)
  • garden
  • garden_area (area in m² or null if no garden)
  • surface_of_good
  • number_of_facades
  • swimming_pool (0/1)
  • state_of_building (new, to be renovated, ...)
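As an illustration of this schema, a single scraped record can be represented as a dict and collected into a pandas DataFrame. The field names below follow the headers listed above; the values are hypothetical.

```python
import pandas as pd

# Hypothetical example record; field names follow the dataset headers above.
record = {
    "property_id": 11223344,
    "locality_name": "Brussels",
    "postal_code": "1000",
    "property_type": "apartment",
    "property_subtype": "penthouse",
    "price": 350000,
    "type_of_sale": "residential_sale",
    "number_of_rooms": 2,
    "living_area": 95,           # m²
    "fully_equipped_kitchen": 1,
    "furnished": 0,
    "open_fire": 0,
    "terrace": 1,
    "terrace_area": 12,          # m², or None if no terrace
    "garden": 0,
    "garden_area": None,         # null: no garden
    "swimming_pool": 0,
    "state_of_building": "good",
}

# One row per property; pandas turns None into NaN for numeric columns.
df = pd.DataFrame([record])
print(df[["property_id", "locality_name", "price"]])
```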

🛠 Installation

  • Clone the repo
    git clone git@github.com:NathNacht/immo-eliza-scraping-immozila.git
  • Install the libraries listed in requirements.txt
    pip install -r requirements.txt
  • Run the script
    python3 main.py

You will be asked to specify the number of pages to be scraped. Fill in a number.
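For instance, the prompt could be handled along these lines (a sketch, not the repo's exact code; `ask_pages` is a hypothetical helper):

```python
def ask_pages(raw):
    """Parse the number of search pages to scrape; fall back to 1 on bad input."""
    try:
        return max(1, int(raw))
    except ValueError:
        return 1

# In main.py this would wrap input("Number of pages to scrape: ").
print(ask_pages("5"))
print(ask_pages("abc"))
```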

  • The output will be stored in ./data/cleaned/clean.csv
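Once a run completes, the cleaned dataset can be inspected with pandas. The snippet below writes a tiny stand-in file to the same path so it is self-contained; after a real run you would only need the `read_csv` part.

```python
import pandas as pd
from pathlib import Path

# Stand-in for the scraper's output, written to the path the project uses.
out = Path("./data/cleaned/clean.csv")
out.parent.mkdir(parents=True, exist_ok=True)
pd.DataFrame({"property_id": [1, 2], "price": [250000, 420000]}).to_csv(out, index=False)

# Inspect the cleaned dataset.
df = pd.read_csv(out)
print(df.shape)
print(df["price"].mean())
```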

👾 Workflow

main

graph TD;
    A["multiWeblinks()"]-->B[Store in weblinks] 
    B--> C["write_json()"];
    C-->D["PropertyScraper(url)"]-->E[Will be stored in scrape_url];
    E-->F["scrape_url.scrape_property_info()"];
    F-->G[Check if house is FOR SALE] 
    G-->H[Fill up dictionary with data];
    H-->I[write to pandas dataframe];
    I-->J["to_csv()"];
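The workflow above can be sketched in a few lines of Python. The real `multiWeblinks()` and `PropertyScraper` live in the `scraper/` package; the stand-ins below only mimic their roles (collect listing URLs, save them as JSON, scrape each listing, filter to FOR SALE, then write a DataFrame to CSV).

```python
import json
import pandas as pd

def multi_weblinks(pages):
    # Stand-in for multiWeblinks(): would return listing URLs per search page.
    return [f"https://www.immoweb.be/en/classified/{i}" for i in range(pages * 2)]

def write_json(links, path="links.json"):
    # Persist the discovered links, as in the diagram's write_json() step.
    with open(path, "w") as f:
        json.dump(links, f)

def scrape_property_info(url):
    # Stand-in for PropertyScraper(url).scrape_property_info():
    # would fetch the page and return a dict of property fields.
    return {"property_id": url.rsplit("/", 1)[-1],
            "type_of_sale": "FOR SALE",
            "price": 300000}

links = multi_weblinks(pages=1)
write_json(links)

rows = [scrape_property_info(url) for url in links]
rows = [r for r in rows if r["type_of_sale"] == "FOR SALE"]  # keep sales only

df = pd.DataFrame(rows)
df.to_csv("clean_sketch.csv", index=False)
print(len(df))
```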

🚀 Usage

The scraper first discovers the links to all property listings and saves them to JSON files. Each link is then processed to extract the required information, which is assembled into a DataFrame. Finally, the data is written to a CSV file.

🤖 Project File structure

├── data
│   ├── cleaned
│   └── raw
├── example_data
├── scraper
│   ├── scraper.py
│   └── threathimmolinks.py
├── .gitignore
├── main.py
├── README.md
└── requirements.txt

πŸ” Contributors

📜 Timeline

This project was created in 5 days.
