kvnyu24 / real-estate-scraper


Deecamp scrapers written with Scrapy.

Installation

  • Run `pip install -r requirements.txt`
  • Install and set up MongoDB on port 27017
  • Fill in your own parameter specs for the Aliyun APIs
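
A quick way to sanity-check the prerequisites (a minimal sketch, assuming pymongo and oss2 are installed via requirements.txt; the credentials and endpoint are placeholders for your own Aliyun values):

```python
import oss2
import pymongo

# MongoDB must be reachable on the default port 27017
client = pymongo.MongoClient("mongodb://localhost:27017", serverSelectionTimeoutMS=2000)
client.admin.command("ping")  # raises if the server is not running

# Aliyun OSS credentials (placeholders -- substitute your own values)
auth = oss2.Auth("your-access-key-id", "your-access-key-secret")
bucket = oss2.Bucket(auth, "https://oss-cn-beijing.aliyuncs.com", "your-bucket-name")
print(bucket.bucket_name, "configured")
```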

Usage

  • Scrape data from webpages
    • Sample
      • Run `scrapy crawl ZillowBuySpider -o try_zillow_buy.json` in the main directory
    • Customization
      • Create an item file that defines the fields used to store scraped data
      • Create a spider that contains the main crawling logic
      • Run `scrapy crawl SPIDERNAME -o test.json` to test the spider
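
        For example, a minimal item and spider might look like this (a hypothetical sketch; the class names, URL, and CSS selectors are illustrative, not taken from this repo):

        ```python
        # items.py -- hypothetical item; field names are illustrative
        import scrapy

        class ListingItem(scrapy.Item):
            title = scrapy.Field()
            price = scrapy.Field()
            url = scrapy.Field()

        # spiders/example_spider.py -- hypothetical spider
        class ExampleSpider(scrapy.Spider):
            name = "ExampleSpider"
            start_urls = ["https://example.com/listings"]

            def parse(self, response):
                # yield one item per listing card on the page
                for listing in response.css("div.listing"):
                    yield ListingItem(
                        title=listing.css("h2::text").get(),
                        price=listing.css("span.price::text").get(),
                        url=response.urljoin(listing.css("a::attr(href)").get()),
                    )
        ```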
    • Supported sites
      • zillow (buy/rent)
      • fangtianxia (ershoufang / second-hand homes, zuxiezilou / office rentals, shouxiezilou / office sales)
      • baiduqianxi (city/province-wise in/out flows, immigration/emigration, internal flow)
      • qqheat (xingyun heat map)
      • beikezufang (ershoufang / second-hand homes, xinfang / new homes, zufang / rentals)
    • API (bucket names can easily be changed in the source code)
      • Base API

        • method: crawl
          • customizable crawling; uploads files to OSS and saves to MongoDB
        • method: get
          • downloads JSON files from OSS to target directories for use
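
        As an illustration of this crawl/get pattern (a hypothetical sketch -- the repo's actual base class is not shown here; the credentials, endpoint, and database names are placeholders):

        ```python
        import json

        import oss2
        import pymongo

        class BaseAPI:
            """Sketch of a crawl/get pair backed by OSS and MongoDB."""

            def __init__(self, bucket_name):
                auth = oss2.Auth("your-access-key-id", "your-access-key-secret")
                self.bucket = oss2.Bucket(auth, "https://oss-cn-beijing.aliyuncs.com", bucket_name)
                self.mongo = pymongo.MongoClient("mongodb://localhost:27017")["scraper"][bucket_name]

            def fetch(self):
                raise NotImplementedError  # subclasses implement the actual scraping

            def crawl(self, timestamp):
                # scrape once, then persist to both OSS and MongoDB
                record = self.fetch()
                self.mongo.insert_one(dict(record, timestamp=timestamp))
                self.bucket.put_object(f"{timestamp}.json", json.dumps(record, ensure_ascii=False))

            def get(self, timestamp, save_dir):
                # download one stored JSON snapshot from OSS to a local directory
                self.bucket.get_object_to_file(f"{timestamp}.json", f"{save_dir}/{timestamp}.json")
        ```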
      • QQXingyuntu API

        ```python
        from api.qqheat_api import QQHeatAPI
        # outside of this folder:
        # from deecamp_scraper.api.qqheat_api import QQHeatAPI

        qqheat = QQHeatAPI(30)  # time_between=30s
        # begin crawling
        qqheat.crawl()
        # get(timestamp_a, timestamp_b, save_dir)
        qqheat.get("2020_07_11", "2020_07_12", "tmp")
        ```
      • Beikezufang API

        ```python
        from api.ke_api import KeAPI
        # outside of this folder:
        # from deecamp_scraper.api.ke_api import KeAPI

        ke_try = KeAPI(5, "北京", 30)  # page_num=5, location="北京" (Beijing), time_between=30s
        # begin crawling
        ke_try.crawl()
        # get(timestamp_a, timestamp_b, bucket, save_dir)
        ke_try.get("2020_07_11", "2020_07_12", "Esf", "tmp")
        ```
      • Baiduqianxi API

        ```python
        from api.baiduqx_api import BaiduQXAPI
        # outside of this folder:
        # from deecamp_scraper.api.baiduqx_api import BaiduQXAPI

        baiduqx = BaiduQXAPI(30)  # time_between=30s
        # begin crawling
        baiduqx.crawl()
        # get(timestamp_a, timestamp_b, bucket, save_dir)
        baiduqx.get("2020_07_11", "2020_07_12", "Io", "tmp")
        ```
