magic7721 / Crawler

Tutorial

  • in settings.py, set the WEBSITE value to 'NIPS', 'ICML', 'ICLR' or 'CVPR' (corresponding to the databases currently in the folder)

  • run the function sqlite_query() in /spiders/__init__.py

  • it behaves like a minimal SQLite shell: just type SQL queries as you would against SQLite (a sketch of what such a helper wraps is shown below)
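
The sketch below is not the repository's code; it only illustrates what such a query helper wraps, using Python's standard sqlite3 module. The database file name and the query are placeholders.

```python
# Hypothetical sketch only: opening one of the bundled SQLite databases
# directly with Python's sqlite3 module, which is roughly what a helper
# like sqlite_query() wraps. The file name "ICLR.db" is a placeholder;
# use the actual database file shipped in the folder.
import sqlite3

conn = sqlite3.connect("ICLR.db")
cursor = conn.cursor()

# List the tables created by schema.py, then run any ordinary SQL query.
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cursor.fetchall())

conn.close()
```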

Building a new database from the website

  • in settings.py, change the WEBSITE and YEAR values (the default website is 'ICLR' (iclr.cc) and the default years are [2021, 2022]); see the sketch after this list

  • run the function build_database() in /spiders/__init__.py

  • WARNING: this will DELETE the existing database if the names match
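
As a rough illustration (the values are only the defaults stated above), the edited settings might look like this:

```python
# Hypothetical sketch of the two values described above, as they might
# appear in settings.py; 'ICLR' and [2021, 2022] are the stated defaults.
WEBSITE = 'ICLR'       # one of 'NIPS', 'ICML', 'ICLR', 'CVPR'
YEAR = [2021, 2022]    # conference years to crawl
```

After saving settings.py, call build_database() from /spiders/__init__.py as described above; it rebuilds (and overwrites) the database whose name matches.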

Explanation of files

  • website_crawler.py: web crawler built with the Scrapy library; scrapes the site for data (a minimal spider sketch follows this list)

  • search_api.py: uses the arXiv and Wikipedia APIs to search for extra info

  • schema.py: builds the relation tables

  • GUI.py: graphical interface

  • test.py: miscellaneous test code
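
For orientation only, here is a minimal Scrapy spider of the kind website_crawler.py defines; every name in it is a placeholder rather than the repository's actual code.

```python
# Minimal Scrapy spider sketch, only to illustrate the kind of code
# website_crawler.py contains. The class name, spider name, start URL,
# and CSS selector are placeholders, not the repository's actual code.
import scrapy

class PaperSpider(scrapy.Spider):
    name = "papers"                    # hypothetical spider name
    start_urls = ["https://iclr.cc/"]  # placeholder start page

    def parse(self, response):
        # Yield one item per title-like element found on the page.
        for title in response.css("h3::text").getall():
            yield {"title": title.strip()}
```

A spider like this is what the "scrapy crawl" command in the next section runs.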

Using Scrapy in the terminal

  • for "scrapy crawl" command, open terminal in /spiders

  • for anything else, open a terminal in the root folder
