WangCHX / Crawler

A list of the files in this submission and what they do.

===========

HOW TO RUN

  1. Open a terminal.
  2. Navigate to the directory containing "Crawler.py".
  3. Run "python Crawler.py 'query' n"

query: the search keywords
n: the total number of pages to download
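
For example, a hypothetical invocation (the query string and page count below are purely illustrative):

    python Crawler.py 'web crawling' 100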

===========

FILE LIST:

  1. Crawler.py

The entry point of the crawler. Given a query and a number n, it first queries Google and takes the top-10 results (each assigned the highest priority score, 1000), then crawls outward from those seeds in a focused-crawling manner until all n pages have been downloaded.
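
As an illustration, here is a minimal sketch of such a focused-crawling loop. Only the Google top-10 seeding and the priority score of 1000 come from the description above; extract_links and score_link are simplified stand-ins for whatever the real code does:

    import heapq
    import re
    import urllib.request

    def extract_links(html):
        """Very rough absolute-href extraction; a real crawler would use an HTML parser."""
        return re.findall(r'href=[\'"]?(https?://[^\'" >]+)', html)

    def score_link(url, keywords):
        """Hypothetical priority: count of query keywords appearing in the URL."""
        return sum(1 for kw in keywords if kw.lower() in url.lower())

    def focused_crawl(seeds, keywords, n):
        # Max-heap via negated scores; the Google top-10 seeds get priority 1000.
        frontier = [(-1000, url) for url in seeds]
        heapq.heapify(frontier)
        seen = set(seeds)
        downloaded = 0
        while frontier and downloaded < n:
            _, url = heapq.heappop(frontier)
            try:
                html = urllib.request.urlopen(url, timeout=5).read().decode('utf-8', 'ignore')
            except Exception:
                continue  # skip unreachable pages
            downloaded += 1
            # Push newly discovered links, highest-scoring first.
            for link in extract_links(html):
                if link not in seen:
                    seen.add(link)
                    heapq.heappush(frontier, (-score_link(link, keywords), link))
        return downloaded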

  2. CheckUrl.py

The CheckUrl.validifyUrl function normalizes URLs, e.g., deleting a trailing index/main/default page from the end of a URL.
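
A minimal sketch of this kind of normalization (the rules beyond stripping index/main/default, such as lower-casing the host and dropping query strings, are assumptions):

    from urllib.parse import urlparse, urlunparse

    def validify_url(url):
        """Normalize a URL so that, e.g., http://a.com/index.html
        and http://a.com/ compare equal."""
        parts = urlparse(url)
        path = parts.path
        last = path.rsplit('/', 1)[-1].lower()
        # Treat common default pages as equivalent to the directory itself.
        if last.split('.')[0] in ('index', 'main', 'default'):
            path = path[:len(path) - len(last)]
        # Dropping query strings and fragments is an illustrative choice here.
        return urlunparse((parts.scheme, parts.netloc.lower(), path, '', '', ''))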

  3. Crawlable.py

Given a URL, returns its root site and decides, based on the site's robots.txt, whether the URL may be crawled.
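
A minimal sketch using Python's standard urllib.robotparser (the actual code in Crawlable.py may differ):

    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser

    def root_site(url):
        """Root site of a URL, e.g. http://example.com/a/b -> http://example.com."""
        parts = urlparse(url)
        return f'{parts.scheme}://{parts.netloc}'

    def can_crawl(url, agent='*'):
        """Consult the site's robots.txt to decide whether url may be fetched."""
        rp = RobotFileParser()
        rp.set_url(root_site(url) + '/robots.txt')
        try:
            rp.read()
        except Exception:
            return True  # no readable robots.txt: assume allowed
        return rp.can_fetch(agent, url)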

  4. CheckContent.py

Checks whether two pages with different URLs have near-duplicate content. (There are some known bugs; it is not 100% correct.)
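
A minimal sketch of such a check, assuming SimHash.py exposes simhash and hamming_distance functions like the ones sketched under item 5 below (the function names and the threshold of 3 are assumptions):

    from SimHash import simhash, hamming_distance  # assumed names, see item 5

    def is_near_duplicate(text_a, text_b, threshold=3):
        """Two pages count as near-duplicates when their SimHash
        fingerprints differ in at most `threshold` bit positions."""
        return hamming_distance(simhash(text_a), simhash(text_b)) <= threshold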

  5. SimHash.py

The SimHash function and the Hamming distance function used by CheckContent.py.
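
A minimal sketch of the two functions (a standard unweighted SimHash over word tokens with 64-bit MD5-based hashes; the real tokenization, weighting, and hash width may differ):

    import hashlib

    def simhash(text, bits=64):
        """Standard SimHash: sum per-bit votes of token hashes, keep the sign."""
        votes = [0] * bits
        for token in text.lower().split():
            h = int(hashlib.md5(token.encode('utf-8')).hexdigest(), 16)
            for i in range(bits):
                votes[i] += 1 if (h >> i) & 1 else -1
        return sum(1 << i for i in range(bits) if votes[i] > 0)

    def hamming_distance(a, b):
        """Number of differing bit positions between two fingerprints."""
        return bin(a ^ b).count('1')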
