A list of the files in this submission and what they do.
===========
HOW TO RUN
- Open a terminal.
- Change to the directory containing "Crawler.py".
- Run: python Crawler.py 'query' n
  query: the search keywords
  n: the total number of pages to download
===========
FILE LIST:
- Crawler.py:
The entry point of the crawler. Given a query and a number n, it first connects to Google and retrieves the top-10 results (each seeded with the highest priority score, 1000), then crawls outward from those results in a focused-crawling manner until n pages in total have been downloaded (a sketch of this loop appears at the end of this file).
- CheckUrl.py
The CheckUrl.validifyUrl function normalizes a URL, e.g. by removing a trailing index/main/default page name from the end of the URL (sketched at the end of this file).
- Crawlable.py
Given a URL, returns its root site and decides whether the URL may be crawled according to the site's robots.txt (sketched at the end of this file).
- CheckContent.py
Checks whether two pages with different URLs have near-duplicate content. (This module still has bugs and is not 100% correct.)
- SimHash.py
The SimHash function and the Hamming-distance function used by CheckContent.py (sketched at the end of this file).
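===========
ILLUSTRATIVE SKETCHES

The snippets below are rough illustrations of the techniques the files
describe, not the submission's actual code.

Focused-crawl loop (Crawler.py): a minimal sketch of the priority-queue
crawl described above. score_link is a hypothetical stand-in for
whatever priority function Crawler.py really uses, and the fetching and
link extraction are likewise simplified.

    import heapq
    import re
    import urllib.request

    def score_link(url, query):
        # Hypothetical scoring: count query keywords that appear in the URL.
        return sum(1 for word in query.split() if word.lower() in url.lower())

    def crawl(seed_urls, query, n):
        # heapq is a min-heap, so scores are negated to pop the best link first.
        frontier = [(-1000, url) for url in seed_urls]  # seeds: priority 1000
        heapq.heapify(frontier)
        downloaded = set()
        while frontier and len(downloaded) < n:
            _, url = heapq.heappop(frontier)
            if url in downloaded:
                continue
            try:
                html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "replace")
            except OSError:
                continue
            downloaded.add(url)
            for link in re.findall(r'href="(https?://[^"]+)"', html):
                if link not in downloaded:
                    heapq.heappush(frontier, (-score_link(link, query), link))
        return downloaded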
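URL normalization (CheckUrl.py): a rough sketch of the kind of rule
validifyUrl applies, based on the description above; the real rules may
differ.

    from urllib.parse import urlsplit, urlunsplit

    def normalize_url(url):
        parts = urlsplit(url)
        path = parts.path
        # Drop a trailing index/main/default page so that, e.g.,
        # http://example.com/index.html and http://example.com/ compare equal.
        last = path.rsplit("/", 1)[-1].lower()
        if last.split(".")[0] in ("index", "main", "default"):
            path = path[: len(path) - len(last)]
        # Lowercase scheme and host, and strip any fragment.
        return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                           path, parts.query, ""))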
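robots.txt check (Crawlable.py): a sketch using the standard library's
robotparser; Crawlable.py may implement the check differently.

    from urllib.parse import urlsplit
    from urllib.robotparser import RobotFileParser

    def root_site(url):
        parts = urlsplit(url)
        return parts.scheme + "://" + parts.netloc

    def can_crawl(url, agent="*"):
        rp = RobotFileParser(root_site(url) + "/robots.txt")
        rp.read()  # fetch and parse the site's robots.txt
        return rp.can_fetch(agent, url)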
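SimHash and Hamming distance (SimHash.py): a compact 64-bit SimHash over
whitespace tokens plus a Hamming-distance helper; the submission's hash
width and tokenization may differ.

    import hashlib

    def simhash(text, bits=64):
        # Each token votes +1/-1 on every bit position of its hash;
        # the fingerprint keeps the bits with a positive total.
        weights = [0] * bits
        for token in text.split():
            h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
            for i in range(bits):
                weights[i] += 1 if (h >> i) & 1 else -1
        return sum(1 << i for i in range(bits) if weights[i] > 0)

    def hamming_distance(a, b):
        # Number of bit positions where the two fingerprints differ.
        return bin(a ^ b).count("1")

Two pages are typically treated as near-duplicates when the distance
between their fingerprints is small, e.g.
hamming_distance(simhash(p1), simhash(p2)) <= 3.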