cutewindy / COEN-272---webCrawler

Course project: crawls websites starting from a seed URL and records word statistics

COEN-272---webCrawler

## Highlights

  • input => seed URL, output => word stats/rank computed from the crawled web pages
  • Crawlers - BFS, BlockingQueue, Multi-threaded
  • URL filtering - Bloom Filter (TODO)
  • Page filtering - SimHash (TODO)
  • Information retrieval - Tag/Token counts
  • Word stats/rank - Zipf's law
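
The last two highlights (token counts and Zipf's law ranking) can be illustrated with a minimal sketch. This is not the project's actual code: the class `WordStats`, its methods `countTokens` and `rank`, and the tokenization rule (lowercase, split on anything that is not a letter or digit) are all assumptions made for illustration. Zipf's law says a word's frequency is roughly inversely proportional to its rank, so the product rank × frequency should stay roughly constant across a large corpus.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not taken from the repository.
final class WordStats {

    /** One row of the rank table: rank r, the word, and its frequency f. */
    static final class RankedWord {
        final int rank;
        final String word;
        final int frequency;

        RankedWord(int rank, String word, int frequency) {
            this.rank = rank;
            this.word = word;
            this.frequency = frequency;
        }

        /** Under Zipf's law this product is roughly constant across ranks. */
        long rankTimesFrequency() {
            return (long) rank * frequency;
        }
    }

    /** Counts tokens; assumes lowercase ASCII letters and digits form a token. */
    static Map<String, Integer> countTokens(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Sorts by descending frequency (ties broken alphabetically) and assigns ranks from 1. */
    static List<RankedWord> rank(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<RankedWord> ranked = new ArrayList<>();
        for (int i = 0; i < entries.size(); i++) {
            ranked.add(new RankedWord(i + 1, entries.get(i).getKey(), entries.get(i).getValue()));
        }
        return ranked;
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from crawled pages.
        String pageText = "The cat and the dog, and THE bird.";
        for (RankedWord w : rank(countTokens(pageText))) {
            System.out.printf("%d\t%s\t%d\t%d%n",
                    w.rank, w.word, w.frequency, w.rankTimesFrequency());
        }
    }
}
```

On a sample this small the rank × frequency column will not look constant; the pattern only becomes visible over a large number of crawled pages.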

Languages

Java: 100.0%