jacygao / spiderman

A simple web crawler that crawls a website n links deep and calculates the number of unique rendered words found on each page and in total.

Spiderman

One-time setup

Install Gumbo (https://github.com/google/gumbo-parser) from source:

  git clone https://github.com/google/gumbo-parser.git
  cd gumbo-parser
  ./autogen.sh
  ./configure
  make
  sudo make install

On macOS with Homebrew, run this instead:

  brew install gumbo-parser

Clone the Spiderman repository:

  git clone https://github.com/JacyGao/spiderman.git
  cd spiderman

To compile Spiderman, run:

  tools/all.sh

To run Spiderman, pass the start URL and the crawl depth:

  ./a.out {url} {depth}

For example:

  ./a.out http://www.ea.com 1
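
How it works (illustrative sketch)

The idea of crawling n links deep and counting unique words per page and in total can be sketched as a breadth-first traversal with a depth limit. The code below is not taken from the Spiderman sources. It is a minimal sketch that assumes pages are already fetched and parsed into an in-memory map, so no HTTP or Gumbo calls appear. The names Page, CrawlResult, unique_words and crawl are hypothetical, and so are the choices to lowercase words, to split on non-alphanumeric characters, and to treat depth 0 as "start page only".

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <queue>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a fetched and parsed page: its rendered text
// and the URLs it links to. A real crawler would fill this from HTTP + Gumbo.
struct Page {
    std::string text;
    std::vector<std::string> links;
};

// Split text into lowercase words. Any non-alphanumeric character is
// treated as a separator (an assumption, not the project's actual rule).
std::set<std::string> unique_words(const std::string& text) {
    std::set<std::string> words;
    std::string current;
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            current += static_cast<char>(std::tolower(c));
        } else if (!current.empty()) {
            words.insert(current);
            current.clear();
        }
    }
    if (!current.empty()) words.insert(current);
    return words;
}

struct CrawlResult {
    std::map<std::string, std::size_t> per_page;  // URL -> unique word count
    std::set<std::string> all_words;              // unique words over all visited pages
};

// Breadth-first crawl over an in-memory site, following links at most
// max_depth hops away from the start URL. Each URL is visited once.
CrawlResult crawl(const std::map<std::string, Page>& site,
                  const std::string& start, int max_depth) {
    assert(max_depth >= 0);
    CrawlResult result;
    std::set<std::string> visited;
    visited.insert(start);
    std::queue<std::pair<std::string, int> > frontier;
    frontier.push(std::make_pair(start, 0));

    while (!frontier.empty()) {
        std::pair<std::string, int> item = frontier.front();
        frontier.pop();

        std::map<std::string, Page>::const_iterator it = site.find(item.first);
        if (it == site.end()) continue;  // unknown URL: treated like a failed fetch

        std::set<std::string> words = unique_words(it->second.text);
        result.per_page[item.first] = words.size();
        result.all_words.insert(words.begin(), words.end());

        if (item.second == max_depth) continue;  // depth limit reached, do not expand
        for (const std::string& link : it->second.links) {
            if (visited.insert(link).second) {
                frontier.push(std::make_pair(link, item.second + 1));
            }
        }
    }
    return result;
}
```

With this sketch, a site where page a links to b and b links to c yields counts for a and b only when crawl is called with a depth of 1. The total is the size of all_words, so a word that appears on several pages is counted once.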
