jwambugu / crawler

Web crawler built using Go.

Task

Implement a recursive, mirroring web crawler. The crawler should be a command-line tool that accepts a starting URL and a destination directory. The crawler then downloads the page at the URL, saves it in the destination directory, and recursively proceeds to any valid links on that page.

A valid link is the value of an href attribute in an <a> tag that resolves to a URL that is a child of the initial URL.

For example, given the initial URL https://start.url/abc, URLs that resolve to https://start.url/abc/foo and https://start.url/abc/foo/bar are valid URLs, but ones that resolve to https://another.domain/ or to https://start.url/baz are not valid URLs, and should be skipped.
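
A minimal sketch of such a check using the standard net/url package (the helper name is an assumption, and this version also treats the starting URL itself as in scope; the repository's actual code may differ):

  package main

  import (
      "fmt"
      "net/url"
      "strings"
  )

  // isValidLink reports whether href, resolved against base, points to the
  // base URL itself or to a child path on the same host.
  func isValidLink(base *url.URL, href string) bool {
      ref, err := url.Parse(href)
      if err != nil {
          return false
      }

      resolved := base.ResolveReference(ref)
      if resolved.Host != base.Host {
          return false
      }

      basePath := strings.TrimSuffix(base.Path, "/")
      return resolved.Path == basePath || strings.HasPrefix(resolved.Path, basePath+"/")
  }

  func main() {
      base, _ := url.Parse("https://start.url/abc")

      fmt.Println(isValidLink(base, "https://start.url/abc/foo"))     // true
      fmt.Println(isValidLink(base, "https://start.url/abc/foo/bar")) // true
      fmt.Println(isValidLink(base, "https://another.domain/"))       // false
      fmt.Println(isValidLink(base, "https://start.url/baz"))         // false
  }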

Additionally, the crawler should:

  • Correctly handle being interrupted by Ctrl-C (a minimal sketch follows this list)
  • Perform work in parallel where reasonable
  • Support resuming by checking the destination directory for already-downloaded pages and skipping downloads and processing where unnecessary
  • Provide “happy-path” test coverage
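
To illustrate the Ctrl-C requirement, a crawler can tie its work to a context that is cancelled on SIGINT, so in-flight downloads stop cleanly. A minimal sketch using only the standard library (not the repository's actual code):

  package main

  import (
      "context"
      "fmt"
      "os"
      "os/signal"
      "syscall"
      "time"
  )

  func main() {
      // ctx is cancelled when the process receives SIGINT (Ctrl-C) or SIGTERM.
      ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
      defer stop()

      // Stand-in for a crawl queue; a real crawler would also pass ctx to its
      // HTTP requests via http.NewRequestWithContext.
      urls := []string{"https://example.com/a", "https://example.com/b"}

      for _, u := range urls {
          select {
          case <-ctx.Done():
              fmt.Println("interrupted, stopping crawl")
              return
          default:
          }

          fmt.Println("would download", u)
          time.Sleep(100 * time.Millisecond) // simulate work
      }
  }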

Some tips:

  • If you’re not familiar with this kind of software, see wget --mirror for very similar functionality.
  • Document missing features and any other changes you would make if you had more time for the assignment implementation.

Usage

Install crawler using the command below:

go install

To view available options and usage, run

crawler --help

To crawl a website, run

crawler -s https://example.com -d downloads
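
For illustration, the -s and -d options could be wired up with the standard flag package roughly as shown below; the actual cmd/crawler implementation may differ:

  package main

  import (
      "flag"
      "fmt"
      "os"
  )

  func main() {
      // -s: starting URL to crawl, -d: destination directory for mirrored pages.
      startURL := flag.String("s", "", "starting URL to crawl")
      destDir := flag.String("d", "downloads", "destination directory for downloaded pages")
      flag.Parse()

      if *startURL == "" {
          flag.Usage()
          os.Exit(1)
      }

      fmt.Printf("crawling %s into %s\n", *startURL, *destDir)
      // ... hand off to the crawler here.
  }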

Running tests and benchmarks

By default, the project uses GitHub Actions to run tests and benchmarks. To run them locally, use the commands below.

Tests:

  go test -cover -race ./... -v

Benchmarks:

  go test -bench=. ./...
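
The benchmark names in the results below follow Go's standard testing.B convention; a generic skeleton of that style is shown here (the function and its body are placeholders, not the package's actual benchmarks):

  package crawler_test

  import "testing"

  // BenchmarkCrawl shows the general shape of a Go benchmark: the loop body
  // runs b.N times and the framework reports the average cost per iteration.
  func BenchmarkCrawl(b *testing.B) {
      for i := 0; i < b.N; i++ {
          // Call the code under test here, e.g. crawl a local httptest server.
      }
  }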

Benchmark Results on Apple M1 Pro:

goos: darwin
goarch: arm64
pkg: github.com/jwambugu/crawler/cmd/crawler
BenchmarkCrawler_Crawl-8                           25808             44546 ns/op
BenchmarkCrawler_CrawlWithoutConcurrency-8         34762             34582 ns/op
PASS
ok      github.com/jwambugu/crawler/cmd/crawler 3.955s

Benchmark Results on Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz:

goos: linux
goarch: amd64
pkg: github.com/jwambugu/crawler/cmd/crawler
cpu: Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
BenchmarkCrawler_Crawl-2                     	   15939	     75629 ns/op
BenchmarkCrawler_CrawlWithoutConcurrency-2   	   23270	     50507 ns/op
PASS
ok  	github.com/jwambugu/crawler/cmd/crawler	3.680s

Missing features

  • The crawler should not exit when a link returns a 404; it should skip the missing link's URL and continue from the previous page (a rough sketch of this follows the list).
  • Keep track of the last crawled link and resume from it instead of starting afresh.
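
For the first item, one possible approach is to treat a 404 response as a skip rather than a fatal error. A rough sketch, using a hypothetical fetch helper rather than the project's real code:

  package main

  import (
      "fmt"
      "io"
      "net/http"
  )

  // fetch downloads u and returns ok=false without an error for a 404, so the
  // caller can skip the missing link and continue with the rest of the queue.
  func fetch(u string) (body []byte, ok bool, err error) {
      resp, err := http.Get(u)
      if err != nil {
          return nil, false, err
      }
      defer resp.Body.Close()

      if resp.StatusCode == http.StatusNotFound {
          fmt.Println("skipping missing link:", u)
          return nil, false, nil
      }
      if resp.StatusCode != http.StatusOK {
          return nil, false, fmt.Errorf("unexpected status %d for %s", resp.StatusCode, u)
      }

      body, err = io.ReadAll(resp.Body)
      return body, err == nil, err
  }

  func main() {
      if body, ok, err := fetch("https://example.com/"); err != nil {
          fmt.Println("error:", err)
      } else if ok {
          fmt.Printf("downloaded %d bytes\n", len(body))
      }
  }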

