dudk / crawl

A naive web crawler

Recursive crawler

The crawler prints crawled URLs to stdout.

Input

  • URL to start crawling from

Design

The crawler is built on top of the following abstractions:

type (
	// Scheduler manages fetching execution.
	Scheduler interface {
		Start(context.Context, Fetcher) chan struct{}
	}

	// Fetcher is responsible for fetching the content of a web page and
	// calling Visitor and BodyReader.
	Fetcher interface {
		Fetch(context.Context, Visitor, BodyReader, string) []string
	}

	// BodyReader reads the HTTP response body. It must consume all data
	// available in io.Reader. BodyReader must respect the context and
	// return if it's cancelled.
	BodyReader interface {
		ReadBody(context.Context, *url.URL, io.Reader) error
	}

	// Visitor records whether a parsed URL has been visited. It's safe
	// for concurrent use.
	Visitor interface {
		Visit(string) bool
	}
)
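
As a concrete illustration, a Visitor could be backed by a mutex-guarded set. The sketch below is an assumption for illustration only (the mapVisitor and newMapVisitor names are hypothetical, not this repository's implementation), and it interprets the returned bool as "was already visited":

import "sync"

// mapVisitor is a hypothetical Visitor: a set of visited URLs guarded
// by a mutex, so it is safe for concurrent use.
type mapVisitor struct {
	mu      sync.Mutex
	visited map[string]bool
}

func newMapVisitor() *mapVisitor {
	return &mapVisitor{visited: make(map[string]bool)}
}

// Visit marks the URL as visited and reports whether it had already
// been visited before this call.
func (v *mapVisitor) Visit(url string) bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	seen := v.visited[url]
	v.visited[url] = true
	return seen
}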

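Similarly, a minimal BodyReader has to drain the reader fully and honor cancellation. This sketch (the discardReader name is hypothetical) simply discards the body while checking the context between reads; a real implementation would parse the page for links:

import (
	"context"
	"io"
	"net/url"
)

// discardReader is a hypothetical BodyReader that consumes the entire
// body without inspecting it.
type discardReader struct{}

func (discardReader) ReadBody(ctx context.Context, _ *url.URL, r io.Reader) error {
	buf := make([]byte, 32*1024)
	for {
		// Return early if the crawl was cancelled.
		select {
		case <-ctx.Done():
			return ctx.Err()
		default:
		}
		if _, err := r.Read(buf); err != nil {
			if err == io.EOF {
				return nil // body fully consumed
			}
			return err
		}
	}
}
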
Instructions

Run the following commands:

go build
./crawl -url https://godoc.org/
