dudk / crawl

A naive web crawler

Recursive crawler

The crawler prints crawled URLs to stdout.

Input

  • URL to start crawling from

Design

The crawler is built on top of the following abstractions:

type (
	// Scheduler manages fetching execution.
	Scheduler interface {
		Start(context.Context, Fetcher) chan struct{}
	}

	// Fetcher is responsible for fetching the content of a web page and
	// calling Visitor and BodyReader.
	Fetcher interface {
		Fetch(context.Context, Visitor, BodyReader, string) []string
	}

	// BodyReader reads the HTTP response body. It must consume all data
	// available in io.Reader. BodyReader must respect the context and
	// return if it's cancelled.
	BodyReader interface {
		ReadBody(context.Context, *url.URL, io.Reader) error
	}

	// Visitor records whether a parsed URL has been visited. It's safe
	// for concurrent use.
	Visitor interface {
		Visit(string) bool
	}
)
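
As a concrete illustration, a Visitor could be backed by a mutex-guarded set. The sketch below is an assumption for illustration only (the mapVisitor and newMapVisitor names are hypothetical, not this repository's implementation), and it interprets the returned bool as "was already visited":

import "sync"

// mapVisitor is a hypothetical Visitor: a set of visited URLs guarded
// by a mutex, so it is safe for concurrent use.
type mapVisitor struct {
	mu      sync.Mutex
	visited map[string]bool
}

func newMapVisitor() *mapVisitor {
	return &mapVisitor{visited: make(map[string]bool)}
}

// Visit marks the URL as visited and reports whether it had already
// been visited before this call.
func (v *mapVisitor) Visit(url string) bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	seen := v.visited[url]
	v.visited[url] = true
	return seen
}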

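Similarly, a minimal BodyReader has to drain the reader fully and honor cancellation. This sketch (the discardReader name is hypothetical) simply discards the body while checking the context between reads; a real implementation would parse the page for links:

import (
	"context"
	"io"
	"net/url"
)

// discardReader is a hypothetical BodyReader that consumes the entire
// body without inspecting it.
type discardReader struct{}

func (discardReader) ReadBody(ctx context.Context, _ *url.URL, r io.Reader) error {
	buf := make([]byte, 32*1024)
	for {
		// Return early if the crawl was cancelled.
		select {
		case <-ctx.Done():
			return ctx.Err()
		default:
		}
		if _, err := r.Read(buf); err != nil {
			if err == io.EOF {
				return nil // body fully consumed
			}
			return err
		}
	}
}
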
Instructions

Run the following commands:

go build
./crawl -url https://godoc.org/
