Crawl

Introduction

We'd like you to write a simple web crawler in a programming language you're familiar with. Given a starting URL, the crawler should visit each URL it finds on the same domain. It should print each URL visited, along with a list of the links found on that page. The crawler should be limited to a single subdomain: when you start with http://www.example.com/, it should crawl all pages on www.example.com, but not follow external links, for example to facebook.com.

We would like to see your own implementation of a web crawler. Please do not use frameworks such as Scrapy or go-colly, which handle all the crawling behind the scenes, and do not reuse someone else's crawler code. You are welcome to use libraries for tasks such as HTML parsing.
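
To illustrate the same-subdomain restriction described above, here is a minimal Go sketch (not the code from this repository) that fetches a single page, parses it with golang.org/x/net/html, and keeps only links on the starting host. All function and variable names below are illustrative assumptions, not this repository's identifiers.

// Illustrative sketch only; not the implementation used in this repository.
package main

import (
	"fmt"
	"net/http"
	"net/url"

	"golang.org/x/net/html"
)

// sameHostLinks fetches a page and returns the absolute URLs of links
// whose host matches the starting page's host.
func sameHostLinks(page *url.URL) ([]string, error) {
	resp, err := http.Get(page.String())
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	doc, err := html.Parse(resp.Body)
	if err != nil {
		return nil, err
	}

	var links []string
	var visit func(*html.Node)
	visit = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "a" {
			for _, attr := range n.Attr {
				if attr.Key != "href" {
					continue
				}
				href, err := page.Parse(attr.Val) // resolves relative links against the page URL
				if err != nil {
					continue
				}
				if href.Hostname() == page.Hostname() {
					links = append(links, href.String())
				}
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			visit(c)
		}
	}
	visit(doc)
	return links, nil
}

func main() {
	start, err := url.Parse("https://www.example.com")
	if err != nil {
		panic(err)
	}
	links, err := sameHostLinks(start)
	if err != nil {
		panic(err)
	}
	fmt.Println(start, "->", links)
}

A full crawler would repeat this for each discovered link, tracking visited URLs and a depth limit; this sketch only shows the per-page extraction and host filtering.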

Prerequisites

  1. Go 1.17

Setup

$ git clone git@github.com:ChrisWilding/crawl.git
$ cd crawl

How To

Test

$ go test -v ./...

Run

$ go build .
$ ./crawl --help
Usage of ./crawl:
  -limit int
        limit to the number of levels of links to follow (default 100)
  -url string
        the url to crawl (default "https://www.example.com")
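
The help output above matches Go's standard flag package. As a rough sketch of how the -url and -limit flags could be declared (variable names here are assumptions, not this repository's code):

// Illustrative only: how flags like those shown above could be wired up
// with the standard library's flag package.
package main

import (
	"flag"
	"fmt"
)

func main() {
	url := flag.String("url", "https://www.example.com", "the url to crawl")
	limit := flag.Int("limit", 100, "limit to the number of levels of links to follow")
	flag.Parse()

	fmt.Printf("crawling %s to a depth of %d levels\n", *url, *limit)
	// the crawl itself would start here
}

For example, to crawl a specific site to a depth of two levels (output depends on the site being crawled):

$ ./crawl -url https://www.example.com -limit 2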

Run with Docker

$ docker pull ghcr.io/chriswilding/crawl:latest
$ docker run --rm -ti ghcr.io/chriswilding/crawl:latest --help
Usage of /ko-app/crawl:
  -limit int
        limit to the number of levels of links to follow (default 100)
  -url string
        the url to crawl (default "https://www.example.com")
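
Assuming the image's entrypoint is the crawl binary, as the --help output above suggests, the same flags can be passed after the image name. For example (output depends on the site being crawled):

$ docker run --rm -ti ghcr.io/chriswilding/crawl:latest -url https://www.example.com -limit 2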

License

Apache License 2.0

