As the exercise was completed in under 2 hours, improvements could of course be made:
- The scraper is returned from a factory function.
- The document is now a field on the scraper struct.
- The document is created from an `io.Reader` value for easier testing. This means the tests can now pull raw HTML from test fixtures for parsing. There is an open PR that also pulls the fixture data from strings (using `strings.NewReader`).
- Some concurrency has been added for element processing via a `WaitGroup`.
- This scraper has a single dependency, goquery, which simplifies DOM traversal. Otherwise it's all standard library (apart from mux for the web part).
- Dependency management is handled using the Go 1.5+ vendor flag and git submodules.
- The current tests hit real HTML endpoints (in gists), which are slower than mocks, but goquery's `NewDocument` takes a URI to operate on, so they were used for this pass. Extensions to this code would probably swap in some form of mocking.
- The exercise required a console application. This solution delivers both a console and a web version, as there was time!
- Time taken to build this solution: 1 hour 45 minutes.
- Clone this repository into your GOPATH.
- Run `source dev_env` to set the Go 1.5+ vendor flag environment variable.
- Run `git submodule init` and `git submodule update`.
- In the `console` folder run `go build`.
- Run `./console` and bathe in the JSON goodness, or `./console > scraped.json`.
- To hit another page with fruity goodness, run `./console -URI=https://gist.githubusercontent.com/buddhamagnet/c6997464d84b8bf379a1/raw/0b2e8ca65fed073197cd7be1e91a163738488f2e/fruity.html`.
- In the `web` folder run `go build`.
- Run `./web` and hit `localhost:9494/scrape`.
- Run `go test ./...`.