calini / grape

Simple, powerful and configurable scraper written in Go

Grape

This is a fork of philipithomas/iterscraper. Thanks, Philip I. Thomas.

The URL can contain either an incrementing id (%d) or a token read from a file (%s; more on that later). Information is retrieved from HTML elements and written out as CSV.

Thanks to Francesc for featuring the original repo in episode #1 of Just For Func. Watch the video or review Francesc's pull request.

Installation

$ go get -u go.ilie.io/grape

Modes

There are three modes for querying data.

  1. Iterative
$ grape                                  \
    -url      https://github.com/%d      \
    -from     100                        \
    -to       105                        \
    -query    ".p-name .p-org .p-label"

This mode iterates over the ids 100 to 105, substituting each one into the %d placeholder of the URL. (Interestingly, the username 100 exists.)
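
As a rough illustration of the iterative mode, the sketch below expands a URL template over a range of ids. The function name `expandRange` is hypothetical, and treating both `-from` and `-to` as inclusive is an assumption; grape's actual implementation may differ.

```go
package main

import "fmt"

// expandRange fills the %d verb in urlTemplate with every id between from and to.
// Assumption: both bounds are inclusive. This is a sketch, not grape's own code.
func expandRange(urlTemplate string, from, to int) []string {
	if to < from {
		return nil
	}
	urls := make([]string, 0, to-from+1)
	for id := from; id <= to; id++ {
		urls = append(urls, fmt.Sprintf(urlTemplate, id))
	}
	return urls
}

func main() {
	// Mirrors the command above: -url https://github.com/%d -from 100 -to 105
	for _, u := range expandRange("https://github.com/%d", 100, 105) {
		fmt.Println(u)
	}
}
```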

  2. Dictionary
$ grape                                                     \
    -dict     $GOPATH/src/go.ilie.io/grape/dicts/users.txt  \
    -url      https://github.com/%s                         \
    -query    ".p-name"

This mode will use a dictionary and query each term in it.

An example result looks like this:

url                          id        .p-name
https://github.com/calini    calini    Calin Ilie

  3. Dictionary range
$ grape                                                     \
    -dict     $GOPATH/src/go.ilie.io/grape/dicts/users.txt  \
    -from     2                                             \
    -to       4                                             \
    -url      https://github.com/%s                         \
    -query    ".p-name .p-org .p-label"

This mode will use a dictionary and query each term within the specified range.
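
The two dictionary modes can be pictured with the sketch below: read one term per line, optionally keep only a slice of them, and substitute each term into the %s placeholder. The helpers `readTerms` and `termURLs` are hypothetical names, and the zero-based, inclusive interpretation of `-from` and `-to` is an assumption that may not match grape's behaviour. The inline dictionary is made up for the example, apart from `calini`.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// readTerms returns the non-empty lines of a dictionary, one term per line.
func readTerms(dict string) []string {
	var terms []string
	sc := bufio.NewScanner(strings.NewReader(dict))
	for sc.Scan() {
		if t := strings.TrimSpace(sc.Text()); t != "" {
			terms = append(terms, t)
		}
	}
	return terms
}

// termURLs substitutes terms[from..to] into the %s verb of urlTemplate.
// Assumption: from and to are zero-based and inclusive; out-of-range
// bounds are clamped to the dictionary size.
func termURLs(urlTemplate string, terms []string, from, to int) []string {
	if from < 0 {
		from = 0
	}
	if to >= len(terms) {
		to = len(terms) - 1
	}
	var urls []string
	for i := from; i <= to; i++ {
		urls = append(urls, fmt.Sprintf(urlTemplate, terms[i]))
	}
	return urls
}

func main() {
	// Hypothetical dictionary contents; the real users.txt is not shown here.
	terms := readTerms("alice\nbob\ncalini\ndave\nerin\n")

	// Plain dictionary mode: every term.
	fmt.Println(termURLs("https://github.com/%s", terms, 0, len(terms)-1))

	// Dictionary range mode: -from 2 -to 4 under the stated assumption.
	fmt.Println(termURLs("https://github.com/%s", terms, 2, 4))
}
```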

Selector Syntax

You can select HTML elements with classic jQuery syntax (thanks to goquery). The only difference is that I have added the ability to use § as a separator, so that you can query for an attribute of the element and not only its text. Example:

$ grape                                                   \
  -dict     $GOPATH/src/go.ilie.io/grape/dicts/users.txt  \
  -url      https://github.com/%s                         \
  -query    ".p-name .u-photo>img§src"

Will produce:

url                          id        .p-name       .u-photo>img§src 
https://github.com/calini    calini    Calin Ilie    https://avatars2.githubusercontent.com/u/9298529?s=460&v=4
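
One way to read the `-query` value shown above is sketched here: split it on whitespace into one entry per output column, then split each entry on § into a selector and an optional attribute name. The `column` type and `parseQuery` function are hypothetical, and the whitespace splitting is inferred from the example output rather than taken from grape's source.

```go
package main

import (
	"fmt"
	"strings"
)

// column describes one output column: a CSS selector and, optionally,
// the attribute to read instead of the element's text.
type column struct {
	Selector string
	Attr     string // empty means "use the element's text"
}

// parseQuery splits a -query value into columns.
// Assumption: columns are separated by whitespace, and the first § in a
// column separates the selector from the attribute name.
func parseQuery(query string) []column {
	var cols []column
	for _, field := range strings.Fields(query) {
		sel, attr, _ := strings.Cut(field, "§")
		cols = append(cols, column{Selector: sel, Attr: attr})
	}
	return cols
}

func main() {
	for _, c := range parseQuery(".p-name .u-photo>img§src") {
		if c.Attr == "" {
			fmt.Printf("%s: element text\n", c.Selector)
		} else {
			fmt.Printf("%s: attribute %q\n", c.Selector, c.Attr)
		}
	}
}
```

With goquery, the selector part would typically go to `Find` and the attribute part to `Attr`, falling back to `Text` when no attribute is given.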

Flags

The only mandatory flag is -url.

For an explanation of the options, type grape -help

General usage of iterscraper:

TODO REPLACE THIS WITH `grape -help`

Errata

  • On a 429 (too many requests) error, the app logs the error and continues, skipping that request.
  • On a 404 (not found) error, the app logs the miss and continues. The miss is not exported to the CSV.
  • The app follows up to 10 redirects.
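
The behaviour listed above could be expressed roughly as follows. This is a sketch under assumptions, not grape's actual code: `skipReason` and `checkRedirect` are hypothetical names, and the log messages are invented. The redirect limit is written out explicitly, although Go's default `http.Client` already stops after 10 redirects.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

const maxRedirects = 10

// checkRedirect stops following redirects once maxRedirects have been made.
func checkRedirect(req *http.Request, via []*http.Request) error {
	if len(via) >= maxRedirects {
		return errors.New("stopped after 10 redirects")
	}
	return nil
}

// skipReason returns a log message and true when a response with the given
// status should be logged and left out of the CSV.
func skipReason(status int) (string, bool) {
	switch status {
	case http.StatusTooManyRequests:
		return "429 too many requests, skipping", true
	case http.StatusNotFound:
		return "404 not found, skipping", true
	}
	return "", false
}

func main() {
	client := &http.Client{CheckRedirect: checkRedirect}
	_ = client // a real run would call client.Get(url) for each URL

	// Simulated status codes stand in for real responses.
	for _, status := range []int{200, 404, 429} {
		if msg, skip := skipReason(status); skip {
			fmt.Println("log:", msg)
			continue
		}
		fmt.Println("export row for status", status)
	}
}
```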

License

The project itself is released under the MIT License.
