generic spider crawler behaviour interface

GenSpider

GenSpider is a behaviour for defining Spiders.

Spiders are modules that define how a site (or a group of sites) is scraped: how to perform the crawl (that is, which links to follow) and how to extract structured data from the pages (that is, which items to scrape). A Spider is therefore the place where you define the custom crawling and parsing behaviour for a particular site or, in some cases, a group of sites.
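To make the split between crawling and scraping concrete, here is a minimal, self-contained sketch. It does not use GenSpider's actual callbacks: the SpiderSketch behaviour, its start_urls/0 and parse/1 callbacks, the QuotesSketch module, the start URL, and the HTML shape it expects are all assumptions made for illustration.

```elixir
defmodule SpiderSketch do
  # Hypothetical behaviour for illustration only; these are NOT
  # GenSpider's real callbacks.
  @callback start_urls() :: [String.t()]
  @callback parse(String.t()) :: {[map()], [String.t()]}
end

defmodule QuotesSketch do
  @behaviour SpiderSketch

  # Assumed start URL, used only as an example value.
  @impl true
  def start_urls, do: ["http://quotes.toscrape.com/"]

  @impl true
  def parse(body) do
    # Scraping: extract structured items from the page body.
    items =
      ~r/<span class="text">(.*?)<\/span>/s
      |> Regex.scan(body, capture: :all_but_first)
      |> Enum.map(fn [text] -> %{text: text} end)

    # Crawling: collect the links to follow next.
    next_urls =
      ~r/<li class="next">\s*<a href="([^"]+)"/
      |> Regex.scan(body, capture: :all_but_first)
      |> List.flatten()

    {items, next_urls}
  end
end
```

Calling QuotesSketch.parse/1 on a page body returns a tuple of scraped items and links to follow. A real spider would hand the links back to the crawler and use an HTML parser rather than regular expressions.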

Hello World

The basic Quotes Spider from Scrapy is implemented with gen_spider in both Erlang and Elixir.

Generic Spiders

GenSpider also comes with some useful generic spiders that can be found in the examples directory. Their aim is to provide convenient functionality for a few common scraping cases, like following all links on a site based on certain rules, crawling from Sitemaps, or parsing an XML/CSV feed.
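As a rough idea of what "following links based on certain rules" can mean, the sketch below filters candidate links with allow and deny patterns. LinkRulesSketch and its functions are hypothetical helpers written for this illustration; they are not taken from the examples directory or from GenSpider itself.

```elixir
defmodule LinkRulesSketch do
  # Hypothetical helper, not part of GenSpider.
  # A link is followed when it matches at least one allow pattern
  # and none of the deny patterns.
  def follow?(url, allow, deny) do
    Enum.any?(allow, &Regex.match?(&1, url)) and
      not Enum.any?(deny, &Regex.match?(&1, url))
  end

  # Keeps only the links that pass the rules, preserving their order.
  def filter(urls, allow, deny) do
    Enum.filter(urls, &follow?(&1, allow, deny))
  end
end
```

For example, LinkRulesSketch.filter(["/page/2/", "/login", "/tag/life/"], [~r{^/page/}, ~r{^/tag/}], [~r{login}]) keeps "/page/2/" and "/tag/life/" and drops "/login".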

Installation

If available in Hex, the package can be installed by adding gen_spider to your list of dependencies in mix.exs:

def deps do
  [
    {:gen_spider, "~> 0.1.0"}
  ]
end

Documentation can be generated with ExDoc and published on HexDocs. Once published, the docs can be found at https://hexdocs.pm/gen_spider.

Contributing

We welcome everyone to contribute to GenSpider and help us tackle existing issues!

Use the issue tracker for bug reports or feature requests. Open a pull request when you are ready to contribute.

When submitting a pull request, do not update CHANGELOG.md.

License

GenSpider source code is released under the Apache License 2.0. See the LICENSE file for more information.

About

An Erlang/Elixir behaviour to define Spiders

https://hex.pm/packages/gen_spider
