GenSpider
GenSpider is a behaviour for defining Spiders.
Spiders are modules that define how a particular site (or a group of sites) will be scraped: how to perform the crawl (that is, which links to follow) and how to extract structured data from the pages (that is, how to scrape items). In other words, a Spider is where you put the custom crawling and parsing behaviour for that site.
Hello World
The basic Quotes Spider from Scrapy is implemented with gen_spider in both Erlang and Elixir.
Generic Spiders
GenSpider also comes with some useful generic spiders that can be found in the examples directory. Their aim is to provide convenient functionality for a few common scraping cases, like following all links on a site based on certain rules, crawling from Sitemaps, or parsing an XML/CSV feed.
Installation
If available in Hex, the package can be installed by adding gen_spider to your list of dependencies in mix.exs:
    def deps do
      [
        {:gen_spider, "~> 0.1.0"}
      ]
    end
Documentation can be generated with ExDoc and published on HexDocs. Once published, the docs can be found at https://hexdocs.pm/gen_spider.
Contributing
We welcome everyone to contribute to GenSpider and help us tackle existing issues!
Use the issue tracker for bug reports or feature requests. Open a pull request when you are ready to contribute.
When submitting a pull request, you should not update the CHANGELOG.md.
License
GenSpider source code is released under the Apache 2 License. Check the LICENSE file for more information.