tazsingh / crawlie

A simple Elixir web crawler library powered by GenStage.

Crawlie (the crawler)

Crawlie is a simple Elixir library for writing decently-performing crawlers with minimum effort.

Usage example

See the crawlie_example project.

Inner workings

Crawlie uses Elixir's GenStage to parallelise the work. Most of the logic is handled by the UrlManager, which consumes the url collection passed by the user, receives the urls extracted during subsequent processing, makes sure no url is processed more than once, and keeps the "discovered urls" collection as small as possible by traversing the url tree in a roughly depth-first manner.
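
The bookkeeping described above can be illustrated with a small sketch. This is not Crawlie's actual UrlManager: the module name `FrontierSketch` and its functions are hypothetical, and the sketch only assumes that deduplication is done with a set of seen urls and that "roughly depth-first" means the most recently discovered urls are handed out first.

```elixir
defmodule FrontierSketch do
  # Hypothetical illustration only; not Crawlie's UrlManager.
  # `pending` is used as a stack, `seen` remembers every url ever added.
  defstruct pending: [], seen: MapSet.new()

  # Build a frontier from the user-supplied seed urls.
  def new(urls), do: add(%__MODULE__{}, urls)

  # Add discovered urls, skipping any that were seen before.
  # New urls are pushed onto the front of `pending`, so they are taken
  # first (LIFO), which yields a roughly depth-first traversal.
  def add(%__MODULE__{} = frontier, urls) do
    Enum.reduce(urls, frontier, fn url, acc ->
      if MapSet.member?(acc.seen, url) do
        acc
      else
        %{acc | pending: [url | acc.pending], seen: MapSet.put(acc.seen, url)}
      end
    end)
  end

  # Take the next url to process, or return :empty when nothing is pending.
  def take(%__MODULE__{pending: []}), do: :empty

  def take(%__MODULE__{pending: [url | rest]} = frontier) do
    {url, %{frontier | pending: rest}}
  end
end
```

Handing out the newest urls first means a branch of the url tree tends to be explored to its end before its siblings are started, which keeps the pending collection small compared to a breadth-first queue.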

The urls are requested from the UrlManager by a GenStage Flow, which fetches the urls in parallel using HTTPoison and parses the responses using user-provided callbacks. Discovered urls are sent back to the UrlManager.
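
As a rough illustration of the fetch-and-parse stage, here is a minimal sketch that uses `Task.async_stream` from the standard library as a stand-in for Flow, and plain functions as stand-ins for HTTPoison and the user-provided callbacks. The module `PipelineSketch`, the `fetch` and `parse` arguments, and their return shapes are all assumptions made for the example, not Crawlie's API.

```elixir
defmodule PipelineSketch do
  # Hypothetical stand-in for the parallel stage; not Crawlie's API.
  #   fetch: url -> {:ok, body} | {:error, reason}
  #   parse: (url, body) -> {results, discovered_urls}
  def process(urls, fetch, parse) do
    urls
    |> Task.async_stream(
      fn url ->
        case fetch.(url) do
          {:ok, body} -> parse.(url, body)
          # Failed fetches contribute no results and no new urls.
          {:error, _reason} -> {[], []}
        end
      end,
      max_concurrency: 4,
      ordered: false
    )
    |> Enum.reduce({[], []}, fn {:ok, {results, found}}, {all_results, all_found} ->
      {results ++ all_results, found ++ all_found}
    end)
  end
end

# Usage with an in-memory "web" instead of real HTTP requests.
pages = %{"u1" => "u2 u3", "u2" => ""}

fetch = fn url ->
  case Map.fetch(pages, url) do
    {:ok, body} -> {:ok, body}
    :error -> {:error, :not_found}
  end
end

# The parse callback returns one result per page plus the urls it found.
parse = fn url, body -> {[{url, byte_size(body)}], String.split(body)} end

{results, discovered} = PipelineSketch.process(["u1", "u2", "u9"], fetch, parse)
```

In a crawler built this way, `discovered` would be fed back to the url bookkeeping component, and `results` would be handed to the caller. Because `ordered: false` is set, the order of both lists is not deterministic.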

Here's a rough diagram:

[Figure: crawlie architecture diagram]

Configuration

See the docs for supported options.

Planned features

  • An easier way to limit crawling to a (sub)domain
  • An option to respect websites' robots.txt (on by default)

Installation

The package can be installed as:

  1. Add crawlie to your list of dependencies in mix.exs:
```elixir
def deps do
  [{:crawlie, "~> 0.3.0"}]
end
```
  2. Ensure crawlie is started before your application:
```elixir
def application do
  [applications: [:crawlie]]
end
```

License: MIT License
