This library exists to make scraping a bit easier for business use cases.
Available in Hex, the package can be installed by adding scraper_ex to your list of dependencies in mix.exs:
    def deps do
      [
        {:scraper_ex, "~> 0.1.0"}
      ]
    end
The docs can be found at https://hexdocs.pm/scraper_ex.
ScraperEx uses Hound under the hood, which means you can configure Hound to use any browser/runner you'd like. By default we use chrome_headless.
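As a rough illustration of the point above, this is how Hound itself is usually configured through the application environment. The `driver` and `browser` keys are Hound's own options. Placing them in config/config.exs and the choice of values are assumptions for this sketch, and ScraperEx may already set its own defaults.

```elixir
# config/config.exs (assumed location for this sketch)
import Config

# Hound reads its driver and browser from the :hound application env.
# These values match the chrome_headless default mentioned above.
config :hound, driver: "chrome_driver", browser: "chrome_headless"

# A possible alternative, assuming a Selenium server is running locally:
# config :hound, driver: "selenium", browser: "firefox"
```

Whichever driver you pick has to be installed and running separately (for example chromedriver); Hound only connects to it.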
The ScraperEx.Window module exists to manage windows with Hound. By default, Hound's window management doesn't help much with session management, which leads to zombie windows hanging around that can really start to eat up memory. To avoid this, we can use ScraperEx.Window to run and interact with an individual session.
The two most useful functions are ScraperEx.run_task_in_window and ScraperEx.run_task. ScraperEx.run_task lets you supply the steps for a scraper, while ScraperEx.run_task_in_window does the same and also starts a window for you. The bare version won't start one, so you will need to manage your own ScraperEx.Window.
Tasks (flows) are defined by configs. You can either use the struct form, via the ScraperEx.Task.Config modules, or use the short forms.
The following actions are currently implemented:
:navigate_to or ScraperEx.Task.Config.Navigate
:input or ScraperEx.Task.Config.Input
:click or ScraperEx.Task.Config.Click
:read or ScraperEx.Task.Config.Read
:screenshot or ScraperEx.Task.Config.Screenshot
:scroll or ScraperEx.Task.Config.Scroll
:sleep or ScraperEx.Task.Config.Sleep
:send_text or ScraperEx.Task.Config.SendText
:send_keys or ScraperEx.Task.Config.SendKeys
:javascript or ScraperEx.Task.Config.Javascript
You can allow errors by wrapping a command in ScraperEx.allow_error:
    ScraperEx.allow_error({:click, {:css, ".thing"}})
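To show where a wrapped command might sit inside a flow, here is a minimal sketch. The URL, the selectors, and the `:heading` key are hypothetical placeholders. It assumes the wrapped command can be placed in the step list like any other step, and it needs the scraper_ex dependency plus a running browser driver to execute.

```elixir
# Hypothetical flow: the URL and selectors are placeholders for illustration.
flow = [
  {:navigate_to, "https://example.com"},
  # Optional step: if this click fails (say, the banner is not shown),
  # the error is allowed rather than treated as a failure of the task.
  ScraperEx.allow_error({:click, {:css, ".cookie-banner button"}}),
  {:read, :heading, {:css, "h1"}}
]

# Starts a window, runs the steps, and returns the collected reads.
ScraperEx.run_task_in_window(flow)
```

Wrapping only the optional steps keeps genuine failures, such as a missing heading, visible.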
    iex> ScraperEx.run_task_in_window([
    ...>   {:navigate_to, "https://en.wikipedia.org/wiki/Example.com"},
    ...>   {:read, :references, {:css, ".reference-text"}},
    ...>   {:read, :page_title, {:id, "firstHeading"}},
    ...>   {:read, :external_link_4, {:css, "#bodyContent ul:nth-child(21) li:nth-child(4)"}},
    ...>   {:click, {:css, "h2:has(#External_links) + ul li:nth-of-type(3) a"}, :timer.seconds(1)},
    ...>   {:read, :clicked_url, {:css, "h1"}}
    ...> ])
    {:ok, %{
      page_title: "example.com",
      external_link_4: "example.edu",
      clicked_url: "Example Domain",
      references: [
        "\"IANA WHOIS Service\". IANA. Retrieved 2022-10-25.",
        "\"IANA-managed Reserved Domains\". IANA. Retrieved 2020-06-20.",
        "RFC 2606, Reserved Top Level DNS Names, D. Eastlake, A. Panitz, The Internet Society (June 1999), Section 3.",
        "RFC 6761, S. Cheshire, M. Krochmal, Special-Use Domain Names, IETF (February 2013)"
      ]
    }}
We can test and mock out specific flow responses by using ScraperEx.Sandbox.
First, we must call ScraperEx.Sandbox.start_link() in our test_helper.exs file.
Then, inside each test, we can call ScraperEx.Sandbox.set_run_task_result(my_flow(), %{my_result: :ok}) to set the response of a specific flow.
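A test using the sandbox might look roughly like the sketch below. The module name, the `my_flow/0` helper, and its steps are hypothetical. It also assumes the sandbox has been started in the test helper and that the library is set up to use the sandbox in the test environment. The exact shape of the value returned for a stubbed flow is not stated here, so the sketch inspects it rather than asserting on it.

```elixir
defmodule MyApp.ScraperTest do
  use ExUnit.Case

  # Hypothetical flow, defined once so the stub and the call use the same steps.
  defp my_flow do
    [
      {:navigate_to, "https://example.com"},
      {:read, :heading, {:css, "h1"}}
    ]
  end

  test "runs the flow against the sandbox" do
    # Register the result the sandbox should hand back for this flow.
    ScraperEx.Sandbox.set_run_task_result(my_flow(), %{heading: "Stubbed heading"})

    # Assumption: with the sandbox active, no real browser window is opened.
    result = ScraperEx.run_task_in_window(my_flow())

    # Inspect the value first, then replace this with an assertion on its shape.
    IO.inspect(result, label: "sandboxed result")
  end
end
```

Defining the flow in one function keeps the stubbed steps and the executed steps identical, which matters if the sandbox looks results up by the flow itself.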