gushonorato / mechanize

Build web scrapers and automate interaction with websites in Elixir with ease!

Mechanize

One of Mechanize's main design goals is to let developers easily create concurrent web scrapers without the computing cost of headless browsers. Mechanize is heavily inspired by the Ruby version of Mechanize. It features:

  • Follow hyperlinks
  • Scrape data easily using CSS selectors
  • Populate and submit forms
  • Follow and track 3xx redirects
  • Follow meta-refresh
  • Automatically store and send cookies (TODO)
  • Proxy support (TODO)
  • Keep track of the sites you have visited as a history (TODO)
  • File upload (TODO)
  • Obey robots.txt (TODO)
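
The concurrency goal mentioned above can be sketched with Task.async_stream/3 from the Elixir standard library. This is a minimal sketch, not part of Mechanize itself: the ConcurrentScrape module, its fetch_all function, and the injectable fetch argument are hypothetical names, and it assumes that starting one browser per task is acceptable. Whether a single browser can safely be shared between processes is not covered here.

```elixir
defmodule ConcurrentScrape do
  alias Mechanize.Browser

  # Fetches every URL concurrently and returns the results in input order.
  # `fetch` is injectable so the sketch can be tried without network access;
  # by default each task starts its own browser and fetches one page.
  def fetch_all(urls, fetch \\ &default_fetch/1, max_concurrency \\ 4) do
    urls
    |> Task.async_stream(fetch, max_concurrency: max_concurrency, timeout: 30_000)
    |> Enum.map(fn {:ok, result} -> result end)
  end

  # Assumption: one browser per task, using only Browser.new/0 and
  # Browser.get!/2 as shown in the "Fetching a page" section.
  defp default_fetch(url) do
    Browser.new() |> Browser.get!(url)
  end
end
```

Calling ConcurrentScrape.fetch_all(urls) with a list of URL strings would then return one page per URL. A task that raises or exceeds the timeout takes the caller down with it, so a real scraper would probably want more careful error handling.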

Installation

Warning: This library is under active development and its public API will probably change. Use it with care in production systems.

The package can be installed by adding mechanize to your list of dependencies in mix.exs:

def deps do
  [
    {:mechanize, "~> 0.1"}
  ]
end

Getting started

This guide shows how to perform the most basic tasks with Mechanize: fetching pages, clicking links, filling out and submitting forms, and scraping data.

Fetching a page

First you'll have to start Mechanize:

alias Mechanize.Browser

browser = Browser.new()

Or using a more verbose alternative:

{:ok, browser} = Browser.start_link()

Now we'll use the browser we've started to fetch a page. Let's fetch Google with our mechanize browser:

page = Browser.get!(browser, "https://www.google.com")

What just happened? We told mechanize to go pick up Google's main page. Mechanize followed any redirects that Google may have sent. The browser gave us back a page that we can use to scrape data, find links to click, or find forms to fill out.

Next, let's try finding some links to click.

Finding Links

Mechanize returns a page struct whenever you get a page, post, or submit a form. Now that we've fetched Google's homepage, let's try listing all of the links:

alias Mechanize.Page
alias Mechanize.Page.Element

page
|> Page.links()
|> Enum.each(fn link ->
  IO.puts Element.text(link)
end)

We can list the links, but Mechanize gives a few shortcuts to help us find a link to click on. Let's say we wanted to click the link whose text is 'News'. Normally, we would have to do this:

alias Mechanize.Page
alias Mechanize.Page.Element
alias Mechanize.Page.Link

page
|> Page.links()
|> Enum.filter(fn link -> Element.text(link) == "News" end)
|> List.first()
|> Link.click!()

But Mechanize gives us a shortcut. Instead we can do this:

alias Mechanize.Page
alias Mechanize.Page.Link

page
|> Page.link_with(text: "News")
|> Link.click!()

Or even shorter, with just one line:

alias Mechanize.Page

Page.click_link!(page, text: "News")

You're probably thinking "there could be multiple links with that text!", and you would be correct! If you use the plural form, you can access the list. If you wanted to click on the second news link, you could do this:

alias Mechanize.Page
alias Mechanize.Page.Link

page
|> Page.links_with(text: "News")
|> Enum.at(1) # zero-based index, so 1 is the second match
|> Link.click!()
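
Enum.at/2 returns nil when fewer than two links match, so it can be worth checking before clicking. A minimal defensive sketch, using Enum.fetch/2 from the standard library; the LinkPicker module and its function names are hypothetical and not part of Mechanize:

```elixir
defmodule LinkPicker do
  alias Mechanize.Page
  alias Mechanize.Page.Link

  # Returns {:ok, item} for the item at `index` (zero-based), or
  # {:error, :not_found} when the list is too short.
  def nth(items, index) do
    case Enum.fetch(items, index) do
      {:ok, item} -> {:ok, item}
      :error -> {:error, :not_found}
    end
  end

  # Clicks the link at `index` among the links whose text matches, if any.
  # Falls through with {:error, :not_found} when there is no such link.
  def click_nth(page, text, index) do
    with {:ok, link} <- page |> Page.links_with(text: text) |> nth(index) do
      {:ok, Link.click!(link)}
    end
  end
end
```

With this helper, LinkPicker.click_nth(page, "News", 1) would return either {:ok, new_page} or {:error, :not_found} instead of failing on a nil link.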

We can even find a link by matching its href against a regular expression:

alias Mechanize.Page

Page.link_with(page, href: ~r/something/)

Or chain them together to find a link with certain text and certain href:

alias Mechanize.Page

Page.link_with(page, text: "News", href: "/news")

Now that we know how to find and click links, let's try something more complicated like filling out a form.

Filling out forms

Let's continue with our Google example.

If we look at the HTML of the page, we can see that there is one form named 'f', which has a couple of buttons and a few fields. You can see this by saving the page to a file and opening it in your favorite text editor.

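# Assumes Page.get_content/1 returns the raw HTML body; a page struct cannot be written directly
File.write!("google.html", Page.get_content(page))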

Now that we know the name of the form, let's fetch it off the page:

form = Page.form_with(page, name: "f")

Now let's set the field named 'q' on the form to 'elixir mechanize'. The updated form is returned, so we rebind it:

form = Form.fill_text(form, name: "q", with: "elixir mechanize")

Now we can submit the form by 'pressing' the submit button, which gives us back the results page:

Form.click_button!(form, text: "Google Search")

What we just did was equivalent to putting text in the search field and clicking the 'Google Search' button.

Another way to do that is to type in the text field and press the Return key. We can simulate that too, by using the submit! function instead of click_button!:

Form.submit!(form)

Let's take a look at the code all together:

alias Mechanize.{Browser, Page, Form}

b =
  Browser.new(follow_meta_refresh: true)
  |> Browser.put_user_agent(:mac_safari)

b
|> Browser.get!("https://www.google.com")
|> Page.form_with(name: "f")
|> Form.fill_text(name: "q", with: "elixir mechanize")
|> Form.submit!() # or, in this pipeline: Form.click_button!(text: "Google Search")

Before we go on to screen scraping, let's take a look at forms a little more in depth. Unless you want to skip ahead!

Advanced Form techniques

In this section, I want to touch on the different types of input fields a form can contain. Password and textarea fields can be treated just like text input fields. Select fields are very similar to text fields, but they have many options associated with them. If you select one option, Mechanize will deselect the other options (unless it is a multi-select!).

For example, let's select an option with text "Option 1" on a select with name="select1".

Form.select(form, name: "select1", option: "Option 1")

We can also select an option by an attribute. In this case we'll select by the value attribute:

Form.select(form, name: "select1", option: [value: "1"])

Or select the third option of a select (note that Mechanize uses a zero-based index):

Form.select(form, name: "select1", option: 2)

Now let's take a look at checkboxes and radio buttons. To select a checkbox, just check it like this:

Form.check_checkbox(form, name: "box", value: "yes")

Radio buttons are very similar to checkboxes, but they know how to uncheck other radio buttons of the same name. Just check a radio button like you would a checkbox:

Form.check_radio_button(form, name: "box", value: "yes")
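
The field helpers above can be combined in a single pipeline, assuming each of them returns the updated form in the same way Form.fill_text does in the earlier example. This is a sketch under that assumption: the SurveyForm module and the field names "comment" and "color" are hypothetical.

```elixir
defmodule SurveyForm do
  alias Mechanize.Form

  # Applies several field changes to `form` and returns the result.
  # Every field name and value below is made up for illustration.
  def fill(form) do
    form
    |> Form.fill_text(name: "comment", with: "Hello")
    |> Form.select(name: "select1", option: "Option 1")
    |> Form.check_checkbox(name: "box", value: "yes")
    |> Form.check_radio_button(name: "color", value: "blue")
  end
end
```

The result of SurveyForm.fill(form) could then be passed to Form.submit!/1 as in the Google example.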

Scraping Data

After you have used Mechanize to navigate to the page that you need to scrape, scrape it using the Page.search/2 function:

browser
|> Browser.get!("http://example.com/")
|> Page.search("p.posted")

Example

Google (Print results from SERP)

alias Mechanize.{Browser, Page, Form}
alias Mechanize.Page.Element

b =
  Browser.new(follow_meta_refresh: true)
  |> Browser.put_user_agent(:mac_safari)

initial_page = Browser.get!(b, "https://google.com")

serp =
  initial_page
  |> Page.form_with(name: "f")
  |> Form.fill_text(name: "q", with: "elixir mechanize")
  |> Form.submit!()

serp
|> Page.search(".kCrYT > a .vvjwJb") # Selects each search result element
|> Enum.map(&Element.text/1) # Extracts search result title text
|> Enum.with_index(1)
|> Enum.each(fn {result, index} -> IO.puts("#{index}. #{result}") end)

Author

Copyright © 2020 by Gustavo Honorato (gustavohonorato@gmail.com)

License

This library is distributed under the MIT license. Please see the LICENSE file.
