An Elixir package to scrape websites. This is an attempt to rewrite meteor-scrape from scratch, leveraging the expressiveness and power of Elixir. Current features:
- can handle non-UTF-8 sources.
- can deal with timezones.
- can parse RSS/Atom feeds.
- can parse common websites.
- can parse advanced content websites ("articles").
Add scrape to your mixfile:

{:scrape, "~> 1.2"}

and add :scrape, :floki, :parallel, :timex to your applications list in your mixfile.
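Put together, a minimal mix.exs might look like the following sketch (MyApp and :my_app are placeholder names, and the project version is illustrative):

defmodule MyApp.Mixfile do
  use Mix.Project

  def project do
    [app: :my_app,
     version: "0.0.1",
     deps: deps()]
  end

  def application do
    # :scrape and its runtime dependencies must be started with your app
    [applications: [:logger, :scrape, :floki, :parallel, :timex]]
  end

  defp deps do
    [{:scrape, "~> 1.2"}]
  end
end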
# Feed scraping:
Scrape.feed "http://feeds.feedburner.com/venturebeat/SZYF"
# Result (list of items):
[
  %{
    description: "GUEST: For years, many have believed the startup world would be doomed by the “Series A Crunch,” the natural result of an explosion of seed funding paired with an increasingly high bar required to earn a Series A. Industry observers believed we’d be witnessing a train wreck of epic proportions as companies died off. But the […]",
    image: "http://i1.wp.com/venturebeat.com/wp-content/uploads/2015/11/seed-extensions.jpg?resize=160%2C140",
    pubdate: #<DateTime(2015-11-07T22:10:33Z)>,
    tags: [
      %{accuracy: 0.9, name: "micah rosenbloom"},
      %{accuracy: 0.9, name: "deals"},
      %{accuracy: 0.9, name: "seed funding"},
      %{accuracy: 0.9, name: "series a crunch"},
      %{accuracy: 0.9, name: "business"}
    ],
    title: "Why seed ‘extensions’ are becoming the new normal in fundraising",
    url: "http://venturebeat.com/2015/11/07/why-seed-extensions-are-becoming-the-new-normal-in-fundraising/"
  },
  %{...},
  ...
]
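Since the result is a plain list of maps, the usual Enum functions apply directly. A small sketch that collects all reasonably confident tag names across the feed (the 0.8 cutoff is an arbitrary choice, not part of the library):

items = Scrape.feed "http://feeds.feedburner.com/venturebeat/SZYF"

# gather the names of all tags with accuracy >= 0.8, without duplicates
items
|> Enum.flat_map(fn item -> item.tags end)
|> Enum.filter(fn tag -> tag.accuracy >= 0.8 end)
|> Enum.map(fn tag -> tag.name end)
|> Enum.uniq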
# Scrape a website:
Scrape.website "http://www.latimes.com"
# Result (basic metadata):
%Scrape.Website{
  description: "The LA Times is a leading source of breaking news, entertainment, sports, politics, and more for Southern California and the world.",
  favicon: "http://www.trbas.com/jive/prod/common/images/lanews-apple-touch-icon.1q2w3_9ffdb679907f116af126c65ff1edb27a.png",
  feeds: ["http://www.latimes.com/rss2.0.xml"],
  image: nil,
  tags: [
    %{accuracy: 0.9, name: "california"},
    %{accuracy: 0.9, name: "california news"},
    %{accuracy: 0.9, name: "lakers coverage"},
    %{accuracy: 0.9, name: "west coast news"},
    ...
  ],
  title: "Los Angeles Times - California, national and world news - Los Angeles Times",
  url: "http://www.latimes.com/"
}
# Scrape an article (aka "content website"):
Scrape.article "http://www.bbc.com/news/world-europe-34753464"
# Result
%Scrape.Article{
  description: "The Russian plane crash in Egypt was not due to technical failures, say French aviation officials, adding that the flight data recorder suggests a \"violent, sudden\" explosion.",
  favicon: "http://static.bbci.co.uk/news/1.96.1453/apple-touch-icon.png",
  fulltext: "Other French officials said the flight data recorder suggested a \"violent, sudden\" explosion caused the crash, killing all 224 people on board.\n\nThe Metrojet Airbus A321 was flying [...shortened...]",
  image: "http://ichef.bbci.co.uk/news/1024/cpsprodpb/A4F2/production/_86562224_86562223.jpg",
  tags: [
    %{accuracy: 0.7628205128205128, name: "french"},
    %{accuracy: 0.6730769230769231, name: "technical"},
    %{accuracy: 0.6730769230769231, name: "plane"},
    %{accuracy: 0.5384615384615385, name: "bbc"},
    %{accuracy: 0.40384615384615385, name: "newsrussian"},
    %{accuracy: 0.358974358974359, name: "flight"},
    %{accuracy: 0.358974358974359, name: "egypt"},
    %{accuracy: 0.3141025641025641, name: "russian"},
    %{accuracy: 0.3141025641025641, name: "data"},
    %{accuracy: 0.3141025641025641, name: "recorder"},
    ...
  ],
  title: "Russian plane crash: French 'rule out technical failure' - BBC News",
  url: "http://www.bbc.com/news/world-europe-34753464"
}
# Scrape a feed and return only its item URLs:
Scrape.feed "http://example.com/feed", :minimal
# Result
["url1", "url2", ...]
LGPLv3. Use this library however you want, but I'd like improvements and bugfixes to flow back into this package.