GongSong / pulsarr

Scrape web data at scale, completely and accurately, with high-performance, distributed RPA.

What is PulsarR?

English | 简体中文

Introduction

PulsarR is the ultimate open source solution to scrape web data at scale.

PulsarR [ˈpʌlsɑr] is short for Pulsar Radiation. When there is no risk of confusion with other projects, it may be referred to simply as Pulsar.

Extracting Web data at scale is extremely hard. Websites change frequently and are becoming more complex, which means that collected web data is often inaccurate or incomplete. PulsarR has developed a range of cutting-edge technologies to solve this problem.

We have released complete solutions for site-wide Web scraping of some of the largest e-commerce websites. These solutions meet the highest standards of performance, quality and cost, and they will remain free and open source forever.

PulsarR supports high-quality, large-scale Web data collection and processing. We have developed a range of infrastructure and cutting-edge technologies to ensure the highest standards of performance, quality and TCO (total cost of ownership), even in very large-scale data collection scenarios.

PulsarR supports the Network-As-A-Database paradigm. PulsarR treats the external network as a database. If the required data is not in the local storage, or the existing version does not meet the analysis needs, the system will collect the latest version of the data from the Internet. We also developed X-SQL to query the Web directly and convert webpages into tables and charts.
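
In practice, a load call behaves like a cache-aside database read. Here is a minimal sketch of the session API shown in Basic usage below; the URL and expiry argument are illustrative:

Kotlin:

// Read the page from local storage if the copy is younger than 1 day;
// otherwise fetch the latest version from the Internet
val page = PulsarContexts.createSession().load("https://www.amazon.com/dp/B09V3KXJPB", "-expires 1d")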

PulsarR supports browser rendering as the primary method to collect Web data. By using browser rendering as the primary method to collect Web data, we achieve an optimal balance between data point scale, data quality, labor cost and hardware cost, and achieve the lowest TCO (total cost of ownership). With optimizations such as blocking unnecessary resource files, the performance of browser rendering can even be comparable to the traditional single resource collection method.

PulsarR supports RPA-based Web scraping. PulsarR includes an RPA subsystem for Web interaction: scrolling, typing, screen capturing, dragging and dropping, clicking, and so on. This subsystem is similar to the well-known Selenium, Playwright and Puppeteer, but all behaviors are optimized: more realistic simulation, better execution performance, better parallelism, better fault tolerance, and so on.

PulsarR supports single resource collection. PulsarR's default data collection method is to harvest complete Web data through browser rendering, but if the data you need can be retrieved through a single link, for example through an AJAX interface, you can also call PulsarR's resource collection method for high-speed collection.
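
For example, a minimal sketch based on the `-resource` load option documented below (the URL is illustrative):

Kotlin:

fun main() {
    val session = PulsarContexts.createSession()
    // fetch the URL as a single resource, skipping browser rendering
    val page = session.load("https://www.amazon.com/dp/B09V3KXJPB", "-resource")
    println(page.url)
}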

PulsarR plans to support cutting-edge information extraction technology. We plan to release an advanced AI to automatically extract every field from all valuable webpages (e.g. product detail pages) with remarkable accuracy, and we currently offer a preview version.

Features

  • Web spider: browser rendering, ajax data crawling

  • RPA: robotic process automation, mimic human behaviors, SPA crawling, or do something else valuable

  • Simple API: single line of code to scrape, or single SQL to turn a website into a table

  • X-SQL: extended SQL to manage web data: Web crawling, scraping, Web content mining, Web BI

  • Bot stealth: web driver stealth, IP rotation, privacy context rotation, never get banned

  • High performance: highly optimized, rendering hundreds of pages in parallel on a single machine without being blocked

  • Low cost: scraping 100,000 browser-rendered e-commerce webpages, or n * 10,000,000 data points, each day requires only an 8-core CPU and 32 GB of memory

  • Data quantity assurance: smart retry, accurate scheduling, web data lifecycle management

  • Large scale: fully distributed, designed for large scale crawling

  • Big data: various backend storage support: Local File/MongoDB/HBase/Gora

  • Logs & metrics: monitored closely and every event is recorded

  • [Preview] Information Extraction: Learns Web data patterns and automatically extracts every field in a webpage with remarkable precision

Get started

Most scraping attempts can start with (almost) a single line of code:

Kotlin:

fun main() = PulsarContexts.createSession().scrapeOutPages(
  "https://www.amazon.com/", "-outLink a[href~=/dp/]", listOf("#title", "#acrCustomerReviewText"))

The code above scrapes the fields specified by the CSS selectors #title and #acrCustomerReviewText from a set of product pages. Example code: kotlin.

Most real-world crawl projects can start with the following code snippet:

Kotlin:

fun main() {
    val context = PulsarContexts.create()

    val parseHandler = { _: WebPage, document: FeaturedDocument ->
        // use the document
        // ...
        // and then extract further hyperlinks
        context.submitAll(document.selectHyperlinks("a[href~=/dp/]"))
    }
    val urls = LinkExtractors.fromResource("seeds10.txt")
        .map { ParsableHyperlink("$it -refresh", parseHandler) }
    context.submitAll(urls).await()
}

Example code: kotlin, java.

The most complicated crawl challenges can start with RPA:

Kotlin:

val options = session.options(args)
val event = options.event.browseEvent
event.onBrowserLaunched.addLast { page, driver ->
    // warm up the browser to avoid being blocked by the website,
    // or to set global preferences, such as your location.
    warmUpBrowser(page, driver)
}
event.onWillFetch.addLast { page, driver ->
    // we may have to visit a referrer page before visiting the desired page
    waitForReferrer(page, driver)
    // websites may prevent us from opening too many pages at a time, so we should open links one by one.
    waitForPreviousPage(page, driver)
}
event.onWillCheckDocumentState.addLast { page, driver ->
    // wait for a special field to appear on the page
    driver.waitForSelector("body h1[itemprop=name]")
    // close the mask layer; it might be a promotion, an ad, or something else
    driver.click(".mask-layer-close-button")
}
// visit the URL and trigger events
session.load(url, options)

Example code: kotlin.

Core concepts

The core Pulsar concepts include the following. Once you know these core concepts, you can use PulsarR to solve the most demanding data scraping tasks:

  • Web Scraping: the process of using bots to extract content and data from a website

  • Auto Extract: learn the data schema automatically and extract every field from webpages, powered by cutting-edge AI algorithms

  • RPA: stands for robotic process automation, which is often the only way to scrape modern webpages

  • Network As A Database: access the network just like a database

  • X-SQL: query the Web using SQL directly

  • Pulsar Session: provides a set of simple, powerful and flexible APIs to do web scraping tasks

  • Web Driver: defines a concise interface to visit and interact with webpages, all behaviors are optimized to mimic real people as closely as possible

  • URLs: a URL in Pulsar is a normal URL with extra information to describe a task. Every task in Pulsar is defined as some form of URL (see the sketch below)

  • Hyperlinks: a Hyperlink in Pulsar is a normal Hyperlink with extra information to describe a task

  • Load Options: load options, or load arguments are control parameters that affect how Pulsar loads, fetches and crawls webpages

  • Event Handlers: capture and process events throughout the lifecycle of webpages

Check Pulsar concepts for details.
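
Below is a quick sketch that combines the URL, Hyperlink and load-option concepts; it reuses only APIs that appear in the examples in this document, and the URL is illustrative:

Kotlin:

// a configured URL: a normal URL plus extra information describing the task
val configuredUrl = "https://www.amazon.com/dp/B09V3KXJPB -expires 1d -parse"

// a hyperlink that carries a parse handler, as in the continuous crawl examples
val link = ParsableHyperlink(configuredUrl) { _: WebPage, document: FeaturedDocument ->
    println(document.title)
}
PulsarContexts.create().submitAll(listOf(link))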

Usage I: experience Pulsar with an executable jar

We have released a standalone executable jar based on PulsarR, which includes:

  • Web scraping examples of a set of top sites

  • An applet based on self-supervised machine learning for information extraction: the AI identifies all fields on a detail page, and over 90% of the fields are extracted with an accuracy of 99.9% or more

  • An applet based on self-supervised machine learning that outputs all extraction rules, which can improve traditional Web scraping methods

  • An applet that scrapes Web data directly from the command line, like wget or curl, without writing code

  • An upgraded Pulsar server to which we can send X-SQL queries to collect Web data

  • A Web UI from which we can write SQLs and send them to the server

Download Exotic and explore its capabilities with a single command line:

java -jar exotic-standalone.jar

Usage II: use Pulsar as a library

The simplest way to leverage the power of Pulsar is to add it to your project as a library.

Maven:

<dependency>
  <groupId>ai.platon.pulsar</groupId>
  <artifactId>pulsar-all</artifactId>
  <version>1.10.9</version>
</dependency>

Gradle:

implementation("ai.platon.pulsar:pulsar-all:1.10.9")

You can also clone the template project from github.com: kotlin, java-11, java-17.

Basic usage

Kotlin:

// Create a pulsar session
val session = PulsarContexts.createSession()
// The main url we are playing with
val url = "https://www.amazon.com/dp/B09V3KXJPB"

// Load a page from local storage, or fetch it from the Internet if it does not exist or has expired
val page = session.load(url, "-expires 10s")

// Submit a url to the URL pool; the submitted url will be processed in a crawl loop
session.submit(url, "-expires 10s")

// Parse the page content into a document
val document = session.parse(page)
// do something with the document
// ...

// Load and parse
val document2 = session.loadDocument(url, "-expires 10s")
// do something with the document
// ...

// Load the portal page and then load all links specified by `-outLink`.
// Option `-outLink` specifies the cssSelector to select links in the portal page to load.
// Option `-topLinks` specifies the maximal number of links selected by `-outLink`.
val pages = session.loadOutPages(url, "-expires 10s -itemExpires 10s -outLink a[href~=/dp/] -topLinks 10")

// Load the portal page and submit the out links specified by `-outLink` to the URL pool.
// Option `-outLink` specifies the cssSelector to select links in the portal page to submit.
// Option `-topLinks` specifies the maximal number of links selected by `-outLink`.
session.submitOutPages(url, "-expires 1d -itemExpires 7d -outLink a[href~=/dp/] -topLinks 10")

// Load, parse and scrape fields
val fields = session.scrape(url, "-expires 10s", "#centerCol",
    listOf("#title", "#acrCustomerReviewText"))

// Load, parse and scrape named fields
val fields2 = session.scrape(url, "-i 10s", "#centerCol",
    mapOf("title" to "#title", "reviews" to "#acrCustomerReviewText"))

// Load, parse and scrape named fields
val fields3 = session.scrapeOutPages(url, "-i 10s -ii 10s -outLink a[href~=/dp/] -topLinks 10", "#centerCol",
    mapOf("title" to "#title", "reviews" to "#acrCustomerReviewText"))

// Add `-parse` option to activate the parsing subsystem
val page10 = session.load(url, "-parse -expires 10s")

// Kotlin suspend calls
val page11 = runBlocking { session.loadDeferred(url, "-expires 10s") }

// Java-style async calls
session.loadAsync(url, "-expires 10s").thenApply(session::parse).thenAccept(session::export)

Example code: kotlin, java.

Load options

Most of our scrape methods accept a parameter called load options, or load arguments, which controls how a webpage is loaded, fetched and scraped.

-expires     // The expiry time of a page
-itemExpires // The expiry time of item pages in batch scraping methods
-outLink     // The selector of out links to scrape
-refresh     // Force (re)fetch the page, just like hitting the refresh button on a real browser
-parse       // Activate parse subsystem
-resource    // Fetch the url as a resource without browser rendering

Check Load Options for details.
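
These options compose in a single argument string. Here are a couple of hedged examples that use only the options listed above:

Kotlin:

// force a fresh fetch, then activate the parsing subsystem
val page = session.load(url, "-refresh -parse")
// fetch the url as a plain resource, treating copies older than 1 day as expired
val resource = session.load(url, "-resource -expires 1d")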

Extracting web data

Pulsar uses jsoup to extract data from HTML documents. Jsoup parses HTML to the same DOM as modern browsers do. Check selector-syntax for all the supported CSS selectors.

Kotlin:

val document = session.loadDocument(url, "-expires 1d")
val price = document.selectFirst(".price").text()
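
Since this is standard jsoup under the hood, the same selector syntax can be tried against any HTML snippet with jsoup alone (a standalone sketch with illustrative markup):

Kotlin:

import org.jsoup.Jsoup

fun main() {
    val html = "<div class=\"price\">USD 12.99</div>"
    val doc = Jsoup.parse(html)
    // selectFirst returns null when nothing matches, so handle it defensively
    val price = doc.selectFirst(".price")?.text()
    println(price) // USD 12.99
}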

Continuous crawls

It’s really simple to scrape a massive URL collection or to run continuous crawls in Pulsar.

Kotlin:

fun main() {
    val context = PulsarContexts.create()

    val parseHandler = { _: WebPage, document: FeaturedDocument ->
        // do something wonderful with the document
        println(document.title + "\t|\t" + document.baseUri)
    }
    val urls = LinkExtractors.fromResource("seeds.txt")
        .map { ParsableHyperlink("$it -refresh", parseHandler) }
    context.submitAll(urls)
    // feel free to submit millions of urls here
    context.submitAll(urls)
    // wait until all tasks are done
    context.await()
}

Java:

public class ContinuousCrawler {

    private static void onParse(WebPage page, FeaturedDocument document) {
        // do something wonderful with the document
        System.out.println(document.getTitle() + "\t|\t" + document.getBaseUri());
    }

    public static void main(String[] args) {
        PulsarContext context = PulsarContexts.create();

        List<Hyperlink> urls = LinkExtractors.fromResource("seeds.txt")
                .stream()
                .map(seed -> new ParsableHyperlink(seed, ContinuousCrawler::onParse))
                .collect(Collectors.toList());
        context.submitAll(urls);
        // feel free to submit millions of urls here
        context.submitAll(urls);
        // wait until all tasks are done
        context.await();
    }
}

Example code: kotlin, java.

RPA (Robotic Process Automation)

As websites become more and more complicated, RPA has become the only way to collect data from some websites, such as websites that use custom font technology.

Pulsar provides a convenient way to mimic real people throughout the lifecycle of a webpage, using a web driver to interact with the page: scrolling, typing, screen capturing, dragging and dropping, clicking and more. All actions and behaviors are optimized to mimic real people as closely as possible.

Here is a typical RPA code snippet of the kind required to collect data from most top e-commerce sites.

Kotlin:

val options = session.options(args)
val event = options.event.browseEvent
event.onBrowserLaunched.addLast { page, driver ->
    // warm up the browser to avoid being blocked by the website,
    // or to set global preferences, such as your location.
    warmUpBrowser(page, driver)
}
event.onWillFetch.addLast { page, driver ->
    // we may have to visit a referrer page before visiting the desired page
    waitForReferrer(page, driver)
    // websites may prevent us from opening too many pages at a time, so we should open links one by one.
    waitForPreviousPage(page, driver)
}
event.onWillCheckDocumentState.addLast { page, driver ->
    // wait for a special field to appear on the page
    driver.waitForSelector("body h1[itemprop=name]")
    // close the mask layer; it might be a promotion, an ad, or something else
    driver.click(".mask-layer-close-button")
}
// visit the URL and trigger events
session.load(url, options)

Example code: kotlin.

Use X-SQL to query the web

Scrape a single page:

select
      dom_first_text(dom, '#productTitle') as title,
      dom_first_text(dom, '#bylineInfo') as brand,
      dom_first_text(dom, '#price tr td:matches(^Price) ~ td, #corePrice_desktop tr td:matches(^Price) ~ td') as price,
      dom_first_text(dom, '#acrCustomerReviewText') as ratings,
      str_first_float(dom_first_text(dom, '#reviewsMedley .AverageCustomerReviews span:contains(out of)'), 0.0) as score
  from load_and_select('https://www.amazon.com/dp/B09V3KXJPB -i 1s -njr 3', 'body');

Execute the X-SQL:

// `sql` holds the X-SQL statement above
val context = SQLContexts.create()
val rs = context.executeQuery(sql)
println(ResultSetFormatter(rs, withHeader = true))

The result is as follows:

TITLE                                                   | BRAND                  | PRICE   | RATINGS       | SCORE
HUAWEI P20 Lite (32GB + 4GB RAM) 5.84" FHD+ Display ... | Visit the HUAWEI Store | $1.9.11 | 1,349 ratings | 4.40

Example code: kotlin.

Click X-SQL to see a detailed introduction and function descriptions about X-SQL.

Usage III: run Pulsar as a REST service

When Pulsar runs as a REST service, X-SQL can be used to scrape webpages, or to query web data directly, at any time and from anywhere, without opening an IDE.

Build from source

git clone https://github.com/platonai/pulsar.git
cd pulsar && bin/build-run.sh

For Chinese developers, we strongly suggest following this instruction to speed up the build.

Use X-SQL to query the web

Start the pulsar server if not started:

bin/pulsar

Scrape a webpage in another terminal window:

bin/scrape.sh

The bash script is quite simple: it just uses curl to post an X-SQL query:

curl -X POST --location "http://localhost:8182/api/x/e" -H "Content-Type: text/plain" -d "
  select
      dom_base_uri(dom) as url,
      dom_first_text(dom, '#productTitle') as title,
      str_substring_after(dom_first_href(dom, '#wayfinding-breadcrumbs_container ul li:last-child a'), '&node=') as category,
      dom_first_slim_html(dom, '#bylineInfo') as brand,
      cast(dom_all_slim_htmls(dom, '#imageBlock img') as varchar) as gallery,
      dom_first_slim_html(dom, '#landingImage, #imgTagWrapperId img, #imageBlock img:expr(width > 400)') as img,
      dom_first_text(dom, '#price tr td:contains(List Price) ~ td') as listprice,
      dom_first_text(dom, '#price tr td:matches(^Price) ~ td') as price,
      str_first_float(dom_first_text(dom, '#reviewsMedley .AverageCustomerReviews span:contains(out of)'), 0.0) as score
  from load_and_select('https://www.amazon.com/dp/B09V3KXJPB -i 1d -njr 3', 'body');"

Example code: bash, batch, java, kotlin, php.

The response is in JSON format, as follows:

{
    "uuid": "cc611841-1f2b-4b6b-bcdd-ce822d97a2ad",
    "statusCode": 200,
    "pageStatusCode": 200,
    "pageContentBytes": 1607636,
    "resultSet": [
        {
            "title": "Tara Toys Ariel Necklace Activity Set - Amazon Exclusive (51394)",
            "listprice": "$19.99",
            "price": "$12.99",
            "categories": "Toys & Games|Arts & Crafts|Craft Kits|Jewelry",
            "baseuri": "https://www.amazon.com/dp/B09V3KXJPB"
        }
    ],
    "pageStatus": "OK",
    "status": "OK"
}

Click X-SQL to see a detailed introduction and function descriptions about X-SQL.

Step-by-step course

Logs & Metrics

Pulsar has carefully designed the logging and metrics subsystem to record every event that occurs in the system.

Pulsar logs the status of every load execution, so it is easy to know what happened in the system and to answer questions such as: Is the system running healthily? How many pages were fetched successfully? How many pages were retried? How many proxy IPs were used?

By paying attention to just a few symbols, you can gain insight into the state of the entire system: 💯 💔 🗙 ⚡ 💿 🔃 🤺.

Typical page loading logs are shown below; check log-format to learn how to read the logs and gain insight into the state of the entire system at a glance.

2022-09-24 11:46:26.045  INFO [-worker-14] a.p.p.c.c.L.Task - 3313. 💯 ⚡ U for N got 200 580.92 KiB in 1m14.277s, fc:1 | 75/284/96/277/6554 | 106.32.12.75 | 3xBpaR2 | https://www.walmart.com/ip/Restored-iPhone-7-32GB-Black-T-Mobile-Refurbished/329207863 -expires PT24H -ignoreFailure -itemExpires PT1M -outLinkSelector a[href~=/ip/] -parse -requireSize 300000
2022-09-24 11:46:09.190  INFO [-worker-32] a.p.p.c.c.L.Task - 3738. 💯 💿 U  got 200 452.91 KiB in 55.286s, last fetched 9h32m50s ago, fc:1 | 49/171/82/238/6172 | 121.205.220.179 | https://www.walmart.com/ip/Boost-Mobile-Apple-iPhone-SE-2-Cell-Phone-Black-64GB-Prepaid-Smartphone/490934488 -expires PT24H -ignoreFailure -itemExpires PT1M -outLinkSelector a[href~=/ip/] -parse -requireSize 300000
2022-09-24 11:46:28.567  INFO [-worker-17] a.p.p.c.c.L.Task - 2269. 💯 🔃 U for SC got 200 565.07 KiB <- 543.41 KiB in 1m22.767s, last fetched 16m58s ago, fc:6 | 58/230/98/295/6272 | 27.158.125.76 | 9uwu602 | https://www.walmart.com/ip/Straight-Talk-Apple-iPhone-11-64GB-Purple-Prepaid-Smartphone/356345388?variantFieldId=actual_color -expires PT24H -ignoreFailure -itemExpires PT1M -outLinkSelector a[href~=/ip/] -parse -requireSize 300000
2022-09-24 11:47:18.390  INFO [r-worker-8] a.p.p.c.c.L.Task - 3732. 💔 ⚡ U for N got 1601 0 <- 0 in 32.201s, fc:1/1 Retry(1601) rsp: CRAWL, rrs: EMPTY_0B | 2zYxg52 | https://www.walmart.com/ip/Apple-iPhone-7-256GB-Jet-Black-AT-T-Locked-Smartphone-Grade-B-Used/182353175?variantFieldId=actual_color -expires PT24H -ignoreFailure -itemExpires PT1M -outLinkSelector a[href~=/ip/] -parse -requireSize 300000
2022-09-24 11:47:13.860  INFO [-worker-60] a.p.p.c.c.L.Task - 2828. 🗙 🗙 U for SC got 200 0 <- 348.31 KiB <- 684.75 KiB in 0s, last fetched 18m55s ago, fc:2 | 34/130/52/181/5747 | 60.184.124.232 | 11zTa0r2 | https://www.walmart.com/ip/Walmart-Family-Mobile-Apple-iPhone-11-64GB-Black-Prepaid-Smartphone/209201965?athbdg=L1200 -expires PT24H -ignoreFailure -itemExpires PT1M -outLinkSelector a[href~=/ip/] -parse -requireSize 300000

Requirements

  • Memory 4G+

  • Maven 3.2+

  • The latest version of the Java 11 JDK

  • java and jar on the PATH

  • Google Chrome 90+

Pulsar is tested on Ubuntu 18.04, Ubuntu 20.04, Windows 7, Windows 11 and WSL; any other operating system that meets the requirements should work as well.

Advanced topics

Check advanced topics to find answers to the following questions:

  • What’s so difficult about scraping web data at scale?

  • How to scrape a million product pages from an e-commerce website a day?

  • How to scrape pages behind a login?

  • How to download resources directly within a browser context?

  • How to scrape a single page application (SPA)?

    • Resource mode

    • RPA mode

  • How to make sure all fields are extracted correctly?

  • How to crawl paginated links?

  • How to crawl newly discovered links?

  • How to crawl the entire website?

  • How to simulate human behaviors?

  • How to schedule priority tasks?

  • How to start a task at a fixed time point?

  • How to drop a scheduled task?

  • How to know the status of a task?

  • How to know what’s going on in the system?

  • How to automatically generate the CSS selectors for fields to scrape?

  • How to extract content from websites using machine learning automatically with commercial accuracy?

  • How to scrape amazon.com to match industrial needs?

Compare with other solutions

In general, the features mentioned in the Features section are well supported by Pulsar but not by other solutions.

Check solution comparison for a detailed comparison with other solutions:

  • Pulsar vs selenium/puppeteer/playwright

  • Pulsar vs nutch

  • Pulsar vs scrapy+splash

Technical details

Check technical details to find answers to the following questions:

  • How to rotate my ip addresses?

  • How to hide my bot from being detected?

  • How & why to simulate human behaviors?

  • How to render as many pages as possible on a single machine without being blocked?

License

GNU Affero General Public License v3.0

