Web Crawler for PHP


This is a PHP library that takes a starting URL, parses the page's HTML, and extracts the URLs it contains. It then follows those URLs and parses each page in turn until the maximum number of URLs is reached.
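The crawl strategy described above — fetch a page, pull out its links, and follow them until a cap is hit — can be sketched as a simple breadth-first loop. This is an illustrative sketch only, not the library's internals; `fetchHtml()` and the regex-based link extraction are stand-ins for a real HTTP client and HTML parser:

```php
<?php
// Illustrative breadth-first crawl loop (not this library's actual code).

// Fetch a page's HTML; a real implementation would use an HTTP client such as Guzzle.
function fetchHtml(string $url): string
{
    return (string) @file_get_contents($url);
}

function crawl(string $startUrl, int $maxUrls): array
{
    $queue = [$startUrl];           // URLs waiting to be crawled
    $seen  = [$startUrl => true];   // URLs already queued, to avoid revisits
    $pages = [];                    // url => html for every crawled page

    while ($queue !== [] && count($pages) < $maxUrls) {
        $url         = array_shift($queue);
        $html        = fetchHtml($url);
        $pages[$url] = $html;

        // Extract href attributes from anchor tags (crude; a DOM parser is more robust).
        if (preg_match_all('/<a[^>]+href="([^"]+)"/i', $html, $m)) {
            foreach ($m[1] as $link) {
                if (!isset($seen[$link])) {
                    $seen[$link] = true;
                    $queue[]     = $link;
                }
            }
        }
    }

    return $pages; // at most $maxUrls entries
}
```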

Requirements

PHP (the library is distributed through Packagist)

Installation

The recommended way to install this library is through Composer.

composer require mjorgens/web-crawler
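Alternatively, add the package to your project's composer.json directly and run `composer install`. The version constraint below is illustrative; pin whichever release you need:

```json
{
    "require": {
        "mjorgens/web-crawler": "^1.0"
    }
}
```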

Usage

// A PSR-7 Uri implementation is assumed here, e.g. GuzzleHttp\Psr7\Uri.
use GuzzleHttp\Psr7\Uri;
use Mjorgens\Crawler\Crawler;

$repository = new \Mjorgens\Crawler\CrawledRepository\CrawledMemoryRepository(); // In-memory collection of crawled pages
$url = new Uri('https://example.com'); // Starting URL
$maxUrls = 5; // Maximum number of URLs to crawl

// Start the crawler
Crawler::create()
    ->setRepository($repository)
    ->setMaxCrawl($maxUrls)
    ->startCrawling($url);

// Iterate over the crawled pages
foreach ($repository as $page) {
    echo $page->url;
    echo $page->html;
}

About


License: MIT


Languages

Language: PHP (100.0%)