cuonghuynh / crawler

Crawl all links found on a website

Home Page: https://murze.be/2015/11/building-a-crawler-in-php/

Crawl links on a website

This package provides a class to crawl links on a website.

Spatie is a webdesign agency in Antwerp, Belgium. You'll find an overview of all our open source projects on our website.

Postcardware

You're free to use this package (it's MIT-licensed), but if it makes it to your production environment you are required to send us a postcard from your hometown, mentioning which of our package(s) you are using.

Our address is: Spatie, Samberstraat 69D, 2060 Antwerp, Belgium.

The best postcards will get published on the open source page on our website.

Installation

This package can be installed via Composer:

composer require spatie/crawler

Usage

The crawler can be instantiated like this:

Crawler::create()
    ->setCrawlObserver(<implementation of \Spatie\Crawler\CrawlObserver>)
    ->startCrawling($url);

The argument passed to setCrawlObserver must be an object that implements the \Spatie\Crawler\CrawlObserver interface:

/**
 * Called when the crawler will crawl the given url.
 *
 * @param \Spatie\Crawler\Url $url
 */
public function willCrawl(Url $url);

/**
 * Called when the crawler has crawled the given url.
 *
 * @param \Spatie\Crawler\Url       $url
 * @param \Psr\Http\Message\ResponseInterface $response
 */
public function hasBeenCrawled(Url $url, ResponseInterface $response);

/**
 * Called when the crawl has ended.
 */
public function finishedCrawling();
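
As an illustration of the interface above, here is a minimal sketch of an observer that records the HTTP status code of every crawled URL and prints a summary at the end. The class name `StatusCollector` and its `$statuses` property are hypothetical, and the sketch assumes that a `Url` object can be cast to a string; check the `Url` class in your installed version before relying on that.

```php
<?php

use Psr\Http\Message\ResponseInterface;
use Spatie\Crawler\CrawlObserver;
use Spatie\Crawler\Url;

// Hypothetical observer: collects the status code of every crawled URL.
class StatusCollector implements CrawlObserver
{
    /** @var int[] status codes keyed by URL string */
    public $statuses = [];

    public function willCrawl(Url $url)
    {
        // Nothing to prepare in this sketch.
    }

    public function hasBeenCrawled(Url $url, ResponseInterface $response)
    {
        // Assumption: Url can be cast to a string.
        // getStatusCode() is part of the PSR-7 ResponseInterface.
        $this->statuses[(string) $url] = $response->getStatusCode();
    }

    public function finishedCrawling()
    {
        // Print one "status url" line per crawled page.
        foreach ($this->statuses as $url => $status) {
            echo $status . ' ' . $url . PHP_EOL;
        }
    }
}
```

An instance of this class could then be passed to the crawler with `setCrawlObserver(new StatusCollector())`.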

Filtering certain URLs

You can instruct the crawler not to visit certain URLs by using the setCrawlProfile method, which expects an object that implements the Spatie\Crawler\CrawlProfile interface:

/**
 * Set the crawl profile.
 *
 * @param \Spatie\Crawler\CrawlProfile $crawlProfile
 *
 * @return $this
 */
public function setCrawlProfile(CrawlProfile $crawlProfile)
{
    $this->crawlProfile = $crawlProfile;
    return $this;
}
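
A crawl profile that restricts the crawl to a single host might look like the sketch below. It rests on assumptions not shown in this README: that the `CrawlProfile` interface declares a single `shouldCrawl(Url $url)` method returning a boolean, and that a `Url` object can be cast to a string. The class name `SameHostProfile` is hypothetical. Verify the interface in your installed version before using it.

```php
<?php

use Spatie\Crawler\CrawlProfile;
use Spatie\Crawler\Url;

// Hypothetical profile: only crawl URLs that live on one given host.
class SameHostProfile implements CrawlProfile
{
    /** @var string */
    private $host;

    public function __construct($host)
    {
        $this->host = $host;
    }

    // Assumption: the interface method is named shouldCrawl and returns bool.
    public function shouldCrawl(Url $url)
    {
        // Assumption: Url can be cast to a string.
        // parse_url() returns null for the host of a relative URL,
        // so relative URLs are skipped by this sketch.
        return parse_url((string) $url, PHP_URL_HOST) === $this->host;
    }
}
```

It could be registered with `setCrawlProfile(new SameHostProfile('example.com'))` before calling `startCrawling($url)`.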

Changelog

Please see CHANGELOG for more information on what has changed recently.

Contributing

Please see CONTRIBUTING for details.

Security

If you discover any security-related issues, please email freek@spatie.be instead of using the issue tracker.

Credits

License

The MIT License (MIT). Please see License File for more information.
