roach-php / core

The complete web scraping toolkit for PHP.

Home Page: https://roach-php.dev

parse method is called even on response status 404

rabol opened this issue · comments

Describe the bug
The parse() method of a spider is called even if an invalid URL is passed in and the response status code is 404.

Expected behavior
Parsing should be stopped (i.e. parse() should not be called) when the response status is not 200.

Package versions (please complete the following information):

roach-php/laravel v3.1.0

Additional context
Spider class:

<?php

namespace App\Spiders;

use Generator;
use RoachPHP\Downloader\Middleware\RequestDeduplicationMiddleware;
use RoachPHP\Extensions\LoggerExtension;
use RoachPHP\Extensions\StatsCollectorExtension;
use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;
use RoachPHP\Spider\ParseResult;
use Symfony\Component\DomCrawler\Crawler;

class BBCGoodFood extends BasicSpider
{
    public array $startUrls = [
        //
    ];

    public array $downloaderMiddleware = [
        RequestDeduplicationMiddleware::class,
    ];

    public array $spiderMiddleware = [
        //
    ];

    public array $itemProcessors = [
        //
    ];

    public array $extensions = [
        LoggerExtension::class,
        StatsCollectorExtension::class,
    ];

    public int $concurrency = 2;

    public int $requestDelay = 1;

    /**
     * @return Generator<ParseResult>
     */
    public function parse(Response $response): Generator
    {
        yield $this->item([
            'title' => $response->filterXPath('//div[@class="headline post-header__title post-header__title--masthead-layout"]/h1[@class="heading-1"]')->text(),
            'ingredients' => $response->filterXPath('//section[contains(@class, "recipe__ingredients" )]/section/ul[@class="list"]/li')->each(
                fn (Crawler $entry) => $entry->text()),
            'instructions' => $response->filterXPath('//section[contains(@class, "recipe__method-steps" )]/*/*/li')->each(
                fn (Crawler $entry) => $entry->text()),
        ]);
    }
}

The spider is started like this:

        $result = Roach::collectSpider(
            BBCGoodFood::class,
            new Overrides(startUrls: ['https://example.com/does-not-exists']),
        );

This gives you a "The current node list is empty." error, because the XPath filters match nothing on the 404 page and text() is then called on an empty Crawler.
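
Until the framework handles this, a spider can guard its own parse() method against error responses. A minimal sketch of that workaround, assuming RoachPHP's Response exposes the HTTP status code via getStatus():

    public function parse(Response $response): Generator
    {
        // Workaround sketch: skip extraction for non-200 responses so the
        // XPath filters below never run against an error page.
        if ($response->getStatus() !== 200) {
            return;
        }

        yield $this->item([
            // ... same title/ingredients/instructions extraction as above ...
        ]);
    }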

Thanks for pointing this out; this is something I've been meaning to get around to since forever. I just released a new version which adds an HttpErrorMiddleware that automatically drops responses outside the 200-300 range. This middleware is enabled by default if your spider extends BasicSpider (provided you're not overriding the $downloaderMiddleware property).

Here are the relevant docs which also explain the available configuration options: https://roach-php.dev/docs/downloader-middleware#handling-http-errors

Thanks!

Small note: the stub that is used when one creates a spider should not 'override' the BasicSpider properties by default.

For example, in my case I now have to double-check all my spiders, because BasicSpider now has this:

    public array $downloaderMiddleware = [
        RequestDeduplicationMiddleware::class,
        HttpErrorMiddleware::class,
    ];

and my spider created by roach has this:

    public array $downloaderMiddleware = [
        RequestDeduplicationMiddleware::class,
    ];
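
Any spider generated from the old stub therefore has to opt back in to the new behaviour itself, e.g. by adding the middleware to its own list. A minimal sketch; the HttpErrorMiddleware namespace is assumed to match the other downloader middleware:

use RoachPHP\Downloader\Middleware\HttpErrorMiddleware; // namespace assumed
use RoachPHP\Downloader\Middleware\RequestDeduplicationMiddleware;

    public array $downloaderMiddleware = [
        // Keep the deduplication behaviour and add the new error handling back.
        RequestDeduplicationMiddleware::class,
        HttpErrorMiddleware::class,
    ];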