roach-php / core

The complete web scraping toolkit for PHP.

Home Page: https://roach-php.dev

interactive shell vs real code

xciser77 opened this issue

I was trying roach-php in a Laravel project. When I try a filter in the interactive shell, I get the data I want.
But if I use the same filter in my spider file, I don't get the data.

It looks like the interactive shell fetches the remote data differently: if I dd the returned array in Laravel and look at the remote HTML, there is less information available than in the interactive shell.

Can I use config values to get the same results in the real code as in the interactive shell?

@xciser77,
I haven't run into that myself, but perhaps the site renders its content with JavaScript after the initial page load.

Try using ExecuteJavascriptMiddleware in your code:
https://roach-php.dev/docs/downloader-middleware/#executing-javascript

I am trying that. Do I only have to include use RoachPHP\Downloader\Middleware\ExecuteJavascriptMiddleware
in my spider, or do I also have to declare something in the downloader middleware?
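
For reference, here is a minimal sketch of how this is usually wired up, going by the linked docs: importing the class alone does nothing, it also has to be listed in the spider's `$downloaderMiddleware` array. The class name `JavascriptSpider` and the start URL are placeholders for illustration, and the middleware assumes that `spatie/browsershot` (with its Puppeteer dependency) has been installed separately.

```php
<?php

namespace App\Spiders;

use Generator;
use RoachPHP\Downloader\Middleware\ExecuteJavascriptMiddleware;
use RoachPHP\Downloader\Middleware\RequestDeduplicationMiddleware;
use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;

// Hypothetical spider name, used only for illustration.
class JavascriptSpider extends BasicSpider
{
    public array $startUrls = [
        'https://example.com', // placeholder URL
    ];

    // The `use` statement above only imports the class name. The middleware
    // runs only if it is also registered here.
    public array $downloaderMiddleware = [
        RequestDeduplicationMiddleware::class,
        ExecuteJavascriptMiddleware::class,
    ];

    /**
     * @return Generator<\RoachPHP\Spider\ParseResult>
     */
    public function parse(Response $response): Generator
    {
        // Yield plain strings, not Crawler objects.
        yield $this->item([
            'title' => $response->filter('title')->text(),
        ]);
    }
}
```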

Do you have an example repository illustrating this issue? Because the interactive shell uses the same mechanism to download a site's HTML as a spider does.

I've got this link (https://www.douglas.nl/nl/p/5009960042) and I am trying to get the prices.

```php
<?php

namespace App\Spiders;

use Generator;
use RoachPHP\Downloader\Middleware\RequestDeduplicationMiddleware;
use RoachPHP\Downloader\Middleware\ExecuteJavascriptMiddleware;
use RoachPHP\Extensions\LoggerExtension;
use RoachPHP\Extensions\StatsCollectorExtension;
use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;
use RoachPHP\Spider\ParseResult;

class douglasnl extends BasicSpider
{
    public array $startUrls = [
        'https://www.douglas.nl/nl/p/5009960042'
    ];

    public array $downloaderMiddleware = [
        RequestDeduplicationMiddleware::class,
    ];

    public array $spiderMiddleware = [
        //
    ];

    public array $itemProcessors = [
        //
    ];

    public array $extensions = [
        LoggerExtension::class,
        StatsCollectorExtension::class,
    ];

    public int $concurrency = 1;

    public int $requestDelay = 2;

    /**
     * @return Generator<ParseResult>
     */
    public function parse(Response $response): Generator
    {
        $product_id = $response->filterXpath('//link[@rel="canonical"]')->link();
        $prizes = $response->filterXpath('//div[@class="product-detail__variant-row product-detail__variant-row--spread-content"]')->eq(0);

        yield $this->item([
            'product_id' => $product_id,
            'prizes' => $prizes
        ]);
    }
}
```

If I try the same filter ($response->filterXpath('//div[@class="product-detail__variant-row product-detail__variant-row--spread-content"]')->eq(0);) in the interactive shell, I get two results. In my Laravel app I get an error instead ( 0 => "Line 175, Col 44974: No match in entity table for 'Gabbana'")

Roach::startSpider(douglasnl::class); $items = Roach::collectSpider(douglasnl::class);

So I've tested this locally and you actually get back the same result both times. The difference is that the REPL processes the raw HTML a little bit before showing the results.

The issue is that you're yielding the entire Crawler object in your spider instead of just the string contents of the node. So your parse method should look something like this instead:

/**
 * @return Generator<ParseResult>
 */
public function parse(Response $response): Generator
{
    $product_id = $response->filterXpath('//link[@rel="canonical"]')
        ->link()
        // Return the actual URI string instead of the `Link` object.
        ->getUri();

    $prizes = $response
        ->filterXpath('//div[@class="product-detail__variant-row product-detail__variant-row--spread-content"]')
        ->eq(0)
        // Return the actual text contents of the node instead of the entire
        // `Crawler` object.
        ->text();

    yield $this->item([
        'product_id' => $product_id,
        'prizes' => $prizes
    ]);
}
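
To see the difference between a node object and its string contents without running a spider, here is a small standalone sketch. It uses PHP's built-in DOM extension in place of Roach's `Response` and `Crawler`, purely so that it runs without any packages. The HTML snippet, the `example.test` URL, and the price are made up and do not reflect the real page.

```php
<?php

declare(strict_types=1);

// Made-up markup standing in for the product page.
$html = <<<'HTML'
<html>
<head><link rel="canonical" href="https://example.test/p/123"></head>
<body>
<div class="product-detail__variant-row product-detail__variant-row--spread-content">
  <span>50 ml</span> <span>EUR 49.95</span>
</div>
</body>
</html>
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$document = new DOMDocument();
$document->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($document);

// The query returns node objects. This is roughly the stage at which the
// original spider stopped and yielded the object itself.
$node = $xpath
    ->query('//div[@class="product-detail__variant-row product-detail__variant-row--spread-content"]')
    ->item(0);

// Reduce the node to a plain string and collapse runs of whitespace.
$prizes = trim((string) preg_replace('/\s+/', ' ', $node->textContent));

// Read the canonical URL as a string straight from the attribute.
$productId = $xpath->evaluate('string(//link[@rel="canonical"]/@href)');

// The item now contains only scalar values.
$item = [
    'product_id' => $productId,
    'prizes' => $prizes,
];

var_dump($item);
```

The item ends up holding two plain strings, which is the shape the corrected `parse` method above produces with `->getUri()` and `->text()`.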

Another thing is that you should call either Roach::startSpider or Roach::collectSpider but not both since that would actually cause the spider to run twice. Roach::collectSpider already starts the spider.
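
As a rough sketch of that last point, assuming the spider class from this thread lives in `App\Spiders` and the code runs inside a Laravel app with Roach installed, a single call is enough:

```php
<?php

use App\Spiders\douglasnl;
use RoachPHP\Roach;

// collectSpider() runs the spider and returns the scraped items,
// so no separate startSpider() call is needed.
$items = Roach::collectSpider(douglasnl::class);

foreach ($items as $item) {
    // all() returns the item's data as a plain array.
    var_dump($item->all());
}
```

Use `Roach::startSpider` instead when the items are handled entirely by item processors and the caller does not need them returned.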