interactive shell vs real code

Question

xciser77 opened this issue 2 years ago

I was trying roach-php in a Laravel project. When I try a filter in the interactive shell, I get the data I want.
But if I use the same filter in my spider file, I don't get the data.

It looks like the interactive shell fetches the remote data differently: if I dd the returned array in Laravel and look at the remote HTML, there is less information available than in the interactive shell.

Can I use config values to get the same results in the real code as in the interactive shell?

Golubev Alexey · Answer 1 · Sun Oct 02 2022 16:01:15 GMT+0800 (China Standard Time)

@xciser77,
I haven't run into that myself, but perhaps the site renders its content with JavaScript after the initial page load.

Try using ExecuteJavascriptMiddleware in your code:
https://roach-php.dev/docs/downloader-middleware/#executing-javascript

xciser77 · Answer 2 · Sat Oct 08 2022 02:58:41 GMT+0800 (China Standard Time)

I am trying. Do I only have to include use RoachPHP\Downloader\Middleware\ExecuteJavascriptMiddleware
in my spider, or do I also have to declare something in the downloader middleware?
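
For reference, a `use` import on its own only makes the class name available; the middleware takes effect once it is listed in the spider's `$downloaderMiddleware` array. Below is a minimal sketch of that registration, assuming the setup described in the linked docs (to my understanding they also require `spatie/browsershot` and Puppeteer to be installed). The item key `title` and the `//title` XPath are illustrative placeholders, not part of the original spider.

```php
<?php

namespace App\Spiders;

use Generator;
use RoachPHP\Downloader\Middleware\ExecuteJavascriptMiddleware;
use RoachPHP\Downloader\Middleware\RequestDeduplicationMiddleware;
use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;

class douglasnl extends BasicSpider
{
    public array $startUrls = [
        'https://www.douglas.nl/nl/p/5009960042',
    ];

    // Listing the class here is what registers the middleware;
    // the `use` statement above only imports the name.
    public array $downloaderMiddleware = [
        RequestDeduplicationMiddleware::class,
        ExecuteJavascriptMiddleware::class,
    ];

    public function parse(Response $response): Generator
    {
        // Placeholder item: yields the text of the page's <title> element.
        yield $this->item([
            'title' => $response->filterXPath('//title')->text(),
        ]);
    }
}
```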

Kai Sassnowski · Answer 3 · Sat Oct 08 2022 18:29:51 GMT+0800 (China Standard Time)

Do you have an example repository illustrating this issue? Because the interactive shell uses the same mechanism to download a site's HTML as a spider does.

xciser77 · Answer 4 · Sat Oct 08 2022 20:43:24 GMT+0800 (China Standard Time)

I've got this link (https://www.douglas.nl/nl/p/5009960042) and I am trying to get the prices.

`<?php

namespace App\Spiders;

use Generator;
use RoachPHP\Downloader\Middleware\RequestDeduplicationMiddleware;
use RoachPHP\Downloader\Middleware\ExecuteJavascriptMiddleware;
use RoachPHP\Extensions\LoggerExtension;
use RoachPHP\Extensions\StatsCollectorExtension;
use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;
use RoachPHP\Spider\ParseResult;

class douglasnl extends BasicSpider
{
public array $startUrls = [
    'https://www.douglas.nl/nl/p/5009960042'
];

public array $downloaderMiddleware = [
    RequestDeduplicationMiddleware::class,
];

public array $spiderMiddleware = [
    //
];

public array $itemProcessors = [
    //
];

public array $extensions = [
    LoggerExtension::class,
    StatsCollectorExtension::class,
];

public int $concurrency = 1;

public int $requestDelay = 2;

/**
 * @return Generator<ParseResult>
 */
public function parse(Response $response): Generator
{
  
    $product_id = $response->filterXpath('//link[@rel="canonical"]')->link();
    $prizes = $response->filterXpath('//div[@class="product-detail__variant-row product-detail__variant-row--spread-content"]')->eq(0);

    yield $this->item([
        'product_id' => $product_id,
        'prizes' => $prizes
    ]);
}

}
`

If I try the filterXpath call ($response->filterXpath('//div[@class="product-detail__variant-row product-detail__variant-row--spread-content"]')->eq(0);) in the interactive shell, I get two results. In my Laravel app I get an error instead ( 0 => "Line 175, Col 44974: No match in entity table for 'Gabbana'")

Roach::startSpider(douglasnl::class); $items = Roach::collectSpider(douglasnl::class);

Kai Sassnowski · Answer 5 · Wed Nov 02 2022 18:10:26 GMT+0800 (China Standard Time)

So I've tested this locally and you actually get back the same result both times. The difference is that the REPL processes the raw HTML a little bit before showing the results.

The issue is that you're yielding the entire Crawler object in your spider instead of just the string contents of the node. So your parse method should look something like this instead:

/**
 * @return Generator<ParseResult>
 */
public function parse(Response $response): Generator
{
    $product_id = $response->filterXpath('//link[@rel="canonical"]')
        ->link()
        // Return the actual URI string instead of the `Link` object.
        ->getUri();

    $prizes = $response
        ->filterXpath('//div[@class="product-detail__variant-row product-detail__variant-row--spread-content"]')
        ->eq(0)
        // Return the actual text contents of the node instead of the entire
        // `Crawler` object.
        ->text();

    yield $this->item([
        'product_id' => $product_id,
        'prizes' => $prizes
    ]);
}
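
To see the difference between a `Crawler` object and its text contents in isolation, here is a small standalone sketch that uses Symfony's DomCrawler component directly (the same `Crawler` class mentioned above). It assumes `symfony/dom-crawler` is installed via Composer in the current directory; the HTML snippet and the `variant-row` class name are made up for illustration.

```php
<?php

declare(strict_types=1);

// Assumption: `composer require symfony/dom-crawler` was run in this directory.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup standing in for the real product page.
$html = <<<'HTML'
<html><body>
  <div class="variant-row">EUR 49.95</div>
  <div class="variant-row">EUR 79.95</div>
</body></html>
HTML;

$crawler = new Crawler($html);

// eq(0) narrows the match to the first node but still returns a Crawler,
// i.e. an object wrapping DOM nodes, not a string.
$node = $crawler->filterXPath('//div[@class="variant-row"]')->eq(0);

// text() extracts the node's text contents as a plain string,
// which is what should be yielded in an item.
$text = $node->text();

var_dump($node instanceof Crawler); // bool(true)
var_dump($text);
```

Yielding `$text` rather than `$node` keeps the item data down to plain scalar values.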

Another thing is that you should call either Roach::startSpider or Roach::collectSpider but not both since that would actually cause the spider to run twice. Roach::collectSpider already starts the spider.
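
As a rough sketch of the single-call version, assuming it runs somewhere inside the Laravel app (a controller or console command, for instance) and that the spider lives in `App\Spiders` as in the code above:

```php
<?php

use App\Spiders\douglasnl;
use RoachPHP\Roach;

// collectSpider() runs the spider once and returns the scraped items,
// so no separate startSpider() call is needed.
$items = Roach::collectSpider(douglasnl::class);

foreach ($items as $item) {
    // all() returns the item's data as an array, i.e. the values
    // passed to $this->item() in parse().
    var_dump($item->all());
}
```

`Roach::startSpider` would be the choice when the results are handled by item processors and do not need to be returned to the caller.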