roach-php / core

The complete web scraping toolkit for PHP.

Home Page: https://roach-php.dev

Trying to parse the first page of a paginated result (Call to undefined method Generator::value())

matthiastjong opened this issue

I am trying to scrape a page that has pagination links at the bottom. In the Roach docs I found that you can override initialRequests() to discover other URLs to scrape.

This is working as expected:

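use RoachPHP\Http\Request;
use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;
use Symfony\Component\DomCrawler\Crawler;
use Symfony\Component\DomCrawler\Link;

class ExampleSpider extends BasicSpider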
{
    public function parseOverview(Response $response): \Generator
    {
        $pageUrls = array_map(
            function (Link $link) {
                return $link->getUri();
            },
            $response
                ->filter('.pages-items li a')
                ->links(),
        );

        foreach ($pageUrls as $pageUrl) {
            // Since we’re not specifying a parse method here,
            // all paginated pages will get handled by the
            // spider’s `parse` method.
            yield $this->request('GET', $pageUrl);
        }
    }

    public function parse(Response $response): \Generator
    {
        $items = $response->filter('.product-item')->each(function (Crawler $product, $i) {

            $productName = $product->filter('.product-item-link');
            $array['product_name'] = $productName->count() ? $productName->text() : null;

            $link = $product->filter('.product-item-link');
            $array['link'] = $link->count() ? $link->link()->getUri() : null;

            $imageUrl = $product->filter('.product-image-photo');
            $array['image_url'] = $imageUrl->count() ? $imageUrl->image()->getUri() : null;

            $salePrice = $product->filter('.price-final_price .price');
            $array['sale_price'] = $salePrice->count() ? $salePrice->text() : null;

            $regularPrice = $product->filter('.old-price span.price');
            $array['regular_price'] = $regularPrice->count() ? $regularPrice->text() : null;

            $attributeSize = $product->filter('.attribute.size');
            $array['attribute_size'] = $attributeSize->count() ? $attributeSize->text() : null;

            $savings = $product->filter('.sticker-wrapper');
            $array['savings'] = $savings->count() ? $savings->text() : null;

            return $array;
        });

        foreach ($items as $item) {
            if (!$item) {
                continue;
            }
            yield $this->item($item);
        }
    }

    /** @return Request[] */
    protected function initialRequests(): array
    {
        return [
            new Request(
                'GET',
                'https://www.example.com/5-pages', // Has 5 pages
                [$this, 'parseOverview']
            ),
            new Request(
                'GET',
                'https://www.example.com/1-page', // Has 1 page (no pagination)
                [$this, 'parseOverview']
            ),
        ];
    }
}

However, this only scrapes the pages whose links are gathered by the parseOverview() method. I would also like to parse the $response object of the first page (https://www.example.com/5-pages) itself, and not only:

  1. https://www.example.com/5-pages?page=2
  2. https://www.example.com/5-pages?page=3
  3. https://www.example.com/5-pages?page=4
  4. https://www.example.com/5-pages?page=5

So I figured, as we have the first page already in the Response, I'll try running the $this->parse() method on the $response object in the parseOverview() method:

public function parseOverview(Response $response): \Generator
    {
        yield $this->parse($response); // Here I try yielding the parse() method using the response object from the first page

        $pageUrls = array_map(
            function (Link $link) {
                return $link->getUri();
            },
            $response
                ->filter('.pages-items li a')
                ->links(),
        );

        foreach ($pageUrls as $pageUrl) {
            // Since we’re not specifying a parse method here,
            // all paginated pages will get handled by the
            // spider’s `parse` method.
            yield $this->request('GET', $pageUrl);
        }
    }

However, when running the Spider I get the following error: Call to undefined method Generator::value()

I tried adding the first page's URL to the $pageUrls array, but then I get a DuplicatedRequest. That is good, because I do not want to fire the request twice when we already have a working Response object.

What would you recommend changing so that I also get the data from the first page?

The issue is probably this line right here:

yield $this->parse($response); // Here I try yielding the parse() method using the response object from the first page

$this->parse(...) already returns a Generator so all you’re doing is yielding it again, essentially returning a Generator<int, Generator<int, ParseResult>>. What you actually want to do is yield from that generator, instead of yielding the Generator itself.

public function parseOverview(Response $response): \Generator
{
-    yield $this->parse($response);
+    yield from $this->parse($response);
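
The difference can be reproduced in plain PHP without Roach. The sketch below is only an illustration: parseItems(), withYield() and withYieldFrom() are hypothetical stand-ins for the spider's parse() and parseOverview() methods, and the strings stand in for the items and requests a spider would yield. With a plain yield the consumer receives a single Generator object, which would be consistent with the reported error if the consumer then calls a method such as value() on it. With yield from it receives each inner value in turn.

```php
<?php

// Hypothetical stand-in for a spider's parse() method.
function parseItems(): Generator
{
    yield 'item-1';
    yield 'item-2';
}

// Plain `yield` hands the consumer the inner Generator object itself.
function withYield(): Generator
{
    yield parseItems();
}

// `yield from` delegates to the inner generator, so the consumer
// receives each inner value, followed by whatever is yielded afterwards.
function withYieldFrom(): Generator
{
    yield from parseItems();
    yield 'next-page-request';
}

// Pass false for preserve_keys: `yield from` keeps the inner generator's
// keys, which would otherwise collide with the outer generator's keys.
$nested = iterator_to_array(withYield(), false);
$flat = iterator_to_array(withYieldFrom(), false);

var_dump(count($nested));                   // int(1)
var_dump($nested[0] instanceof Generator);  // bool(true)
print_r($flat);                             // item-1, item-2, next-page-request
```

Note that the first page's items come out before the follow-up value, which mirrors putting yield from $this->parse($response) at the top of parseOverview().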

Does that solve your problem?

That's it! Thanks.