roach-php / core

The complete web scraping toolkit for PHP.

Home Page: https://roach-php.dev

Xpath not working

Benoit1980 opened this issue

Hello,

I was trying to test your library but cannot get a basic XPath query working on Google (as an example).

    public function parse(Response $response): Generator
    {
        $html = $response->filterXpath('//div[contains(@id, "center_col")]')->each(function (Crawler $node) {
            return $node->text();
        });
        yield $this->item([
            'html' => $html,
        ]);
    }

It returns an empty array, which is strange because there is a "center_col" div in the page.

Any idea why it is not working, please?

Thank you.

What URL were you crawling exactly? Running the same XPath query on https://google.com in the browser console doesn't return any results either.

Hello,

Just a general search results page on Google.

Thank you.

That's because, by default, Google renders the search results via JavaScript. You can verify this by opening the search results page with JavaScript disabled in your browser: after a few seconds you will be redirected to a static HTML version of the page. This means the HTML your spider sees doesn't actually include a center_col div.
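To illustrate the point, here is a minimal sketch using only PHP's built-in DOM extension. Both HTML strings are made-up stand-ins (not Google's real markup), and `countMatches` is a hypothetical helper; the idea is simply that the same XPath query matches nothing when the div only exists after JavaScript has run.

```php
<?php

// Hypothetical static HTML, standing in for what a spider downloads
// when the results are rendered client-side.
$staticHtml = '<html><body><div id="main"><p>Loading</p></div></body></html>';

// Hypothetical markup as it might look after JavaScript has run in a browser.
$renderedHtml = '<html><body><div id="main"><div id="center_col">Results</div></div></body></html>';

// Count how many nodes an XPath query matches in an HTML string.
function countMatches(string $html, string $query): int
{
    $document = new DOMDocument();
    // Suppress parser warnings about imperfect real-world HTML.
    @$document->loadHTML($html);

    $xpath = new DOMXPath($document);

    return $xpath->query($query)->length;
}

$query = '//div[contains(@id, "center_col")]';

echo countMatches($staticHtml, $query), PHP_EOL;   // no center_col div in the static HTML
echo countMatches($renderedHtml, $query), PHP_EOL; // one match once the div exists
```

If the query comes back empty, dumping the raw response body and searching it for the expected id is usually the quickest way to check what the spider actually received.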

In order for this to work properly, you would have to use the ExecuteJavascriptMiddleware in your spider. This isn't possible to do in the interactive shell though.
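As a rough sketch of what that could look like, the spider below registers the middleware through the `$downloaderMiddleware` property. The class name `SearchResultsSpider` and the start URL are placeholders, and this assumes the middleware's extra dependencies are installed (as far as I know it relies on `spatie/browsershot` and Puppeteer), so check the Roach documentation for the exact setup.

```php
<?php

use Generator;
use RoachPHP\Downloader\Middleware\ExecuteJavascriptMiddleware;
use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;
use Symfony\Component\DomCrawler\Crawler;

// Hypothetical spider name; only the middleware registration matters here.
class SearchResultsSpider extends BasicSpider
{
    // Placeholder URL, replace with the page you actually want to crawl.
    public array $startUrls = [
        'https://example.com/search?q=roach',
    ];

    // Run the page's JavaScript in a headless browser before parse() is called.
    public array $downloaderMiddleware = [
        ExecuteJavascriptMiddleware::class,
    ];

    public function parse(Response $response): Generator
    {
        // Same query as in the question, now run against the rendered HTML.
        $html = $response
            ->filterXPath('//div[contains(@id, "center_col")]')
            ->each(fn (Crawler $node) => $node->text());

        yield $this->item([
            'html' => $html,
        ]);
    }
}
```

Even with JavaScript executed, the result may still be empty if the site serves different markup to automated clients.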

Also note that sometimes websites implement certain anti-scraping measures. So what you see in the browser and what your spider sees might not necessarily be the same thing.

Thank you so much for the explanation.