roach-php / core

The complete web scraping toolkit for PHP.

Home Page: https://roach-php.dev

Is there a documented way to scrape Single Page Applications?

bilogic opened this issue

SPAs usually pass a CSRF token to the client for use in subsequent requests. Is there a Roach way of scraping such sites?

If the CSRF token is part of the page's source, you can extract it like any other piece of information. You would then have to figure out how exactly the site expects the CSRF token to be sent with each subsequent request, for example as a header.

You can then set the header from within your spider before dispatching new requests: https://roach-php.dev/docs/processing-responses#returning-custom-requests

So, assuming the CSRF token exists in the page source like this:

<meta name="csrfToken" content="...">

Your parse method could look something like this:

public function parse(Response $response): \Generator
{
    // do your scraping here...

    // Grab the CSRF token from the <meta> tag in the page source.
    $csrfToken = $response->filter('meta[name="csrfToken"]')->attr('content');

    $request = new Request(
        'POST',
        'https://next-url-to-crawl.com',
        $this->parse(...),
        // Assuming the CSRF token should get passed in the X-CSRF-Token header.
        ['headers' => ['X-CSRF-Token' => $csrfToken]],
    );

    yield ParseResult::fromValue($request);
}
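
For context, here is what this could look like as a complete spider, including the necessary imports. This is a minimal sketch, not a drop-in solution: the class name, start URL, and target URL are placeholders, and it assumes your spider extends Roach's BasicSpider base class.

<?php

use RoachPHP\Http\Request;
use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;
use RoachPHP\Spider\ParseResult;

class CsrfSpider extends BasicSpider
{
    // Hypothetical start URL; replace with the page that embeds the token.
    public array $startUrls = [
        'https://example.com',
    ];

    public function parse(Response $response): \Generator
    {
        // Extract the CSRF token from the page's <meta> tag.
        $csrfToken = $response->filter('meta[name="csrfToken"]')->attr('content');

        $request = new Request(
            'POST',
            // Hypothetical next URL to crawl.
            'https://example.com/api/data',
            $this->parse(...),
            // Assumes the site expects the token in the X-CSRF-Token header;
            // adjust the header name to whatever the site actually uses.
            ['headers' => ['X-CSRF-Token' => $csrfToken]],
        );

        yield ParseResult::fromValue($request);
    }
}

You would then run it as usual with Roach::startSpider(CsrfSpider::class).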