roach-php / core

The complete web scraping toolkit for PHP.

Home Page: https://roach-php.dev

Is there a documented way to scrape Single Page Applications?

bilogic opened this issue

SPAs usually pass a CSRF token to the client for use in subsequent requests. Is there a Roach way of scraping such sites?

If the CSRF token is part of the page's source, you can extract it like any other piece of information. You would then have to figure out how exactly the site expects the CSRF token to be sent with each subsequent request, for example as a header.

You can then set the header from within your spider before dispatching new requests: https://roach-php.dev/docs/processing-responses#returning-custom-requests

So, assuming the CSRF token exists in the page source like this:

<meta name="csrfToken" content="...">

Your parse method could look something like this:

public function parse(Response $response): \Generator
{
    // do your scraping here...

    // Grab the CSRF token from the <meta> tag in the page source.
    $csrfToken = $response->filter('meta[name="csrfToken"]')->attr('content');

    $request = new Request(
        'POST',
        'https://next-url-to-crawl.com',
        $this->parse(...),
        // Assuming the CSRF token should get passed in the X-CSRF-Token header.
        ['headers' => ['X-CSRF-Token' => $csrfToken]],
    );

    yield ParseResult::fromValue($request);
}
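
For context, here is what this could look like as a complete spider, including the necessary imports. This is a minimal sketch, not a drop-in solution: the class name, start URL, and target URL are placeholders, and it assumes your spider extends Roach's BasicSpider base class.

<?php

use RoachPHP\Http\Request;
use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;
use RoachPHP\Spider\ParseResult;

class CsrfSpider extends BasicSpider
{
    // Hypothetical start URL; replace with the page that embeds the token.
    public array $startUrls = [
        'https://example.com',
    ];

    public function parse(Response $response): \Generator
    {
        // Extract the CSRF token from the page's <meta> tag.
        $csrfToken = $response->filter('meta[name="csrfToken"]')->attr('content');

        $request = new Request(
            'POST',
            // Hypothetical next URL to crawl.
            'https://example.com/api/data',
            $this->parse(...),
            // Assumes the site expects the token in the X-CSRF-Token header;
            // adjust the header name to whatever the site actually uses.
            ['headers' => ['X-CSRF-Token' => $csrfToken]],
        );

        yield ParseResult::fromValue($request);
    }
}

You would then run it as usual with Roach::startSpider(CsrfSpider::class).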