roach-php / core

The complete web scraping toolkit for PHP.

Home Page: https://roach-php.dev

Testing how a spider scrapes a given HTML file

seb-jones opened this issue

Hello there,

Just a question: is there a simple way to feature test a spider by giving it some HTML and inspecting what it returns, e.g. by making assertions against the items returned by Roach::collectSpider()?
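
For example (purely a sketch: MySpider here is a made-up spider class, and I'm assuming the scraped items expose their fields via get()), something along these lines:

use RoachPHP\Roach;

it('extracts the page title', function () {
    // Hypothetically: run the spider against a known HTML fixture
    // and collect the items it produces, without hitting the network.
    $items = Roach::collectSpider(MySpider::class);

    expect($items)->toHaveCount(1);
    expect($items[0]->get('title'))->toBe('Hello World');
});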

Many thanks

Seb

I'm afraid there isn't a nice way to do this at the moment, but it's something I will probably add in the future.

Cool cool, thanks for the response :)

For what it's worth, I've managed to implement a fairly simple, albeit inelegant, way to do this kind of test in the meantime. It works by firing up a PHP dev server that serves the HTML fixtures and pointing the spider at it by overriding its startUrls. Thought I'd share the code here in case it's useful to anyone:

use RoachPHP\Roach;
use RoachPHP\Spider\Configuration\Overrides;

$serverProcess = null;

beforeAll(function () {
    global $serverProcess;

    // Serve the HTML fixtures in resources/html via PHP's built-in dev server.
    $serverProcess = proc_open('php -S localhost:8123 -t resources/html', [], $pipes);

    // Give the server a moment to boot before the spider starts requesting pages.
    usleep(250000);
});

it('scrapes an html page', function () {
    // Point the spider at the local server instead of its real start URLs.
    $scrapedItems = Roach::collectSpider(
        MySpider::class,
        new Overrides(startUrls: ['http://localhost:8123']),
    );

    // do some assertions on $scrapedItems
});

afterAll(function () {
    global $serverProcess;

    // Shut the dev server down once all tests in the file have run.
    proc_terminate($serverProcess);
});

The above assumes that there is an index.html file in resources/html.
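
For reference, the spider itself can be as simple as something like this (a minimal sketch: the class name, the title field, and the placeholder start URL are just examples, and the real startUrls don't matter since the test overrides them):

use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;

class MySpider extends BasicSpider
{
    // Overridden to http://localhost:8123 by the test above.
    public array $startUrls = ['https://example.com'];

    public function parse(Response $response): \Generator
    {
        // Pull the <h1> out of resources/html/index.html and emit it as an item.
        yield $this->item([
            'title' => $response->filter('h1')->text(),
        ]);
    }
}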

I imagine there's probably a nicer way to do it, but this seems to be working right now.

FYI, I've already started working on testing helpers for this. https://twitter.com/warsh33p/status/1543150150205538304

Shouldn't take too much longer.

Nice! I look forward to trying them out.