medialab / sandcrawler

sandcrawler.js - the server-side scraping companion.

Home Page: http://medialab.github.io/sandcrawler/

Can you do child requests?

kevinrademan opened this issue

commented

I was wondering if you had considered allowing "sub-requests" per url.

I hit page A to get all the product data, including the product id.
Then I need to hit page B to get the product availability data, using the product id from page A.

Ideally I'd like the availability to be included in the response from page A.

Would something like this be possible?

You cannot do it per se, but you can still emulate it by building your data in the result callback. Typically, people use an external data variable that holds the scraped data and is built up while crawling. Say this variable is an object whose keys are your product ids: based on the urls, or on arbitrary data you pass to your jobs, you can complete the relevant entry within the result callback.

I am not sure I make sense. Tell me if you understand what I mean; here is a quick example in case it helps 😄.
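Something along these lines (a rough, untested sketch: the urls and selectors are placeholders, and I am assuming here that page B returns JSON and that the product id can be recovered from the job's url):

var sandcrawler = require('sandcrawler');

// External variable holding the scraped data, keyed by product id.
var products = {};

// First pass: scrape page A to collect the product data and the ids.
sandcrawler.spider()
    .urls([
        'http://pageA-url-goes-here/product/1.html',
        'http://pageA-url-goes-here/product/2.html'
    ])
    .scraper(function($, done) {
        // Placeholder selector: adapt to your markup.
        done(null, {id: $('[name=somename]').data('catentryid')});
    })
    .result(function(err, req, res) {
        if (err) return;
        products[res.data.id] = res.data;
    })
    .run(function(err) {

        // Second pass: hit page B for every collected id and complete
        // the matching entry within the result callback.
        sandcrawler.spider()
            .urls(Object.keys(products).map(function(id) {
                return 'http://pageB-url-goes-here/avdata?catEntryId=' + id;
            }))
            .result(function(err, req, res) {
                if (err) return;

                // Recover the id from the url of the job that just ran.
                var id = req.url.split('catEntryId=')[1];
                products[id].availability = JSON.parse(res.body);
            })
            .run(function() {
                console.log('Complete data:', products);
            });
    });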

commented

Yep, that makes perfect sense. I'm busy changing my code now. I did also find a dirty workaround (for testing only).

The idea is basically that you create a nested scraper inside the scrape callback. This scraper then hits the second url and calls the parent scraper's done method in its own result callback.

var sandcrawler = require('sandcrawler');

var spider = sandcrawler.spider()
    //.use(dashboard())
    .url('http://urlgoeshere/de/product/product.html')
    .scraper(function($, done) {

        // Scrape the product data from the page itself.
        var data = {
            id: $('[name=somename]').data('catentryid'),
            attributes: $('#features section').scrape({
                group: {
                    sel: 'h2',
                    method: 'html'
                },
                items: function() {
                    return $(this).find('dt').scrape({
                        title: 'text',
                        value: function() {
                            return $(this).next().html();
                        }
                    });
                }
            })
        };

        // Nested spider: fetch the availability data for the scraped id
        // and resolve the parent job by calling its `done` from within
        // the nested result callback.
        sandcrawler.spider()
            //.use(dashboard())
            .url({
                url: 'http://urlgoeshere/avdata?catEntryId=' + data.id
            })
            .result(function(err, req, res) {
                data.offers = JSON.parse(res.body).markets;
                done(null, data);
            })
            .run(function(err, remains) {
                console.log('And we are done!');
            });
    })
    .result(function(err, req, res) {
        console.log('Scraped data:', JSON.stringify(res.data));
    })
    .run(function(err, remains) {
        //console.log('And we are done!');
    });

Yes, this works too, though I'll admit it feels a bit convoluted. Note that it won't work with a phantom spider, however, since the scraper function is then evaluated within the phantom page itself and cannot reach the parent closure.

May I ask you, if it is not indiscreet, how you found this library?

commented

I'm currently looking into a few different scraping frameworks and found it on https://www.npmjs.com/package/sandcrawler.