medialab / sandcrawler

sandcrawler.js - the server-side scraping companion.

Home Page: http://medialab.github.io/sandcrawler/

Can you do child requests?

kevinrademan opened this issue

commented

I was wondering if you had considered allowing "sub-requests" per url.

I hit page A to get all the product data, including the product id.
Then I need to hit page B to get the product availability data, using the product id from page A.

Ideally I'd like the availability to be included in the response from page A.

Would something like this be possible?

You cannot do it per se, but you can still emulate it by building your data in the result callback. Typically, people use an external data variable that holds the scraped data and is built up while crawling. Say this variable is an object whose keys are your product ids: based on the urls, or on arbitrary data you pass to your jobs, you can complete the relevant entry within the result callback.

I am not sure I make sense. Tell me if you understand what I mean; here is a quick example in case it helps 😄.
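Something along these lines (a rough, untested sketch: the urls and selectors are placeholders, and I am assuming here that page B returns JSON and that the product id can be recovered from the job's url):

var sandcrawler = require('sandcrawler');

// External variable holding the scraped data, keyed by product id.
var products = {};

// First pass: scrape page A to collect the product data and the ids.
sandcrawler.spider()
    .urls([
        'http://pageA-url-goes-here/product/1.html',
        'http://pageA-url-goes-here/product/2.html'
    ])
    .scraper(function($, done) {
        // Placeholder selector: adapt to your markup.
        done(null, {id: $('[name=somename]').data('catentryid')});
    })
    .result(function(err, req, res) {
        if (err) return;
        products[res.data.id] = res.data;
    })
    .run(function(err) {

        // Second pass: hit page B for every collected id and complete
        // the matching entry within the result callback.
        sandcrawler.spider()
            .urls(Object.keys(products).map(function(id) {
                return 'http://pageB-url-goes-here/avdata?catEntryId=' + id;
            }))
            .result(function(err, req, res) {
                if (err) return;

                // Recover the id from the url of the job that just ran.
                var id = req.url.split('catEntryId=')[1];
                products[id].availability = JSON.parse(res.body);
            })
            .run(function() {
                console.log('Complete data:', products);
            });
    });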

commented

Yep, that makes perfect sense. I'm busy changing my code now. I did also find a dirty workaround (for testing only).

The idea is basically that you create a nested scraper inside the scrape callback. This scraper then hits the second url and calls the parent scraper's done method in its own result callback.

var sandcrawler = require('sandcrawler');

var spider = sandcrawler.spider()
    //.use(dashboard())
    .url('http://urlgoeshere/de/product/product.html')
    .scraper(function($, done) {

        // Scrape the product data from the page itself.
        var data = {
            id: $('[name=somename]').data('catentryid'),
            attributes: $('#features section').scrape({
                group: {
                    sel: 'h2',
                    method: 'html'
                },
                items: function() {
                    return $(this).find('dt').scrape({
                        title: 'text',
                        value: function() {
                            return $(this).next().html();
                        }
                    });
                }
            })
        };

        // Nested spider: fetch the availability data for the scraped id
        // and resolve the parent job by calling its `done` from within
        // the nested result callback.
        sandcrawler.spider()
            //.use(dashboard())
            .url({
                url: 'http://urlgoeshere/avdata?catEntryId=' + data.id
            })
            .result(function(err, req, res) {
                data.offers = JSON.parse(res.body).markets;
                done(null, data);
            })
            .run(function(err, remains) {
                console.log('And we are done!');
            });
    })
    .result(function(err, req, res) {
        console.log('Scraped data:', JSON.stringify(res.data));
    })
    .run(function(err, remains) {
        //console.log('And we are done!');
    });

Yes, this works too, though I'll admit it feels a bit convoluted. Note that it won't work with a phantom spider, however, since the scraper function is then evaluated within the phantom page itself and cannot reach the parent closure.

May I ask you, if it is not indiscreet, how you found this library?

commented

I'm currently looking into a few different scraping frameworks and found it on https://www.npmjs.com/package/sandcrawler.