bda-research / node-crawler

Web Crawler/Spider for NodeJS + server-side jQuery ;-)


queue method should return a promise

CristianMR opened this issue

When skipDuplicates is set to true, the 'drain' event can be emitted before the check of whether a queued uri has already been seen has finished. The solution is quite straightforward: make queue return a promise that resolves once that check completes. This would allow us to await c.queue(uri) inside the callback function before calling done().

Crawler.prototype.queue = function queue(options) {
  var self = this;

  // Did you get a single object or string? Make it compatible.
  options = _.isArray(options) ? options : [options];

  options = _.flattenDeep(options);

  const promises = options.map((option) => {
    if (self.isIllegal(option)) {
      log('warn', 'Illegal queue option: ', JSON.stringify(option));
      return; // resolves to undefined in Promise.all below
    }
    return self._pushToQueue(
      _.isString(option) ? { uri: option } : option
    );
  });

  // Resolves once every option has been checked against `seen`
  return Promise.all(promises);
};

Crawler.prototype._pushToQueue = function _pushToQueue(options) {
  var self = this;
  // ...
  // just return the promise, so queue can await the seen check
  return self.seen.exists(options, options.seenreq).then(rst => {
    if (!rst) {
      self._schedule(options);
    }
  }).catch(e => log('error', e));
};
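
With that change, callers could wait for re-queued links before signalling completion. A minimal sketch of the intended usage, assuming the patched queue above (the link extraction via res.$ and the URL are just illustrative):

const Crawler = require('crawler');

const c = new Crawler({
  skipDuplicates: true,
  callback: async (error, res, done) => {
    if (error) {
      console.error(error);
      return done();
    }
    // Collect hrefs from the page with the server-side jQuery handle
    const links = res.$('a')
      .map((i, el) => res.$(el).attr('href'))
      .get();
    // With the patch, queue resolves after the seen checks finish,
    // so 'drain' cannot fire before these URIs are scheduled
    await c.queue(links);
    done();
  },
});

c.queue('https://example.com/');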

Look, I totally understand, but it would break current API usage, and if one does not await queue, an 'unhandled promise rejection' warning will always be there. What's worse, it would make the API confusing to provide a promise and a callback at the same time. To be honest, it is better to deduplicate outside the crawler, which means it should be handled by the developer. That keeps the flexibility and consistency. Hope this helps.
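
For example, a minimal sketch of deduplicating outside the crawler with a plain Set (the res.$ link extraction and URL are illustrative; adapt the dedup key to your needs, e.g. normalized URIs):

const Crawler = require('crawler');

const seen = new Set();

const c = new Crawler({
  callback: (error, res, done) => {
    if (error) {
      console.error(error);
      return done();
    }
    const links = res.$('a')
      .map((i, el) => res.$(el).attr('href'))
      .get();
    // Synchronous dedup before queueing: nothing async happens between
    // the check and the queue call, so 'drain' stays accurate
    for (const uri of links) {
      if (uri && !seen.has(uri)) {
        seen.add(uri);
        c.queue(uri);
      }
    }
    done();
  },
});

seen.add('https://example.com/');
c.queue('https://example.com/');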

Thanks for your answer, Mike. I already did it that way. It took some hours to track down this issue, so others will probably hit it too. Have a nice year btw ✨

Sorry to hear that you spent hours on this issue; let's keep the details here to help others. Thanks, and the same to you: have a nice year.