bda-research / node-crawler

Web Crawler/Spider for NodeJS + server-side jQuery ;-)

Home Page: http://node-crawler.org

Can this tool actually crawl/spider, or just scrape pages?

bogdancss opened this issue

Hey,

I may not fully understand these terms, but can this tool actually crawl/spider all the pages under a domain, or does it just scrape a specific URL?

When I say crawl/spider, I am thinking of something like the ScreamingFrog Spider tool, where you provide a URL and it finds all (or most) of the other pages on that site.

Please feel free to close this issue, but I feel the tool description needs to be a bit clearer.

Thanks

I agree, node-scraper would be a more fitting name for this tool. Or is there an easy configuration to make it behave like an actual crawler?

Yes, "scraper" would be a much better name. But never mind, you may implement a spider by yourself based on this.

> But never mind, you may implement a spider by yourself based on this.

How do you do this?

  1. Pick the home page or another entry URL that is a good place to start;
  2. Send a request to the URL(s);
  3. Parse the page content from the response to extract all the URLs you care about, which will usually be on the same domain as the page you just fetched;
  4. Save the page content to a file or a database, whichever you prefer;
  5. Repeat from step 2 until there are no unvisited URLs left.
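
The steps above can be sketched roughly as follows. This is a minimal, dependency-free illustration and not this tool's own API: `crawlSite`, `extractLinks`, the `fetchPage` callback, and the `maxPages` option are all hypothetical names introduced for the example, and the regex link extraction is a simplification that a real HTML parser would handle more robustly.

```javascript
// Extract absolute, same-origin links from an HTML string (step 3).
// Assumption: a regex keeps the sketch dependency-free; it will miss
// unusual markup that a real HTML parser would handle.
function extractLinks(html, baseUrl) {
  const origin = new URL(baseUrl).origin;
  const links = new Set();
  for (const match of html.matchAll(/<a\s[^>]*href\s*=\s*["']([^"']+)["']/gi)) {
    let url;
    try {
      url = new URL(match[1], baseUrl); // resolve relative links
    } catch {
      continue; // skip malformed hrefs
    }
    url.hash = ''; // "#fragment" variants point at the same page
    if (url.origin === origin) links.add(url.href);
  }
  return [...links];
}

// Breadth-first crawl. `fetchPage` is a hypothetical injectable function
// (url) => Promise<string> that returns the HTML of a page.
async function crawlSite(startUrl, fetchPage, { maxPages = 100 } = {}) {
  const start = new URL(startUrl).href; // step 1: the entry URL
  const queue = [start];
  const seen = new Set(queue);
  const pages = new Map(); // url -> html, a stand-in for a file or database

  while (queue.length > 0 && pages.size < maxPages) {
    const url = queue.shift();
    let html;
    try {
      html = await fetchPage(url); // step 2: request the page
    } catch (err) {
      continue; // skip pages that fail to load
    }
    pages.set(url, html); // step 4: save the content
    for (const link of extractLinks(html, url)) {
      if (!seen.has(link)) {
        seen.add(link);
        queue.push(link); // step 5: newly found URLs go back to step 2
      }
    }
  }
  return pages;
}
```

On Node 18 or later, where a global `fetch` is available, one way to try this would be `crawlSite('https://example.com/', (url) => fetch(url).then((res) => res.text()))`. The `seen` set is what keeps the loop from revisiting pages, and `maxPages` is a safety cap so the crawl stops on very large sites.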

Solved in #420.