Can this tool actually crawl/spider, or just scrape pages?

Question

bogdancss opened this issue 3 years ago · comments

Hey,

I may not fully understand these terms, but can this tool actually crawl/spider all the pages under a domain, or does it just scrape a specific URL?

When I say crawl/spider, I am thinking of something like the ScreamingFrog Spider tool, where you can provide a URL, and it will find all (most) other pages on that site.

Please feel free to close this issue, but I feel the tool description needs to be a bit clearer.

Thanks

Paul Kretschel · Answer 1 · Sat Oct 30 2021 19:46:08 GMT+0800 (China Standard Time)

I agree, node-scraper would be a more fitting name for this tool. Or is there an easy configuration to make it behave like an actual crawler?

Mike Chen · Answer 2 · Mon Nov 08 2021 16:31:35 GMT+0800 (China Standard Time)

Yes, "scraper" should be much better. But never mind, you may implement a spider by yourself based on this.

Raquel Smith · Answer 3 · Wed Nov 17 2021 04:35:01 GMT+0800 (China Standard Time)

> But never mind, you may implement a spider by yourself based on this.

How do you do this?

Mike Chen · Answer 4 · Fri Nov 26 2021 16:38:29 GMT+0800 (China Standard Time)

1. Pick the home page or another entry URL that makes a good starting point.
2. Send a request to the URL(s).
3. Parse the page content from the response to extract all the URLs you care about, which may be those on the same domain as the previous one.
4. Save the page content to a file or a database, whichever you prefer.
5. Repeat from step 2 until there are no new URLs left to visit.
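
The steps above could be sketched roughly as follows. This is only a minimal illustration, not this tool's API: `crawl`, `extractLinks`, `fetchPage`, and `maxPages` are hypothetical names, pages are "saved" to an in-memory `Map` rather than a file or database, and links are pulled out with a simple regex where a real crawler would more likely use an HTML parser.

```javascript
// Hypothetical helper: pull href values out of raw HTML with a regex.
// This is a simplification; a real crawler would use a proper HTML parser.
function extractLinks(html, baseUrl) {
  const links = [];
  const hrefPattern = /href\s*=\s*["']([^"']+)["']/gi;
  let match;
  while ((match = hrefPattern.exec(html)) !== null) {
    try {
      const url = new URL(match[1], baseUrl); // resolve relative links
      url.hash = ""; // treat /page and /page#section as the same page
      links.push(url.href);
    } catch {
      // ignore hrefs that cannot be resolved to a URL
    }
  }
  return links;
}

// Breadth-first crawl restricted to the start URL's origin.
// fetchPage is an assumed async function: (url) => HTML string.
async function crawl(startUrl, fetchPage, maxPages = 100) {
  const origin = new URL(startUrl).origin;
  const queue = [new URL(startUrl).href]; // step 1: entry URL
  const seen = new Set(queue);
  const pages = new Map(); // url -> html

  while (queue.length > 0 && pages.size < maxPages) {
    const url = queue.shift();
    let html;
    try {
      html = await fetchPage(url); // step 2: send the request
    } catch {
      continue; // skip pages that fail to load
    }
    pages.set(url, html); // step 4: save the content
    // step 3: collect same-origin links that have not been queued yet
    for (const link of extractLinks(html, url)) {
      if (!seen.has(link) && new URL(link).origin === origin) {
        seen.add(link);
        queue.push(link);
      }
    }
  } // step 5: repeat until the queue is empty or maxPages is reached
  return pages;
}
```

On Node 18 or later, `fetchPage` could be as simple as `(url) => fetch(url).then((res) => res.text())`. The `seen` set is what keeps the loop from revisiting pages that link back to each other, and `maxPages` is there as a safety cap for large sites.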

Mike Chen · Answer 5 · Tue Jul 19 2022 16:35:26 GMT+0800 (China Standard Time)

Solved in #420.