Is there a way to stop the spider's duplicate check with Redis?
milkeasd opened this issue
My spider is extremely slow when run with scrapy-redis, because there is a big delay between the slave and the master. I want to reduce the communication to only fetching the start_urls, either periodically or once all start_urls are done. Is there any way to do so?
Moreover, I want to disable the duplicate check to reduce the number of connections.
But I can't change DUPEFILTER_CLASS to the Scrapy default one; it raises an error.
Is there any other way to disable the duplicate check?
Or any ideas that could help speed up the process?
Thanks
@Germey Any ideas?
@milkeasd
Could you provide the related code files?
The way I see it, letting developers customize their communication rules and adding a disable option for DUPEFILTER_CLASS could be two great features.
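As a side note on reducing communication: if I remember correctly, scrapy-redis reads a REDIS_START_URLS_BATCH_SIZE setting that controls how many start URLs are popped from Redis per round trip (it falls back to CONCURRENT_REQUESTS when unset). A minimal sketch; the value is illustrative, not a recommendation:

```python
# settings.py
# Fetch more start URLs from Redis per round trip to reduce master/slave chatter.
REDIS_START_URLS_BATCH_SIZE = 256
```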
@milkeasd
To disable DUPEFILTER_CLASS, try this: https://stackoverflow.com/questions/23131283/how-to-force-scrapy-to-crawl-duplicate-url
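The linked answer amounts to passing dont_filter=True on each request, which bypasses the dupefilter entirely:

```python
# A minimal sketch: bypass duplicate filtering for a single request.
yield scrapy.Request(url, callback=self.parse, dont_filter=True)
```

If you want to keep the scrapy-redis scheduler but skip the Redis round trip for deduplication, a no-op subclass of its dupefilter should also work. This is a sketch under that assumption, not a tested fix; NoDupeFilter and myproject are hypothetical names:

```python
# dupefilters.py
from scrapy_redis.dupefilter import RFPDupeFilter

class NoDupeFilter(RFPDupeFilter):
    """Dupefilter that never reports a duplicate, so no SADD is sent to Redis."""

    def request_seen(self, request):
        # Always treat the request as unseen; nothing is written to Redis.
        return False
```

```python
# settings.py
DUPEFILTER_CLASS = "myproject.dupefilters.NoDupeFilter"
```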