Add support for concurrent invocations to crawl

Question

adityak74 opened this issue 9 months ago

PlaywrightCrawler creates a lock file and fails when the crawl is invoked concurrently. The class has a property named running that we can check to see whether an instance is already running. To handle concurrent calls, we should spawn a separate process with Playwright for each crawl job.
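Until per-process crawling exists, one stopgap could be to serialize invocations in-process so a second crawl never starts while the first is active. The sketch below is only an illustration under assumptions: `CrawlerLike` is a hypothetical structural type covering just a `running` flag and a `run()` method, and `SerialGate` is a made-up helper, not part of Crawlee.

```typescript
// Hypothetical stand-in for a crawler: only the two members this sketch needs.
interface CrawlerLike {
  running: boolean;
  run(urls: string[]): Promise<void>;
}

// Hypothetical helper: queues invocations so run() is never entered
// while a previous crawl is still active.
class SerialGate {
  private crawler: CrawlerLike;
  private tail: Promise<void> = Promise.resolve();

  constructor(crawler: CrawlerLike) {
    this.crawler = crawler;
  }

  crawl(urls: string[]): Promise<void> {
    // Chain this job behind whatever was queued before it.
    const job = this.tail.then(async () => {
      if (this.crawler.running) {
        throw new Error("crawler is still running; refusing to start");
      }
      await this.crawler.run(urls);
    });
    // Swallow failures on the chain only, so later jobs still get their turn.
    this.tail = job.catch(() => undefined);
    return job;
  }
}
```

Callers would go through `gate.crawl(urls)` instead of calling `run()` directly. This trades throughput for safety: jobs run one after another, so it does not replace the process-per-job idea above.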

@BikeF

İbrahim Sarıkaya · Answer 1 · Tue Jan 23 2024 15:42:13 GMT+0800 (China Standard Time)

Hey @adityak74,

I made some attempts to solve this problem, but I was not successful.
Any progress on your side?

Aditya Karnam · Answer 2 · Tue Jan 23 2024 15:51:07 GMT+0800 (China Standard Time)

@isarikaya can you add some details on what you tried? That would help me investigate. I haven't found a solution yet either.

İbrahim Sarıkaya · Answer 3 · Tue Jan 23 2024 17:09:50 GMT+0800 (China Standard Time)

@adityak74 As far as I remember I tried the following:

-Storage is created after the first request. If storage already exists, a new request has no effect, so I tried clearing the storage after each request completed.
https://crawlee.dev/docs/guides/request-storage#cleaning-up-the-storages
https://stackoverflow.com/questions/74709844/how-to-reset-crawlee-url-cache

-maxRequestsPerCrawl
https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlerOptions#maxRequestsPerCrawl
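On the storage point, an alternative to clearing shared storage after each request might be to give every invocation its own storage name, so two crawls never touch the same queue. This is a minimal sketch, not something I have wired into the crawler: `queueNameFor` and its naming scheme are hypothetical, and restricting the name to lowercase letters, digits and hyphens is my assumption for keeping it safe as a queue or directory name.

```typescript
// Process-local counter so two calls in the same millisecond still differ.
let invocationCounter = 0;

// Hypothetical helper: builds a storage name that is unique per invocation.
function queueNameFor(jobLabel: string, now: Date = new Date()): string {
  // Reduce the label to lowercase letters, digits and single hyphens.
  const slug =
    jobLabel
      .toLowerCase()
      .replace(/[^a-z0-9]+/g, "-")
      .replace(/^-+|-+$/g, "") || "crawl";
  // Keep only the digits of the ISO timestamp (date, time, milliseconds).
  const stamp = now.toISOString().replace(/[^0-9]/g, "");
  invocationCounter += 1;
  return `${slug}-${stamp}-${invocationCounter}`;
}
```

The resulting name could then be handed to something like `RequestQueue.open(name)`. The counter is only unique within one process, so if crawls run in separate processes, something like the process ID would also need to go into the name.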

I think the guide below covers what we want, so I'll try to integrate it into the existing code.
https://crawlee.dev/docs/guides/parallel-scraping