Add support for concurrent invocations to crawl
adityak74 opened this issue · comments
PlaywrightCrawler creates a lock file and fails when the crawl is invoked concurrently. The class exposes a property, running,
that we can check to see whether an instance is already active. We should spawn a process with Playwright to handle the crawl job.
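As a stopgap until separate processes are in place, concurrent invocations could be queued so that only one crawl runs at a time. The following is a minimal sketch of that idea, not Crawlee's API: SerialCrawlRunner and its run method are hypothetical names, and the job passed in stands for whatever function actually starts the crawl.

```typescript
// Hypothetical helper (not part of Crawlee): runs crawl jobs one at a time,
// in the order they were submitted.
class SerialCrawlRunner {
  // Tail of the promise chain; each new job waits for the previous one.
  private tail: Promise<unknown> = Promise.resolve();
  // Number of jobs that are queued or currently executing.
  private pending = 0;

  // True while at least one job is queued or executing.
  get running(): boolean {
    return this.pending > 0;
  }

  // Queue a job. The returned promise settles with the job's own result.
  run<T>(job: () => Promise<T>): Promise<T> {
    this.pending += 1;
    const settle = (): void => {
      this.pending -= 1;
    };
    const result = this.tail.then(job);
    // Keep the chain usable even if this job rejects.
    this.tail = result.then(settle, settle);
    return result;
  }
}
```

Each caller would go through one shared instance, for example runner.run(() => crawler.run(urls)), where crawler and urls are placeholders for the existing crawl setup. This trades throughput for safety: jobs never overlap, so the lock file is never contended, but they also never run in parallel.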
Hey @adityak74,
I made some attempts to solve this problem, but I was not successful.
Any progress on your side?
@isarikaya could you add some details on what you tried? It would help me investigate. I haven't found a solution yet either.
@adityak74 As far as I remember, I tried the following:
- Clearing the storage. Storage is created after the first request, and once it exists a new request appears to have no effect, so I tried clearing it after each request completed.
https://crawlee.dev/docs/guides/request-storage#cleaning-up-the-storages
https://stackoverflow.com/questions/74709844/how-to-reset-crawlee-url-cache
- Setting maxRequestsPerCrawl.
https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlerOptions#maxRequestsPerCrawl
I think the guide below covers what we want, so I'll try to integrate it into the existing code.
https://crawlee.dev/docs/guides/parallel-scraping
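An alternative to clearing the storage after each request might be to give every invocation its own storage directory, so concurrent crawls never share state. The sketch below assumes the conflict comes from the shared default storage directory, which this thread has not confirmed. It relies on the CRAWLEE_STORAGE_DIR environment variable that Crawlee documents for choosing the storage location; storageDirForJob and envForJob are hypothetical helper names.

```typescript
// Hypothetical helper: derive an isolated storage directory for one crawl job.
function storageDirForJob(baseDir: string, jobId: string): string {
  // Keep only characters that are safe in a directory name.
  const safeId = jobId.replace(/[^A-Za-z0-9_-]/g, "_");
  // Trim trailing slashes so the joined path has no doubled separator.
  const root = baseDir.replace(/\/+$/, "");
  return `${root}/job-${safeId}`;
}

// Hypothetical helper: copy an environment and point Crawlee's storage
// at the job-specific directory.
function envForJob(
  baseEnv: Record<string, string | undefined>,
  baseDir: string,
  jobId: string,
): Record<string, string | undefined> {
  return { ...baseEnv, CRAWLEE_STORAGE_DIR: storageDirForJob(baseDir, jobId) };
}
```

The resulting object could be passed as the env option when spawning the child process for a crawl job, which would fit the one-process-per-job idea from the issue description. The per-job directories would still need to be deleted once a job finishes.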