BuilderIO / gpt-crawler

Crawl a site to generate knowledge files to create your own custom GPT from a URL

Home Page: https://www.builder.io/blog/custom-gpt

Add support for concurrent invocations to crawl

adityak74 opened this issue

PlaywrightCrawler creates a lock file and fails when the crawl is invoked concurrently. The class has a running property that we can check to tell whether an instance is already running. We should spawn a separate process with Playwright to handle the crawl job.
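A rough sketch of the "spawn a process" idea, assuming the lock conflict comes from concurrent crawls sharing Crawlee's default storage directory. Each job gets its own directory through the CRAWLEE_STORAGE_DIR environment variable. The names crawlJobEnv and spawnCrawlJob, the per-job directory layout, and the worker script path are hypothetical and not part of gpt-crawler.

```typescript
import { spawn } from "node:child_process";
import * as path from "node:path";

// Build the environment for one crawl job. CRAWLEE_STORAGE_DIR tells Crawlee
// where to keep its storage; the one-directory-per-job layout is an
// assumption made for this sketch.
export function crawlJobEnv(
  jobId: string,
  baseDir: string = "storage",
  parentEnv: Record<string, string | undefined> = process.env,
): Record<string, string | undefined> {
  return { ...parentEnv, CRAWLEE_STORAGE_DIR: path.join(baseDir, jobId) };
}

// Start a (hypothetical) worker script in its own Node.js process so that
// concurrent crawls do not share storage or lock files.
export function spawnCrawlJob(jobId: string, workerScript: string, url: string) {
  return spawn(process.execPath, [workerScript, url], {
    env: crawlJobEnv(jobId),
    stdio: "inherit",
  });
}
```

Something like spawnCrawlJob("job-1", "./crawl-worker.js", "https://example.com") would then run one crawl per process. Whether this fully avoids the lock error has not been verified here.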

@BikeF

Hey @adityak74,

I made some attempts to solve this problem, but I was not successful.
Any progress on your side?

@isarikaya can you add some details on what you tried? It would help me investigate. I haven't found a solution yet either.

@adityak74 As far as I remember I tried the following:

- Storage is created after the first request. If storage already exists, a new request has no effect, so I tried clearing the storage after the request completed.
https://crawlee.dev/docs/guides/request-storage#cleaning-up-the-storages
https://stackoverflow.com/questions/74709844/how-to-reset-crawlee-url-cache

- maxRequestsPerCrawl
https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlerOptions#maxRequestsPerCrawl
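The "clear after the request completed" attempt could look roughly like this if each crawl gets its own throwaway storage directory. This is only a sketch using plain Node.js file-system calls, not Crawlee's own purge helpers. The function names are hypothetical, and passing the directory on to the crawler is left to the job callback.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Create a fresh, uniquely named storage directory for one crawl job.
export function createJobStorage(prefix: string = "crawl-job-"): string {
  return fs.mkdtempSync(path.join(os.tmpdir(), prefix));
}

// Delete a job's storage directory; does nothing if it is already gone.
export function removeJobStorage(storageDir: string): void {
  fs.rmSync(storageDir, { recursive: true, force: true });
}

// Run a job against its own storage directory and always clean up afterwards,
// even if the job throws.
export async function withJobStorage<T>(
  job: (storageDir: string) => Promise<T>,
): Promise<T> {
  const storageDir = createJobStorage();
  try {
    return await job(storageDir);
  } finally {
    removeJobStorage(storageDir);
  }
}
```

Because every invocation starts from an empty directory, a later request should not be skipped just because an earlier one left storage behind.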

I think the guide below covers what we want, so I'll try to integrate it into the existing code.
https://crawlee.dev/docs/guides/parallel-scraping