Ziinc / crawldis-old

app/requestor: Can receive a Request from Web and begin crawling

Ziinc opened this issue

commented

Once a crawl job has started (#4), initial requests from Web will be passed to the Requestor pool leader.

A few ways to handle crawling:

  • Request delegation: the Requestor leader handles delegation, enqueuing requests to specific Requestors based on their queue sizes.
    • drawback: if a node goes down, all requests enqueued on that node are lost.
  • Central cache: all requests are stored in a single central cache, and Requestors pop new requests off it to crawl.
    • drawback: if the cache node goes down, all requests for the cluster are lost.
  • Distributed cache: using a CRDT, every Requestor holds a replica of the cache, giving data redundancy and fault tolerance.
    • drawback: requests may be crawled multiple times, since replication is not atomic.
  • Distributed cache with consensus crawling: use a CRDT to cache the requests, but a consensus algorithm to determine the crawl sequence.
    • drawback: reaching consensus would be slow and would not scale to fast-crawling requirements.

Delta CRDTs for fast distributed cache syncing look promising:

https://hexdocs.pm/delta_crdt/DeltaCrdt.html#start_link/2
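As a quick sanity check of the syncing behaviour, here is a minimal sketch of two DeltaCrdt replicas sharing a request entry, modelled on the AWLWWMap example in the delta_crdt docs (the put/get/set_neighbours calls assume a recent delta_crdt version; the sync interval and request shape are arbitrary):

```elixir
# Two CRDT replicas, as they would run on two separate Requestor nodes.
{:ok, crdt_a} = DeltaCrdt.start_link(DeltaCrdt.AWLWWMap, sync_interval: 50)
{:ok, crdt_b} = DeltaCrdt.start_link(DeltaCrdt.AWLWWMap, sync_interval: 50)

# Neighbours must be set in both directions for two-way syncing.
DeltaCrdt.set_neighbours(crdt_a, [crdt_b])
DeltaCrdt.set_neighbours(crdt_b, [crdt_a])

# Store a request on replica A, keyed by URL.
DeltaCrdt.put(crdt_a, "https://example.com", %{url: "https://example.com"})

# After the sync interval elapses, the entry is readable from replica B.
Process.sleep(100)
DeltaCrdt.get(crdt_b, "https://example.com")
# => %{url: "https://example.com"}
```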

22 Apr Updates

Decided to experiment with DeltaCrdt for data syncing across nodes. We can also implement a separate storage mechanism for the CRDT, which opens the door to using ETS for keeping data in memory.

  • Able to store a request in the request storage worker (currently uses Crawly's GenServer state storage mechanism).
  • Able to fetch the response (using Crawly.Worker.get_response/1). However, Crawly.Worker performs full-blown request-response processing, which is not what we want: the Requestor should only fetch the response and parse it (see the sketch after this list).
  • #7

Requestor also needs to be able to store the crawl's config.

  • Using the same Crawly mental model, a crawl's config comprises a spider's start URLs, the parsing logic (request and parsed-item extraction), and the parsed-item processing logic.
  • Technically, if we don't want to process the parsed items, an empty parsed-item-processing config would simply skip the Processor.
  • This means that only the start URLs and the parsing config are required for a minimal crawl (a hypothetical shape is sketched below).
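A hypothetical shape for that config, following the Crawly mental model above (all field names here are assumptions, not an existing API; Floki is used only to make the parser concrete):

```elixir
# Hypothetical crawl config: only start_urls and the parsing logic are mandatory.
crawl_config = %{
  # Where the crawl begins.
  start_urls: ["https://example.com"],
  # Parsing logic: given a response body, return follow-up requests and parsed items.
  parser: fn body ->
    {:ok, document} = Floki.parse_document(body)
    links = document |> Floki.find("a") |> Floki.attribute("href")
    %{requests: links, items: []}
  end,
  # Parsed-item processing logic; leaving this empty skips the Processor.
  processors: []
}
```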

24 Apr Update

Request queuing requirements: each request needs to be queued in a distributed fashion while ensuring that Requestors do not duplicate work. As such, a good way to minimize overlap is to let Requestors "claim" requests before actually doing any work on them. Requestors can only claim unclaimed requests.

So the lifecycle of a request in the queue is:

unclaimed -> claimed -> popped

When popped from the queue, it no longer appears in the queue.

Internal state held in the CRDT is "https://www...." => {:unclaimed, %Request{...}}. A sketch of the queue operations follows the checklist below.

  • can queue a request
  • can claim a request
  • can pop a request
  • can replicate the queue across nodes
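A minimal sketch of those operations over a DeltaCrdt-backed map (the RequestQueue module and the claimed-tuple shape are assumptions, and it relies on the put/get/delete/to_map calls of a recent delta_crdt version). Replication across nodes comes from setting the CRDT neighbours as shown earlier; because replication is not atomic, two nodes could still briefly claim the same request, which is the accepted trade-off of the CRDT approach:

```elixir
defmodule RequestQueue do
  @moduledoc "Hypothetical sketch of a CRDT-backed request queue with claim semantics."

  # Enqueue a request as unclaimed, keyed by URL.
  def queue(crdt, url, request) do
    DeltaCrdt.put(crdt, url, {:unclaimed, request})
  end

  # Claim the first unclaimed request so that other Requestors skip it.
  def claim(crdt) do
    unclaimed =
      crdt
      |> DeltaCrdt.to_map()
      |> Enum.find(fn
        {_url, {:unclaimed, _request}} -> true
        _entry -> false
      end)

    case unclaimed do
      nil ->
        :none

      {url, {:unclaimed, request}} ->
        DeltaCrdt.put(crdt, url, {:claimed, request})
        {:ok, url}
    end
  end

  # Pop a claimed request: remove it from the queue entirely and hand it to the caller.
  def pop(crdt, url) do
    case DeltaCrdt.get(crdt, url) do
      {:claimed, request} ->
        DeltaCrdt.delete(crdt, url)
        {:ok, request}

      _other ->
        :error
    end
  end
end
```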

Middlewares, such as retry logic, are optional. Internally, all Crawly middlewares should be optional, hence we will omit them for now.

commented

Barebones crawling and parsing have been implemented.

More request/fetcher/parser/processor pipeline modules have yet to be fleshed out and implemented, but basic HTTP crawling is working.