ipni / index-provider

📢 Index Provider

Create a new interface to query sync logs for new IPNI-sync

LexLuthr opened this issue · comments

We are replacing Graphsync with HTTP-libp2p. Boost tracks all the retrievals for IPNI and presents them in a UI page. This helps SPs get an overview of their IPNI sync. We need something similar for the new HTTP-libp2p sync to avoid a feature regression in Boost.

CC: @davidd8 @TorfinnOlsen for visibility

Can you point me to a description of what needs to be present in the logs, or the API you want emulated?

Is this something that the index-provider can provide, or does the ingestion need to be verified against one or more indexers?

Can this be queried using the ipni-cli utility? All information on the provider’s ad chain, as well as the indexer’s ingestion of it, is available through ipni-cli.

When sync was done using Graphsync (default), it would emit events like new request, progress and complete. We could save these events in a DB and display them as logs in UI.
https://github-production-user-asset-6210df.s3.amazonaws.com/88259624/264365716-47f22eae-0b0b-46c9-aab1-84a6e592d476.png

After switching to IPNI sync, these events don't exist anymore. So it becomes hard for an SP to understand whether their Boost instance is syncing with indexers or not. We want an API that we can query to get similar events. We don't necessarily need to use the same event emitter.

I am not sure that ipni-sync events will exactly translate to what graphsync was providing (or if that was even that accurate when used with IPNI).

For example, ipni makes a separate HTTP request for each advertisement in the chain of ads, until it gets to one it has already processed. After the ads are synced, ipni makes a separate request for each entry chunk in each ad, if that ad has multihash chunks. If ipni already has the ad in its CAR mirror, then a request for the ad's entry chunks is not sent to the provider. So, if trying to show ingestion in progress, the provider may not have a great picture overall.
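To make the request pattern concrete, here is a minimal sketch that models it. It is not the indexer's actual code: the `Ad` type, `plan_sync_requests`, and the choice to fetch entries oldest ad first are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Ad:
    """Hypothetical stand-in for an advertisement on the provider's chain."""
    cid: str
    prev: Optional[str]  # CID of the previous ad, None at the start of the chain
    entry_chunks: List[str] = field(default_factory=list)


def plan_sync_requests(
    chain: Dict[str, Ad],
    head: str,
    already_processed: Set[str],
    car_mirror: Set[str],
) -> List[Tuple[str, str]]:
    """Return the (kind, cid) requests the provider would see, in order."""
    requests: List[Tuple[str, str]] = []
    to_ingest: List[str] = []

    # Phase 1: one request per ad, newest first, stopping at an ad
    # the indexer has already processed.
    cur: Optional[str] = head
    while cur is not None and cur not in already_processed:
        requests.append(("ad", cur))
        to_ingest.append(cur)
        cur = chain[cur].prev

    # Phase 2: one request per entry chunk. Ads already held in the CAR
    # mirror generate no entry requests to the provider at all.
    # Assumption: ads are ingested oldest first.
    for ad_cid in reversed(to_ingest):
        if ad_cid in car_mirror:
            continue
        for chunk_cid in chain[ad_cid].entry_chunks:
            requests.append(("entries", chunk_cid))
    return requests
```

In this model an ad served from the CAR mirror shows up on the provider as an ad request with no entry requests after it, which is why the provider's view of ingestion is incomplete.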

For observing the ingestion of the data in a single advertisement, the index-provider could log an event for each multihash block. To make that useful, the index-provider would need to know 1) which ad each block is associated with, and 2) the total number of blocks in the ad, so that an indication of progress could be given.

  1. This may be difficult since the requests for multihash blocks are completely separate from requests for ads, and may come at entirely different times. That means the index-provider would need to keep a database of block CIDs mapped to their ad CID.
  2. This may not be practical because it requires reading the advertisement's entries blocks to count them, and it also depends on 1.
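As a rough illustration of what those two requirements imply, here is a sketch of an in-memory tracker. `EntryProgressTracker` and its methods are hypothetical names, not part of index-provider, and a real version would need persistent storage rather than dicts.

```python
from typing import Dict, Iterable, Optional, Set, Tuple


class EntryProgressTracker:
    """Hypothetical per-ad progress tracker for entry block requests."""

    def __init__(self) -> None:
        self._block_to_ad: Dict[str, str] = {}  # requirement 1: block CID -> ad CID
        self._total: Dict[str, int] = {}         # requirement 2: ad CID -> block count
        self._served: Dict[str, Set[str]] = {}   # ad CID -> blocks requested so far

    def register_ad(self, ad_cid: str, block_cids: Iterable[str]) -> None:
        # This is the costly step: every entries block of the ad has to be
        # read once just to learn its CID and to count the blocks.
        blocks = set(block_cids)
        for block_cid in blocks:
            self._block_to_ad[block_cid] = ad_cid
        self._total[ad_cid] = len(blocks)
        self._served[ad_cid] = set()

    def on_block_request(self, block_cid: str) -> Optional[Tuple[str, int, int]]:
        """Record a request; return (ad_cid, served, total), or None if unknown."""
        ad_cid = self._block_to_ad.get(block_cid)
        if ad_cid is None:
            return None
        self._served[ad_cid].add(block_cid)  # repeated requests count once
        return ad_cid, len(self._served[ad_cid]), self._total[ad_cid]
```

Even with this in place, the tracker cannot tell which client sent a request, so a crawler walking the chain would look the same as an indexer ingesting it.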

Would it be useful to log the time that a request was received for an advertisement or entries data? If so, how will the index-provider know whether the requester is an indexer, or just some other utility crawling the provider's ad chain? Does the index-provider know what the source address for the indexer is? If this is coming over libp2p then we could probably look at the peer ID, but not if this is over plain HTTP.

If the goal is just to show when an indexer is making requests, that may be easy to do, but probably the best indication of indexing activity is the provider information that the GUI is already pulling from /providers/<provider_id>. Watching for changes in the LastAdvertisement field is obviously useful. Also, the Lag field tells whether indexing is actively occurring, and the number indicates how many ads are left to process. The LastError and LastErrorTime fields may be useful to indicate if/when there was a problem that is blocking indexing.
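A minimal sketch of how a UI could interpret those fields, given an already-decoded provider record. The field names are the ones mentioned above, but the exact JSON encoding (for example whether LastAdvertisement arrives as a plain string or as a `{"/": "<cid>"}` link) and the state names `ingesting`, `error`, and `idle` are assumptions for illustration.

```python
from typing import Any, Dict


def summarize_provider_info(info: Dict[str, Any]) -> Dict[str, Any]:
    """Classify indexing state from a decoded provider-info record."""
    last_ad = info.get("LastAdvertisement")
    # Assumption: a CID may be encoded as {"/": "<cid>"}; accept a string too.
    if isinstance(last_ad, dict):
        last_ad = last_ad.get("/")

    lag = info.get("Lag") or 0
    last_error = info.get("LastError") or None

    if lag > 0:
        state = "ingesting"  # Lag is only reported while ingestion is in progress
    elif last_error:
        state = "error"      # no ingestion in progress and an error on record
    else:
        state = "idle"

    return {
        "last_ad": last_ad,
        "state": state,
        "lag": lag,
        "last_error": last_error,
        "last_error_time": info.get("LastErrorTime") if last_error else None,
    }
```

Note that `idle` here only means no lag is being reported; it does not by itself prove the indexer has caught up with the head of the ad chain.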

I think we can get the last error and its time from the indexer side. It should be fairly easy to display. For the sync itself and tracking the lag, I have some questions.

  1. Do we have a readily available Lag number on the indexer side? It should require no computation.
  2. If not, then I think it would make more sense to walk the ad chain locally on the provider side and generate a lag based on what is latest on the indexer side. Does that make sense?

  1. Yes, and you already get it as a field in the provider information. The Lag field is only present while ingestion is in progress and shows the progress of that ingestion (counting down to 0).
  2. That can be done also, and may be useful when the lag is not available. Such a situation could arise if the indexer is not doing ingestion due to some error, but has in the past. Then a distance can be calculated between the last advertisement seen by the indexer and the most recent advertisement on the ad chain. The ipni-cli does exactly this when getting the provider information with the --distance flag.
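The distance calculation in point 2 could be sketched as follows. This is an illustration, not how ipni-cli implements it: `ad_distance` is a hypothetical helper, and `prev_links` stands in for whatever local lookup returns an ad's previous-ad link.

```python
from typing import Dict, Optional


def ad_distance(
    prev_links: Dict[str, Optional[str]],
    head: str,
    last_seen: str,
) -> Optional[int]:
    """Count the ads between the chain head and the indexer's last-seen ad.

    Walks from `head` (newest) toward the start of the chain by following
    previous-ad links. Returns 0 when the indexer is at the head, and None
    when `last_seen` is not reachable from `head`.
    """
    distance = 0
    cur: Optional[str] = head
    while cur is not None:
        if cur == last_seen:
            return distance
        cur = prev_links.get(cur)
        distance += 1
    return None
```

The walk terminates because ads are content-addressed and link only to older ads, so the chain cannot contain a cycle. A `None` result is worth surfacing separately in the UI, since it means the indexer's last ad is not on the provider's current chain.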

I think it would be better to implement this internally within Boost and show the lag based on the latest value we get from cid.contact. The final form would look something like:

Latest ad on indexer: baga.... (0 ads behind)
Last sync error on indexer: ""/err (x time ago)
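A small sketch of rendering those two lines from values Boost would already have. `format_age` and `render_status` are hypothetical names, and the relative-time format is an assumption.

```python
from datetime import datetime, timezone
from typing import Optional


def format_age(then: datetime, now: datetime) -> str:
    """Render the elapsed time between two datetimes as a coarse 'x ago' string."""
    seconds = int((now - then).total_seconds())
    if seconds < 60:
        return f"{seconds}s ago"
    if seconds < 3600:
        return f"{seconds // 60}m ago"
    if seconds < 86400:
        return f"{seconds // 3600}h ago"
    return f"{seconds // 86400}d ago"


def render_status(
    last_ad: str,
    ads_behind: int,
    last_error: Optional[str],
    last_error_time: Optional[datetime],
    now: datetime,
) -> str:
    """Build the two status lines shown in the UI."""
    lines = [f"Latest ad on indexer: {last_ad} ({ads_behind} ads behind)"]
    if last_error and last_error_time is not None:
        age = format_age(last_error_time, now)
        lines.append(f'Last sync error on indexer: "{last_error}" ({age})')
    else:
        # No error on record: show an empty error string and no age.
        lines.append('Last sync error on indexer: ""')
    return "\n".join(lines)
```

Passing `now` in explicitly, for example `datetime.now(timezone.utc)`, keeps the function easy to test.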

Not needed. The implementation will be within Boost, and will rely on information from ipni-cli and cid.contact.