TBD54566975 / dwn-sdk-js

Decentralized Web Node (DWN) Reference implementation

Home Page: https://identity.foundation/decentralized-web-node/spec/

Tweak reads to not require a recordId, and just return the latest created record matching a query filter

csuwildcat opened this issue

We could tweak Reads to not require a recordId, such that if someone passed a protocol/protocolPath, it would just return the latest record added for the indicated filter.

What's the use case for this? This is not an intuitive behavior for a Read to me. This feels like a specific query.

Maybe we're looking for a limit option on RecordsQuery?

The use case is so that if I have a record type avatar under a Profile protocol, I can do a Read of Profile Protocol + the avatar protocolPath and know I am getting the bytes for the singular latest record under that bucket.

Does having limit: 1 or possibly latest: true on queries solve this? If this could be solved with a small addition to RecordsQuery, I don't think we should add it to RecordsRead. My further worry is that it will set a precedent for other query-like parameters on read, muddying the difference between read and query.
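To make the `limit: 1` suggestion concrete, here is a minimal sketch over an in-memory array. The `RecordEntry` shape, the `queryRecords` function, and the `limit` option are all hypothetical illustrations, not the dwn-sdk-js API; it also assumes timestamps are fixed-width ISO 8601 UTC strings, so string order matches time order.

```typescript
// Hypothetical, simplified record shape; not the dwn-sdk-js message format.
interface RecordEntry {
  recordId: string;
  protocol: string;
  protocolPath: string;
  dateCreated: string; // fixed-width ISO 8601 UTC (assumption)
}

// Hypothetical query options; `limit` is the addition discussed above.
interface QueryOptions {
  limit?: number;
}

// Returns matching records newest-first, truncated to `limit` when given.
function queryRecords(
  records: RecordEntry[],
  filter: { protocol: string; protocolPath: string },
  options: QueryOptions = {}
): RecordEntry[] {
  const matches = records
    .filter(r => r.protocol === filter.protocol && r.protocolPath === filter.protocolPath)
    .sort((a, b) => (a.dateCreated < b.dateCreated ? 1 : a.dateCreated > b.dateCreated ? -1 : 0));
  return options.limit === undefined ? matches : matches.slice(0, options.limit);
}

const sampleRecords: RecordEntry[] = [
  { recordId: "a", protocol: "profile", protocolPath: "profile/avatar", dateCreated: "2023-01-01T00:00:00.000000Z" },
  { recordId: "b", protocol: "profile", protocolPath: "profile/avatar", dateCreated: "2023-05-01T00:00:00.000000Z" },
  { recordId: "c", protocol: "profile", protocolPath: "profile/name", dateCreated: "2023-06-01T00:00:00.000000Z" },
];

const latestAvatar = queryRecords(
  sampleRecords,
  { protocol: "profile", protocolPath: "profile/avatar" },
  { limit: 1 }
);
```

Note that this only returns the record metadata; fetching the data payload would still be a separate step, which is the limitation raised later in the thread.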

@LiranCohen's notes from #470 show that this is already the intention.

There is a potential to add additional fields to the description such as author, recipient, published etc for further filtering here.

I will add contextId as an option when doing a Read based on protocol + protocolPath

+1 to @diehuxx 's comments

Modifying RecordsRead in this way seems like an attempt to work around a limitation in RecordsQuery that ought to be addressed.

This use case also causes me to wonder: if we're adding query parameters to RecordsQuery, should we consider a parameter that returns the data payload regardless of size when limit: 1 is set?

I share a similar sentiment to @frankhinek and @diehuxx: this seems like syntactic sugar to me. It also feels like we are attempting to work around the design decision of generated recordIds to mimic the benefit of predefined record IDs.

We had some discussion around this during office hours and mentioned that RecordsRead is almost analogous to an HTTP GET request, which fits my mental model as well.

I think the main limitation of using RecordsQuery is not necessarily limiting it to a single record, but rather actually reading the data that comes with that record as @frankhinek pointed out.

Maybe it would be worthwhile to discuss what type of improvements we can make to DataStore in order to support something like this, as we would need to allow streaming of large data with RecordsQuery. If that ends up being the case, it almost seems to eliminate any need for RecordsRead altogether.

I did add the ability for only a parentId to be passed as an additional parameter for the RecordsRead, as I thought it would be useful to read the latest record in a path you know the parent of, i.e. game/score where you know the specific game you are looking for.

But I do agree that we should scrutinize any type of filtering on RecordsRead to avoid it getting out of hand, and I am open to removing parentId if we decide that it's not as useful for the intended purpose.

@csuwildcat and @LiranCohen, a spec and design consideration: the current PR (#470) returns the latest record when there are multiple child records, and thus relies on a query that can return messages on the order of 100s or even 1000s and is subject to sort/paging. What is the scenario that needs this? If the ask is only to support cases where a protocol path contains only one record, is it okay for us to enforce that the protocol path being read contains only one record?

@thehenrytsai from my perspective this is really for reading the latest record from any path regardless of how it's configured.

I view this as a cleaner way of doing Query + Read when you know you just need the single latest record + data for a path.

Would definitely like to get @csuwildcat's input into that.

I think the whole point here is to return the last record's data, as Read normally would, under a path if there is no recordId. I don't see any way this would result in unexpected behavior, and honestly, having it flip back and forth between working when there is 1 record and ceasing to work if there are 1+N seems like the strangest, most broken behavior of all.

@csuwildcat and @LiranCohen:

having it flip back and forth between working when there is 1 record and ceasing to work if there are 1+N seems like the strangest, most broken behavior of all.

It's not broken at all if a singleton published Profile/Image record (or the like) is the only scenario we are looking to support; this is the only scenario I was told about. Hence, I am still looking for a straight answer on the scenarios that need the latest record when there are multiple records that:

  1. belong to different parents
  2. belong to the same parent

Should be trivial to answer if the need is there.

The behavior of current PR (if I read the code correctly):

Say path foo/bar has 1,000,000 bar RecordsWrites: when handling a RecordsRead on foo/bar, the code will fetch all 1,000,000 messages to the client side, then find the latest message, then perform AuthZ. Is no one else concerned that this is inefficient? If not, why not? The "1 record" check is only my attempt to help the PR get to a minbar mergeable state without requiring sorting/paging, which would be a much larger PR. If "1 record" is a no-go, that's fine too; please educate me.

Also, what are the scenarios for reading an unpublished latest record without a recordId (which is currently allowed in the PR)?

Say, continuing from the above example, the 1,000,000 bars all have a different foo parent, and the latest bar is NOT published. A RecordsRead would return the latest bar only if the requester happens to be the recipient, or the author, or satisfies the protocol/grant auth rules. Is this the intended behavior? If so, how so? Again, just looking for clarification on scenarios since there isn't a spec on this stuff.

@thehenrytsai the primary need is to support a read-by-path-in-context to get the file that resides at a given path within a context. This will allow basic DEST http queries that mirror traditional REST behavior on GET of a given path. To do this we'd need to return the latest file, as that's the implicit expectation of a GET on a singular path, like "/profile/avatar" would return 1 image binary payload in an HTTP body. I personally didn't care about curtailing the call to only work if there was a single file, because it just didn't seem to matter, but your point about performance is a good one. Do we have any type of query that will just get the latest record by last written?

@csuwildcat Though I initially found the comparison to GET interesting, I'm convinced that returning the most recently updated record is a conflation of the object relational model with a way to "publish" data. If you're immovable on the idea of RecordsReading a protocolPath, we should explore options that separate the two concerns. Off the top of my head: we could add a "highlighted" field to RecordsWrite, where records of a given protocolPath may only have one record with highlighted: true, and the highlighted record is the one returned when reading by protocolPath. I'm sure there are even better solutions we could come up with if we start back from the scenarios you want to support.

@csuwildcat,

read-by-path-in-context to get the file that resides at a given path within a context

Can you clarify the above? I am probably interpreting it incorrectly: the current PR does NOT support filtering by contextId in any way. When a protocol path is supplied in a RecordsRead, it fetches all records having that path across all contexts to find the latest one. Hence my final paragraph in the previous comment: unless the records are all "published", not everyone can read the latest one (most probably can't), which is odd behavior.

@thehenrytsai I misspoke about contextual specificity - it's just path-based. Yes, we want a protocol path query to return the latest file under it that is published.

I'm just going to restate the requirement/goal and let that determine the course: we need a Read that responds to a path-centric fetch the way a GET would pull THE file (notice the singular) at the path example.com/company/logo, such that the actual bytes of the logo image are returned from the invocation, not just the json metadata message. Users don't want to deal with any of the juggling, they want the file at paths like example.com/pages/home to return THE home page HTML file, and not being able to do so without contortions or multiple calls is a poor developer experience.

I agree that this greatly improves developer experience.

Wrt performance, there is really no way to get around that until we add pagination to MessageStore. Even without this feature the user would perform a RecordsQuery on foo/bar which could return 1,000,000 records just to get the latest recordId and perform a RecordsRead to get the data.

@csuwildcat There is one remaining question/requirement that @diehuxx brought up during review.

When performing a RecordsRead on a path, do we expect to get the most recently CREATED or most recently UPDATED record?
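The created-versus-updated question changes which record a path read returns. The sketch below assumes a simplified record with hypothetical `dateCreated` and `dateUpdated` fields (fixed-width ISO 8601 UTC strings); it is an illustration of the ambiguity, not how the PR decides it.

```typescript
// Hypothetical record shape with separate creation and update timestamps.
interface TimestampedRecord {
  recordId: string;
  dateCreated: string; // fixed-width ISO 8601 UTC (assumption)
  dateUpdated: string; // hypothetical name for the last-write timestamp
}

// Picks the record with the greatest value for the chosen timestamp field.
function pickLatestBy(
  records: TimestampedRecord[],
  key: "dateCreated" | "dateUpdated"
): TimestampedRecord | undefined {
  let latest: TimestampedRecord | undefined;
  for (const record of records) {
    if (latest === undefined || record[key] > latest[key]) {
      latest = record;
    }
  }
  return latest;
}

const candidates: TimestampedRecord[] = [
  // Created first, but edited most recently.
  { recordId: "old-but-edited", dateCreated: "2023-01-01T00:00:00.000000Z", dateUpdated: "2023-06-01T00:00:00.000000Z" },
  // Created later, never edited.
  { recordId: "new-untouched", dateCreated: "2023-03-01T00:00:00.000000Z", dateUpdated: "2023-03-01T00:00:00.000000Z" },
];

const byCreated = pickLatestBy(candidates, "dateCreated"); // the "new-untouched" record
const byUpdated = pickLatestBy(candidates, "dateUpdated"); // the "old-but-edited" record
```

The two orderings disagree as soon as an older record is edited, so whichever is chosen needs to be stated explicitly.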

@csuwildcat, thanks for clarification!

we want a protocol path query to return the latest file under it that is published.

The current implementation also does not filter on published, as long as it is the latest, it gets returned.
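For comparison, the stated goal (the latest record under a path that is published) could be sketched as a filter applied before picking the latest. The `PathRecord` shape and `latestPublished` helper are hypothetical and operate on an in-memory array; timestamps are assumed to be fixed-width ISO 8601 UTC strings.

```typescript
// Hypothetical, simplified record shape for illustration only.
interface PathRecord {
  recordId: string;
  protocolPath: string;
  published: boolean;
  dateCreated: string; // fixed-width ISO 8601 UTC (assumption)
}

// Returns the newest published record at the path, ignoring unpublished ones.
function latestPublished(records: PathRecord[], protocolPath: string): PathRecord | undefined {
  let latest: PathRecord | undefined;
  for (const record of records) {
    if (record.protocolPath !== protocolPath || !record.published) {
      continue; // skip other paths and unpublished records
    }
    if (latest === undefined || record.dateCreated > latest.dateCreated) {
      latest = record;
    }
  }
  return latest;
}

const logoRecords: PathRecord[] = [
  { recordId: "logo-v1", protocolPath: "company/logo", published: true, dateCreated: "2023-01-01T00:00:00.000000Z" },
  { recordId: "logo-draft", protocolPath: "company/logo", published: false, dateCreated: "2023-04-01T00:00:00.000000Z" },
];

const visibleLogo = latestPublished(logoRecords, "company/logo");
```

Without the `published` check, the newer draft would be selected and most requesters would then fail authorization, which is the odd behavior described above.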

Your reiteration of the feature goal seems to reaffirm my original understanding (unless there are further changes in today's office hour):

You are mainly interested in enabling the fetching of the record data of a particular protocol path that is expected to be published and singleton for a given protocol.

I am committed to support the scenario you described above and was reviewing the current PR with the above understanding, but am not interested in adding unclear/unspecified behavior beyond that, which is great and as it should be IMO.

@LiranCohen, you are right, having a proper limit on fetch is the way to go, everything else seems like a temporary bandaid or hack!

One idea @csuwildcat and I tossed around in office hours today: Requiring parentId in addition to protocol + protocolPath.

Spec

A RecordsRead must have exactly one of the following

  1. A recordId.
  2. A protocol + protocolPath. If the protocolPath is a root record, then parentId is prohibited. Otherwise, parentId is required.
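As a rough illustration of the two rules above, the check could look like the following. The `ReadFilter` shape and `validateReadFilter` function are hypothetical, and the sketch assumes a root record is one whose protocolPath has a single segment (no "/").

```typescript
// Hypothetical filter shape for a RecordsRead; not the dwn-sdk-js type.
interface ReadFilter {
  recordId?: string;
  protocol?: string;
  protocolPath?: string;
  parentId?: string;
}

// Returns an error message if the filter violates the proposed rules, else undefined.
function validateReadFilter(filter: ReadFilter): string | undefined {
  const hasRecordId = filter.recordId !== undefined;
  const hasAnyPathField = filter.protocol !== undefined || filter.protocolPath !== undefined;

  if (hasRecordId && hasAnyPathField) {
    return "provide either recordId or protocol + protocolPath, not both";
  }
  if (hasRecordId) {
    return undefined; // option 1: read by recordId
  }
  if (filter.protocol === undefined || filter.protocolPath === undefined) {
    return "provide a recordId, or both protocol and protocolPath";
  }

  // Assumption: a root record's protocolPath has exactly one segment.
  const isRootPath = !filter.protocolPath.includes("/");
  if (isRootPath && filter.parentId !== undefined) {
    return "parentId is prohibited when protocolPath is a root record";
  }
  if (!isRootPath && filter.parentId === undefined) {
    return "parentId is required when protocolPath is not a root record";
  }
  return undefined; // option 2: read by protocol + protocolPath
}
```

For example, `validateReadFilter({ protocol: "games", protocolPath: "game/score" })` returns the "parentId is required" message, while adding a `parentId` makes the same filter valid.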

Rationale

A record cannot be uniquely identified by a protocolPath even if that record is a singleton*, and pulling the most recent record for a given protocolPath produces undesirable edge cases. We CAN uniquely identify a singleton record by its parentId and protocolPath.

*The current design for singleton AFAIU boils down to "one record of this type for a given parentId" (#467). Personally I like that design, but there's still active debate.

Adding extra requirement from @csuwildcat to support/consider:

There is a desire to NOT require/allow parentId as long as the path leads to a global singleton (only one record in the entire protocol).

Two obvious ways of implementing this when recordId is not given:

  1. Introduce and implement the concept of a limit in protocol configuration and fetch the protocol config. Verify that limit: 1 is specified in every layer of the hierarchy matching the protocol path given in the RecordsRead.
  2. Retrieve record directly from the store and confirm that the record count for each layer of the path is 1.

I think the implementor can decide the approach. But short term, approach 2 seems quicker to ship (not necessarily more performant), because it does not depend on yet another potentially large discussion/feature/PR ($limit).
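Approach 2 could be sketched roughly as follows, counting records at each layer of the path against an in-memory array. The `StoredRecord` shape and `isGlobalSingletonPath` helper are hypothetical; a real implementation would count in the store rather than loading every record.

```typescript
// Hypothetical, simplified record shape for illustration only.
interface StoredRecord {
  recordId: string;
  protocol: string;
  protocolPath: string;
}

// True only if every layer of the path ("a", "a/b", "a/b/c", ...) holds exactly one record.
function isGlobalSingletonPath(
  records: StoredRecord[],
  protocol: string,
  protocolPath: string
): boolean {
  const segments = protocolPath.split("/");
  for (let depth = 1; depth <= segments.length; depth++) {
    const layerPath = segments.slice(0, depth).join("/");
    const count = records.filter(
      r => r.protocol === protocol && r.protocolPath === layerPath
    ).length;
    if (count !== 1) {
      return false; // zero or multiple records at this layer
    }
  }
  return true;
}

const stored: StoredRecord[] = [
  { recordId: "p1", protocol: "profile", protocolPath: "profile" },
  { recordId: "a1", protocol: "profile", protocolPath: "profile/avatar" },
  { recordId: "g1", protocol: "games", protocolPath: "game" },
  { recordId: "g2", protocol: "games", protocolPath: "game" },
  { recordId: "s1", protocol: "games", protocolPath: "game/score" },
];

const avatarIsSingleton = isGlobalSingletonPath(stored, "profile", "profile/avatar"); // true
const scoreIsSingleton = isGlobalSingletonPath(stored, "games", "game/score"); // false: two game records
```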

While I was putting myself to sleep last night thinking about this stuff, it occurred to me that a simple yet generalized spec for RecordsRead could be to:

Support the same filter as RecordsQuery would, but returning success (record + data) only if the query returns exactly 1 record. Error out otherwise.

This is rather intuitive to understand IMO, flexible, and would render discussion around parentId, latest record, limit etc. mostly moot. It also seems to meet all use cases discussed so far.

Am I onto something, or do I just need more sleep, @csuwildcat, @diehuxx, @LiranCohen?
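A minimal sketch of that "exactly one match or error" rule, using a hypothetical `readExactlyOne` helper over an in-memory array (the names and result shape are illustrative, not part of dwn-sdk-js):

```typescript
// Result type: either the single matching record or an error message.
type ReadResult<T> =
  | { ok: true; record: T }
  | { ok: false; error: string };

// Succeeds only when the filter matches exactly one record.
function readExactlyOne<T>(records: T[], matches: (record: T) => boolean): ReadResult<T> {
  const found = records.filter(matches);
  const only = found[0];
  if (found.length !== 1 || only === undefined) {
    return { ok: false, error: `expected exactly 1 matching record, found ${found.length}` };
  }
  return { ok: true, record: only };
}

// Hypothetical sample data.
const pathRecords = [
  { recordId: "avatar-1", protocolPath: "profile/avatar" },
  { recordId: "score-1", protocolPath: "game/score" },
  { recordId: "score-2", protocolPath: "game/score" },
];

const singleAvatar = readExactlyOne(pathRecords, r => r.protocolPath === "profile/avatar");
const ambiguousScore = readExactlyOne(pathRecords, r => r.protocolPath === "game/score");
```

Here the avatar read succeeds, while the score read fails because two records match.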

I talked to @diehuxx yesterday and she caught me up on the latest ideas that were thrown around.

I'm still digesting the idea of failing if the record has more than one result.

From my understanding the main intent with having it fail is to prevent some sort of "foot gun" where a user gets back a result that wasn't what they intended to get. So having ANY result at all lets the user know that there was only 1 result to begin with.

I think this has some merit: it makes the intent really clear, but it is very limited.

So this would be useful for things like profile and profile/avatar, but less useful for things like game/score or stock/tick.

@thehenrytsai I tossed around that idea initially, of just allowing users to optionally include any filters that are available in RecordsQuery, I don't think that's really bad in any way... it gives users full control, I'm just still unsure about the failure aspect.

I do think that in the 'real world' examples I've been running through my head, parentId seems to be the most useful: you could perform a RecordsQuery for a list of game records and then, having the parentId of the game you want, RecordsRead the latest score of that game on demand.
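The game/score flow described here might look like the sketch below. `GameRecord`, `listByPath`, and `readLatestChild` are hypothetical stand-ins for a RecordsQuery and a parentId-scoped RecordsRead over in-memory data, with fixed-width ISO 8601 UTC timestamps assumed.

```typescript
// Hypothetical, simplified record shape for illustration only.
interface GameRecord {
  recordId: string;
  protocolPath: string;
  parentId?: string;
  dateCreated: string; // fixed-width ISO 8601 UTC (assumption)
}

// Stand-in for a RecordsQuery: all records at a path.
function listByPath(records: GameRecord[], protocolPath: string): GameRecord[] {
  return records.filter(r => r.protocolPath === protocolPath);
}

// Stand-in for a RecordsRead scoped by parentId: the newest child at a path.
function readLatestChild(
  records: GameRecord[],
  protocolPath: string,
  parentId: string
): GameRecord | undefined {
  let latest: GameRecord | undefined;
  for (const record of records) {
    if (record.protocolPath !== protocolPath || record.parentId !== parentId) {
      continue;
    }
    if (latest === undefined || record.dateCreated > latest.dateCreated) {
      latest = record;
    }
  }
  return latest;
}

const gameData: GameRecord[] = [
  { recordId: "game-1", protocolPath: "game", dateCreated: "2023-01-01T00:00:00.000000Z" },
  { recordId: "game-2", protocolPath: "game", dateCreated: "2023-01-02T00:00:00.000000Z" },
  { recordId: "score-a", protocolPath: "game/score", parentId: "game-1", dateCreated: "2023-02-01T00:00:00.000000Z" },
  { recordId: "score-b", protocolPath: "game/score", parentId: "game-1", dateCreated: "2023-02-05T00:00:00.000000Z" },
  { recordId: "score-c", protocolPath: "game/score", parentId: "game-2", dateCreated: "2023-03-01T00:00:00.000000Z" },
];

const games = listByPath(gameData, "game");
const latestScoreForGame1 = readLatestChild(gameData, "game/score", "game-1");
```

Scoping by parentId means the newer score under game-2 does not shadow the latest score of game-1.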

But that's not a hill I'm willing to die on, just still digesting the idea of failing if there is more than just a single record and what use-cases that satisfies.

Some might not believe it but this is now done by #470!!