silverstripe / silverstripe-versioned

Versioning for Silverstripe model

Versions aren't created when owned relations are updated

ScopeyNZ opened this issue

Description

When writing a relation that is owned by a versioned DataObject, a new version is not created. There are a few ways this can be observed, and I've written a test that shows this (#196).

Steps to reproduce

  • Have a versioned dataobject that "owns" some relation
  • Update the relation (and write)
  • Note that the parent dataobject has no new version

Acceptance Criteria

  • Allows to list all changes between two versions of an owner (incl. direct and indirect owned relations)
  • Allows to list unpublished changes relative to current owner version (incl. direct and indirect owned relations)
  • Allows to determine if an owner is modified if any (direct or indirect) owned relations are modified
  • Allows to determine if intermediate owners are modified
  • Does not mark siblings and unrelated graph nodes as modified
  • Allows to determine modification status after changes have been rolled back
  • Does not mark owners as modified if an owned relation has been archived
  • Does mark owners as modified if an owned relation has been unpublished
  • Each entry uniquely points to a version of the original record
  • Groups snapshot items for each original action which triggered the snapshot
  • Tracks author and date on each group
  • Supports deep nested ownership graphs (5+ levels)
  • Supports propagating changes to multiple owners
  • Supports all owner relationships: has_one, has_many, many_many, many_many_through
  • Supports "intermediate" owners which aren't versioned (in order to reach actual owners which are versioned)
  • Supports moving of owned relations to different owners without falsifying past "snapshots"
  • Does not significantly slow down save operations
  • Does not rely on creating new versions on owners
  • Is resilient to infinite loops on potentially recursive owner structures
  • Is resilient to partial data sets (can't retroactively create this data)
  • Tracks changes in subsites and fluent languages (and can filter by them)
  • It's obvious to developers reading up about versioned that this option exists (as an opt-in for now)
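
To make a few of these criteria concrete, here is a minimal sketch of a recursive "is modified" check. It is plain PHP over an in-memory array, not the module's API: the function name `isModifiedRecursive()` and the record shape (`draft`, `live`, `archived`, `owns`) are assumptions made up for illustration. It covers owners being modified through owned relations, archived relations being ignored, unpublished relations counting as modified, siblings staying untouched, and recursion guards.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: is a record, or anything it owns (directly or
 * indirectly), different between draft and live?
 *
 * @param array<string, array{draft: int, live: ?int, archived: bool, owns: string[]}> $records
 * @param array<string, bool> $seen Keys already visited in this traversal
 */
function isModifiedRecursive(array $records, string $key, array &$seen = []): bool
{
    // Guard against recursive ownership and against partial data sets
    if (isset($seen[$key]) || !isset($records[$key])) {
        return false;
    }
    $seen[$key] = true;
    $record = $records[$key];

    // Archived relations do not mark their owners as modified
    if ($record['archived']) {
        return false;
    }
    // Draft differs from live; a null live version means unpublished
    if ($record['draft'] !== $record['live']) {
        return true;
    }
    foreach ($record['owns'] as $ownedKey) {
        if (isModifiedRecursive($records, $ownedKey, $seen)) {
            return true;
        }
    }
    return false;
}

$records = [
    'Page#1'  => ['draft' => 3, 'live' => 3,    'archived' => false, 'owns' => ['Area#1']],
    'Area#1'  => ['draft' => 1, 'live' => 1,    'archived' => false, 'owns' => ['Block#1', 'Block#2']],
    'Block#1' => ['draft' => 2, 'live' => 1,    'archived' => false, 'owns' => []],
    'Block#2' => ['draft' => 1, 'live' => 1,    'archived' => false, 'owns' => []],
    'Page#2'  => ['draft' => 1, 'live' => 1,    'archived' => false, 'owns' => ['Block#3']],
    'Block#3' => ['draft' => 4, 'live' => null, 'archived' => true,  'owns' => []],
    'Page#3'  => ['draft' => 1, 'live' => 1,    'archived' => false, 'owns' => ['Block#4']],
    'Block#4' => ['draft' => 2, 'live' => null, 'archived' => false, 'owns' => []],
    // Recursive ownership: the check must still terminate
    'Page#4'  => ['draft' => 1, 'live' => 1,    'archived' => false, 'owns' => ['Block#5']],
    'Block#5' => ['draft' => 1, 'live' => 1,    'archived' => false, 'owns' => ['Page#4']],
];

var_dump(isModifiedRecursive($records, 'Page#1'));  // bool(true): Block#1 has draft changes
var_dump(isModifiedRecursive($records, 'Block#2')); // bool(false): sibling is untouched
var_dump(isModifiedRecursive($records, 'Page#2'));  // bool(false): owned block was archived
var_dump(isModifiedRecursive($records, 'Page#3'));  // bool(true): owned block is unpublished
var_dump(isModifiedRecursive($records, 'Page#4'));  // bool(false): loop is detected
```

A traversal like this runs on every read, so it does nothing for the "does not significantly slow down save operations" trade-off in the other direction: the cost moves to whoever renders the status flag.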

Subtasks

  • Get buy-in on overall implementation from one or more bespoke teams
  • Test on a project with a large number of DataObjects (> 50) and dense version histories (> 100,000 rows)
  • Resolve rollback bug (cannot currently roll back owned changes that are strictly draft, i.e. no publish event in between)
  • Investigate viability of overriding versioned-admin versus augmenting it (e.g. with injectable graphql queries)
  • Implement snapshot-admin graphql/react UI
    • Query to merge versions and snapshots
    • Pagination
    • Sorting
    • Suppress redundant snapshot/version overlaps (i.e. on a publish event)
    • Snapshot detail view
      • Activity feed
      • Rollback
      • Preview

Out of scope subtasks: (separate story)

  • Compare two snapshot previews
  • See changes between two snapshots

Notes

  • Out of scope: Rollback behaviour, draft preview and archive view (continue to rely on existing _versions metadata)
  • Out of scope: Tracking many_many_extraFields (should use many_many_through instead)
  • Should not be used to compute rollback states or provide the basis for any state modifications
  • Does not need to support campaigns directly, beyond providing "modified" indications on the records contained in a campaign
  • Relies on cascading publish on owned relations to "reset" change status, does not track this separately
  • Solving the definition of a "tightly coupled ownership" (e.g. owned + cascade_deletes) is out of scope
  • Sort modifications are out of scope, see #180

Related

Related PRs

It'd be great to get feedback on this problem from the @silverstripe/core-team.

Another example of the impact of fixing this "bug": a page owns a has_one featured image. You edit the title of the image in asset-admin, and the previously published version of the image and the page that owns it both go into a modified state, instead of just the image.

It's important to note that next time you publish the page, you'd get the new version of the image with it anyway, so this is really an intermediary to highlight that there are draft changes in the page that you may otherwise be unaware of.

One flow-on effect of implementing a fix for this in the context of silverstripe-elemental is that every time you make a change to a nested content block for a page, you'd get a new version of the page. This is probably desired behaviour, but you'll quickly end up with a tonne of page versions.

I guess it depends how tight the ownership relationship is. You'll have cases where the owned object is nonsensical without its owner (e.g. elemental area and content blocks). You'll have other cases where the owned object can still be useful without its owner (e.g. a hero image that gets attached to a page).

Off the top of my head, my preference would be to keep the existing behavior and maybe add an extra super ownership relationship model on top.

Like some sort of private static $inherit_modifications = [ 'ElementalArea' ];?

Perhaps ... or maybe some options on the owns relations. e.g.:

private static $owns = [
    'ElementalArea' => [
        'tightlyCoupled' => true
    ],
    'HeroImage'
];

It might make sense to have that flow to the UI as well. e.g.: Prevent CMS users from directly publishing a content block and force them to always go through the parent page.

Hmmmn. Interesting point. /cc @newleeland @clarkepaul . It doesn't really make sense to individually publish blocks (etc) if changes to these relations make a draft version of the parent page. Will probably need some UX advice here.

Maybe it only makes sense to create a version of the page if:

  • For HasOne: the ID of the related record is changed
  • For HasMany: The set of IDs related to the parent is changed (additions/removals)

This would probably have to "flow through" for elemental though. An element that is added to an area should probably create a draft version of the page the elemental area belongs to.

And the order of IDs probably should matter as well. When you re-order a HasMany relation I'd expect a new version of the page.
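
As a rough illustration of that rule, the sketch below compares the related IDs before and after a write. `relationChanged()` is a hypothetical helper, not something in silverstripe-versioned, and it treats a has_one as a one-element list.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: did the set (and optionally the order) of related
 * record IDs change between two writes?
 *
 * @param int[] $before Related IDs before the write, in sort order
 * @param int[] $after  Related IDs after the write, in sort order
 */
function relationChanged(array $before, array $after, bool $orderMatters = true): bool
{
    if (!$orderMatters) {
        // Ignore ordering by comparing sorted copies
        sort($before);
        sort($after);
    }
    // array_values() normalises keys so only IDs and positions are compared
    return array_values($before) !== array_values($after);
}

// has_one: the foreign key now points at a different record
var_dump(relationChanged([12], [15]));                  // bool(true)
// has_many: a record was added
var_dump(relationChanged([1, 2], [1, 2, 3]));           // bool(true)
// has_many: same records, re-ordered
var_dump(relationChanged([1, 2, 3], [3, 1, 2]));        // bool(true)
var_dump(relationChanged([1, 2, 3], [3, 1, 2], false)); // bool(false)
```

Note that this only detects structural changes. Editing a field on an already related record leaves the ID list untouched, which is exactly the case the rule above would not version.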

Hey, I just want to flag that there's a significant risk that we wind up with an inconsistent system if we attempt to fix this without understanding the previous decisions made during the development of this during the original 4.x build.

Having @tractorcow weigh in on this will be helpful. For the rest of us, we'll want to dig through previous ticket history. I'll see if I can pick up some useful links.

Ok so these two tickets go into some discussion about the meaning of owns and its relationship to cascade_deletes

Notably: owns + cascade_deletes is the pre-existing mechanism developed to indicate "tightly coupled" objects.

I think that automatically writing the "parent" object whenever we write a child object certainly has some risks: namely, massive saturation of the versioned table, potentially reducing the ability of users to track versions. More data !== better data.

The mechanism we use internally to "segment" versions, where no specific version ID may exist for a point in time for a record, is actually "archiveDate", which is where we represent a point in time as the slice rather than the parent version number.

The problem is that we only do this for all items in the tree AFTER the top level. I.e. get a version of a record by its ID, but the versions of owned objects by date.

My suggestion is to look at allowing users to revert to specific points in time, including the top object version, rather than version IDs. The "select a version" interface becomes "select a date" interface (perhaps in addition to?).
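
A minimal sketch of what "select a date" means for lookups, assuming version rows carry a `Version` number and a `LastEdited` timestamp. `versionAsOf()` and the sample rows are made up for illustration and do not reflect how the ORM actually builds its archive queries.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: pick the version of a record that was current at a
 * given point in time. $versions stands in for rows of a _Versions table.
 *
 * @param array<int, array{Version: int, LastEdited: string}> $versions
 * @return array{Version: int, LastEdited: string}|null Null if the record did not exist yet
 */
function versionAsOf(array $versions, string $date): ?array
{
    $cutoff = new DateTimeImmutable($date);
    $match = null;
    foreach ($versions as $row) {
        if (new DateTimeImmutable($row['LastEdited']) > $cutoff) {
            continue; // written after the requested point in time
        }
        if ($match === null || $row['Version'] > $match['Version']) {
            $match = $row;
        }
    }
    return $match;
}

$pageVersions = [
    ['Version' => 1, 'LastEdited' => '2018-12-01 09:00:00'],
    ['Version' => 2, 'LastEdited' => '2018-12-03 10:00:00'],
];
$blockVersions = [
    ['Version' => 1, 'LastEdited' => '2018-12-02 11:00:00'],
    ['Version' => 2, 'LastEdited' => '2018-12-04 08:30:00'],
];

// The owner is fetched by version ID, the owned block by the owner's date
$slice = $pageVersions[1]['LastEdited'];
var_dump(versionAsOf($blockVersions, $slice)['Version']);      // int(1)
// Before the block existed there is nothing to show
var_dump(versionAsOf($blockVersions, '2018-12-01 09:00:00')); // NULL
```

Applying the same date-based lookup to the top-level record would remove the asymmetry described above, where only the owned objects are resolved by date.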

Another possible suggestion is that when we save a record in the CMS, we create a new version of that record, and bypass the "only if changes exist" check in DataObject::write(). This check should be retained as core default, but only bypassed if the specific save is triggered by a user interaction. That way there is an immediate user feedback between "I pressed save" and "I can see a new version in the history".

What I would highly caution against is "create a new owner version every time an owned version is modified".

My suggestion is to look at allowing users to revert to specific points in time, rather than version IDs.

The alternative is to only roll back whole changesets, which was one of the reasons why we opted to package all publication events into changesets.

The alternative is to only roll back whole changesets, which was one of the reasons why we opted to package all publication events into changesets.

Yes, even better. :) However, a change may be in the wrong changeset (i.e. write of a different page to the one you want to roll back).

What I would highly caution against is "create a new owner version every time an owned version is modified".

I believe that the approach we had anticipated was that showing a "modified on draft" icon might interrogate the recursively owned objects. Do you have any recollection about this @tractorcow

I could foresee performance issues with that, but I would be inclined to create separate denormalised datasets to speed this up, rather than dumping responsibility for tracking this on the Versioned field.

So I would class this ticket not so much as a bug but as an API design decision that @ScopeyNZ believes is causing problems.

Could we clarify exactly which user features this is causing issues with, perhaps using specific examples, e.g. from elemental?

I get the feeling that it has something to do with the "modified" status in the tree view, and maybe something about the rollback feature, but that's a bit of a guess tbh

I think it will be useful to reframe those things as the acceptance criteria of this card and then work out what (non-breaking) changes to our API we want to make to address this.

I believe that the approach we had anticipated was that showing a "modified on draft" icon might interrogate the recursively owned objects. Do you have any recollection about this @tractorcow

Yes, I just think that would be possibly slower, but not impossible to do. I would try to avoid doing this on many items in a list, which probably is where we would get the benefit of such a function.

I have to run off and come back to this ticket later by the way. Just a drive by opinion dump at the moment sorry. :)

Could we clarify exactly which user features this is causing issues with, perhaps using specific examples, e.g. from elemental?

Yeah, without getting caught up in the technical details, I think I can give a good example of the weird UX you can currently get, using elemental:

  • Create a blocks page. Publish it. Check the history - 2 versions, saved and published.
  • Add a bunch of blocks with some content and save/publish these all individually
  • Check the publicly visible version of the page - all your changes are shown ( ✅)
  • Open the history for a page - the latest live version is the same as before all the blocks - and the summary of the version (and the preview) shows no blocks
  • To demonstrate this potential issue further: delete a block that you just added
  • You have no history of that block you just deleted visible within the CMS now.

I get the feeling that it has something to do with the "modified" status in the tree view, and maybe something about the rollback feature, but that's a bit of a guess tbh

I didn't consider the idea of the "modified" status in the tree view actually. Currently the rollbackRecursive actually works really well because it does it based off of time (as I understand) and meets A/Cs of restoring correct sort orders (etc).

The problem is that not every action is actually captured in a version. Ordering is still a tough problem in its own right, though (refer to the "context" section of this issue and the comment further down with more examples).

I feel like adding versions when relations are specifically changed to different objects wouldn't cause a huge amount of versions - right now the lack of versions is a problem. My comment before I think makes sense:

Maybe it only makes sense to create a version of the page if:

  • For HasOne: the ID of the related record is changed
  • For HasMany: The set of IDs related to the parent is changed (additions/removals)
    This would probably have to "flow through" for elemental though. An element that is added to an area should probably create a draft version of the page the elemental area belongs to.

And the order of IDs probably should matter as well. When you re-order a HasMany relation I'd expect a new version of the page.

OK that's a bit clearer, thanks

Open the history for a page - the latest live version is the same as before all the blocks - and the summary of the version (and the preview) shows no blocks

So, this seems like the core issue. Let's start with some assertions:

  • As of SS4 we made the design decision to create a changeset for every publication event. We did this because we predicted/hoped that we might get into these issues, compared to having 2 contradictory publication models.
  • We could therefore use changesets, rather than record versions, to show a history of publications (but not a history of changes to draft :-/)

A few questions, then:

  • Can we link a bunch of changesets to the records that they impact?
  • Can we get away with just showing publications, or does our history view need to show draft changes?
  • If the answer is "no" to either of those, can we link a bunch of nested-record-changes to the object(s) that own them?

Merely nudging the version numbers seems like it might be insufficient inasmuch as we need to link the particular version of the owner/parent record to the particular versions of the owned/child records. Does our data model have support for this?

Changesets at least have a good model for grouping a block of related changes (like different files in a git commit) together. But, to date, we only use them for publications, and using them for draft changes is likely to have significant performance implications.

I feel like adding versions when relations are specifically changed to different objects wouldn't cause a huge amount of versions - right now the lack of versions is a problem.

It's not about the number of versions, it's about whether you're breaking an assumption on which the entire rest of the application depends.

In general, it seems like we have a few broad approaches for addressing this:

  • Recursively tickle the version number of every owner object when a versioned object is written (Guy's solution)
  • Recursively traverse the owned objects to look for changes, whenever building a history list.
  • A middle approach:
    • When a versioned object is written, a related-change record is written linking the recursive owner objects to this.
    • When a history list is built, source it from both direct changes and related-change records

The 3rd approach is similar to the 1st option, but introducing a new place to write it rather than overloading the meaning of Version. It's also similar to a pre-emptively written caching layer for the 2nd option.
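
The middle approach could be sketched roughly as follows. This is an in-memory stand-in, assuming a hypothetical "related change" store. `ChangeLog`, its methods and the `Page#1`-style keys are invented for illustration and are not an existing API.

```php
<?php
declare(strict_types=1);

/** In-memory stand-in for a hypothetical "related change" table. */
final class ChangeLog
{
    /** @var array<string, string[]> owned key => owner keys (the inverse of "owns") */
    private array $owners = [];

    /** @var array<int, array{owner: string, changed: string, version: int}> */
    private array $rows = [];

    public function addOwner(string $owned, string $owner): void
    {
        $this->owners[$owned][] = $owner;
    }

    /** Record a write and link it to the record itself plus every direct and indirect owner. */
    public function recordWrite(string $key, int $version): void
    {
        $seen = [];
        $queue = [$key];
        while ($queue) {
            $current = array_shift($queue);
            if (isset($seen[$current])) {
                continue; // guard against recursive ownership
            }
            $seen[$current] = true;
            $this->rows[] = ['owner' => $current, 'changed' => $key, 'version' => $version];
            foreach ($this->owners[$current] ?? [] as $owner) {
                $queue[] = $owner;
            }
        }
    }

    /** History for one record: its own writes plus writes to anything it owns. */
    public function history(string $key): array
    {
        return array_values(array_filter(
            $this->rows,
            fn(array $row): bool => $row['owner'] === $key
        ));
    }
}

$log = new ChangeLog();
$log->addOwner('Block#1', 'ElementalArea#1');
$log->addOwner('Block#2', 'ElementalArea#1');
$log->addOwner('ElementalArea#1', 'Page#1');

$log->recordWrite('Page#1', 1);
$log->recordWrite('Block#1', 1);
$log->recordWrite('Block#1', 2);
$log->recordWrite('Block#2', 1);

echo count($log->history('Page#1')), "\n";  // 4: one direct write, three related changes
echo count($log->history('Block#1')), "\n"; // 2
echo count($log->history('Block#2')), "\n"; // 1: the sibling's writes are not included
```

Each row points at a specific version of the changed record, so the owner's Version number is never touched. The cost is paid at write time, one row per owner in the chain.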

I fear that going option 2 will fall into the same trap we've been experiencing very recently on a certain key project - where code like this doesn't perform well on larger sites.

I'm not entirely sure on the differences on the first and third approach because I'm admittedly still a bit unfamiliar with the intricacies here. I'll do some delving and form an opinion over this week.

Probably worth pointing out that ChangeSet and ChangeSetItem are designed with the idea that they could be reverted after the fact ... but we didn't quite get there.

Right now, when you publish recursively, a ChangeSet is created with matching ChangeSetItems for each affected object. All of these are saved to the DB and pretty much become useless afterwards. Basically, there's no way for the user to interact with that change set data. Even if there was, there's nothing useful you can do with them right now.

Yeah, making improvements in keeping with the current api design would be nice if possible ;-)

If we were to go with a change set approach, how would that affect devs with simple ORM relationships? Very basic example from my comment earlier: a blog post has one featured image. You modify some text field on the image and save, the page doesn't go into draft state even though it will change when you publish it (in so much as its featured image). Note that the dev hasn't even seen the name "ChangeSet" during this process, so it'd need to be wrapped up under the hood.

I kind of see what you mean in terms of where you're making the change - you haven't modified the page at all, so why would you bump a version for it, but I think it's quite important to think of this holistically as content changes in general. Elemental is a good example where a page's data is mostly all stored in nested ORM relationships on the page rather than fields attached to it which are covered by the page's versioning by default. The more you do that the more and more useless versioning becomes for pages (the heart of the CMS).

When I was thinking about it the other day I had a suspicion that a new API like @maxime-rainville suggested at the top would be a safer option. It'd kind of be like cascade saving but in the opposite direction to cascade publishing at the moment.

It's hard to recommend any solution since they all have issues; Either you are inflating the number of rows / writes, or increasing the complexity / performance of reading when building history list.

For things like reverting, I agree that if we modelled things based on time and what versions/relations existed at that time the system would make a lot more sense to people.

As for publishing a block, I'd like to see that you can publish a block without need to publish the page to see the changes (It's quite important people can publish blocks independently from each other).

I need some time to read through this issue in a bit more detail as there is lots going on here, I'll be back :)

It's hard to recommend any solution since they all have issues; Either you are inflating the number of rows / writes, or increasing the complexity / performance of reading when building history list.

Fair enough; I think this is a good reason not to rush into this. However, I don't think we can leave this as a known issue so we'll need to choose something.

Given that reads are generally at least an order of magnitude more common than writes I'd say we want to optimise for read performance. This could be through our database structure, or through caching layers.

Unpacking Guy's original suggestion

I'd like to unpack the original suggestion of bumping the Version number every time an owned record is changed (or possibly an owned + cascade_deletes record*, which is the current semantic for what has been called "tightly coupled ownership", a relationship of composition rather than mere use).

What this would mean is that the following changes would need to trigger a version bump:

  • Add child record
  • Delete child record
  • Modify child record (which presumably includes reordering)

It raises a few questions:

  • For it to be semantically coherent, it should be possible to look up in the database the list of children for any one of these parent versions. After all, in a history explorer, that's precisely what we need to do: show a list of historical changes and then present the data at that point in time when we click on it. Does our data model allow for this, and if so, how? I feel that at this point "just bump the Version number of the page when a block changes" may not look like such a simple solution. That's not to say it's a bad start, but that we've only gone half way in our design. I suspect that the complete solution will start to look more like an extra table that links each change with the owner records that it relates to, but right now that's just a guess.

  • At minimum we'll get a version created for each block added. This could increase the number of versions by 10x quite easily. This would risk creating a lot of duplicate data in cases where you have a lot of SiteTree fields that don't change all that often on a page with a lot of frequently changing elements. Is this much of a problem? Version tables have historically been a big part of database bloat, but we could implement some garbage collection for this, and indeed probably should do so regardless of this change.

  • Do the extra versions that we're creating here break any important assumptions for developers using the Versions table for other purposes? Given that it's mostly about history scanning, this is probably "fixing a bug" from the perspective of other use-cases too, but we shouldn't gloss over this issue.

Recommended next step

My recommendation would be that, using SiteTree + Blocks as a specific use-case, we flesh out details of the data model that would allow for:

  • New versions to be listed that include changes to blocks as well as changes to the sitetree
  • The set of blocks for each SiteTree to be listed for each of those versions, considering the cases of added blocks, removed blocks, and modified blocks
  • When mapping this out, we should make note of the impact on write performance, as well as the impact on read performance
  • Finally, with all that in place, we can look at what changes to the current model are needed, which is our implementation plan and lets us validate that we haven't broken any APIs

In my view, that would be a complete solution to the problem identified here, that we can judge on its merits.

Footnote

  • My strong recommendation would be that we put the issue of "is owned + cascade_deletes the best way of indicating a tightly-coupled / composition relationship?" to one side, and pick it up as a separate issue, in order to limit the scope of this already-complicated discussion.

Version tables have historically been a big part of database bloat, but we could implement some garbage collection for this, and indeed probably should do so regardless of this change.

I thought about this too. Would need to be configurable, but I think we should add this.

Regarding your recommended next step, would that also involve moving away from archived record retrieval operating on an "archive date" filter basis to a version ID basis, or is that not relevant to this?

but we could implement some garbage collection for this, and indeed probably should do so regardless of this change.

#198

Version tables have historically been a big part of database bloat, but we could implement some garbage collection for this, and indeed probably should do so regardless of this change.

I thought about this too. Would need to be configurable, but I think we should add this.

We utilise https://github.com/axllent/silverstripe-version-truncator for this, it works quite well on large data sets. It is expensive to run though.
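
For illustration only, a configurable retention rule could look something like the sketch below. `versionsToPrune()` and its parameters are hypothetical. This is not a description of how silverstripe-version-truncator works, just one possible "keep the newest N plus the live version" policy.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical retention rule: keep the newest $keep versions and the
 * currently published version, and return everything else for deletion.
 *
 * @param int[] $versionNumbers All version numbers stored for one record
 * @return int[] Version numbers that may be deleted, newest first
 */
function versionsToPrune(array $versionNumbers, int $keep, ?int $liveVersion = null): array
{
    rsort($versionNumbers); // newest first
    $candidates = array_slice($versionNumbers, $keep);

    // Never delete the version that is currently live
    return array_values(array_filter(
        $candidates,
        fn(int $version): bool => $version !== $liveVersion
    ));
}

print_r(versionsToPrune([1, 2, 3, 4, 5, 6], 3, 2)); // versions 3 and 1
```

If owner history ends up depending on owned versions at a point in time, a rule like this would also need to know which owned versions are still referenced, which this sketch does not attempt.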

Given that reads are generally at least an order of magnitude more common than writes I'd say we want to optimise for read performance. This could be through our database structure, or through caching layers.

I'd err on the side of caution on this. Writes are already an expensive operation on large data sets within the SS ORM, and on a large data set with a continuous stream of data (think importing from a stock management system, for example) this can slow down write performance quite a lot.

I think that reading a set of version changes is something that you would expect to happen on a fairly ad-hoc basis, and I personally wouldn't mind trading a longer load when viewing versioned data (or changesets) for faster writes. In a practical situation, we rarely even allow access to the version set for most of our administration permissions, so being able to view past versions isn't something which gets run (literally at all) on our larger instances.

I don't want to make suggestions to the outcome here, but I'll just throw out some thoughts and learnings that we had from the last couple of large Block based projects.

I'm also not suggesting that this is how it will be for everyone.

Our content authors find managing different published/draft states between a Page and its Content Blocks to be confusing

From their point of view, the Blocks were just a way of organising the forms for a Page's content. They don't see them as being separate DataObjects, they just see them as "the Page's content".

For example, if they update some of the Page's other content (metadata, maybe), they would need to publish the Page in order to see that change on the frontend. They felt the exact same way about Blocks (because to them, a Block is just another piece of content, like the metadata).

In response to this, our solution for the current project (which has been well received) was to completely disable the publishing of Blocks independently from the Page. We rely on the Page level UI to indicate to an author when the Page needs publishing (and this is based on the current draft/published state of the objects it owns).

The gap in the experience above (and this could be a project level solve), is that the authors expect to be able to "Cancel draft changes" when they've updated a bunch of Blocks, and then realise that they don't like those changes. This interface isn't exposed by default at the moment, because no new draft state was created for the Page.

Hopefully that author experience is useful to someone 😃

This stuff has come up in two other contexts already:

The first issue has a bunch of observations on the behaviour and limits of ChangeSets. It's also somewhat related to Searching blocks content as part of pages in the CMS.

Trying to get my head around the use cases we're trying to cover:

  1. Publish records recursively incl. owned relationships
  2. Publish records in owned relationships without forcing a publish of the owner
  3. Delete records recursively incl. owned relationships (cascade_deletes)
  4. Cancel draft changes recursively incl. owned relationships
  5. View current draft version of content incl. the drafts of any nested and owned relationships in the CMS and Preview
  6. View historical version of record incl. any nested and owned relationships in the CMS and Preview
  7. Rollback to old version of record incl. any owned relationships
  8. Create version of owner record when owned relationships change
  9. Mark owner record as changed when owned relationships change (same as above?)
  10. Collect related records (often with owner relationships) for batch publication through a campaign
  11. Rollback a campaign incl. all included records
  12. List changes on owner record recursively for owned relationships (activity feed)

Discussed a bit offline with Sam, the larger issue here is that we're using a relational database as a graph database. Ideally we would inline has_one and has_many relationships into JSON fields on the original record (very common in NoSQL land). That brings a whole bunch of other denormalisation issues, and would likely force us to create yet another abstraction layer on top of our ORM, as well as developers dealing with a dual persistence model (and a significant data migration).

In response to this, our solution for the current project (which has been well received) was to completely disable the publishing of Blocks independently from the Page. We rely on the Page level UI to indicate to an author when the Page needs publishing (and this is based on the current draft/published state of the objects it owns).

The gap in the experience above (and this could be a project level solve), is that the authors expect to be able to "Cancel draft changes" when they've updated a bunch of Blocks, and then realise that they don't like those changes. This interface isn't exposed by default at the moment, because no new draft state was created for the Page.

So the UI of the blocks just include a Save and archive options (but no publish/unpublish)? @chrispenny

@ScopeyNZ @clarkepaul and I were discussing this idea you were talking about. It seems to be a short-term option (in terms of UX) for those who don't need to publish elements separately (without doing a full-fledged UX fix).

UX-wise, the confusion comes from communicating that child objects are part-saved/published without showing them as part-saved/published history records.

@newleeland that's correct - only the save and archive actions are available on each Block.

Let me know if you'd like to see a bit more of our UI and I can PM you some screen shots. We certainly haven't solved all of these problems though..

@clarkepaul @newleeland I'd be really interested to understand why it's so important that individual blocks can be published without simply publishing the parent page, as it risks significantly increasing the UI complexity – for comparison we don't provide an option to publish the Content field but not the Title.

What are the use-cases where this feature becomes important?

@chrispenny my main comment would be that, as much as possible, that kind of change should be done as PRs to elemental (e.g. with an "enable_single_block_publish = false" flag) rather than as project code. Probably too late now for your specific case but maybe something to migrate towards?

@sminnee definitely - for us though, this sort of interface is used on many more objects than just Blocks. We have a generic Extension applied to VersionedGridFieldItemRequest which allows us to control this for any/all DataObjects through config yml.

Possibly this could have been moved into a PR for the Versioned module, but with it being 11 months old, I can't really speak to why it wasn't.

@sminnee The reason for allowing blocks to be published separately came down to both research and the direction of "all" items having versions. For the research part, we found that on larger websites authors worked on different blocks from each other (especially with these longer pages). One author didn't want to publish someone else's draft work, especially if it needs to get approval from someone else (e.g. if block1 needs to go live tomorrow but you have things to publish on block2 today). A client gave examples of their Legal team being responsible for some parts of a page and Marketing for others.

Blocks are originally modelled off the way Gridfield works and that separation of functionality hasn't totally happened yet. So although we are talking about blocks the same rules need to be considered for other DataObjects.

In @chrispenny 's case they are in an in-between state where blocks are managed in a different view so I can understand the reasoning to remove the publish action from blocks, it is super confusing to publish both a block and then a page. We have progressed somewhat with the latest block updates and edits are made on the page view, so we anticipate people will use the page actions the majority of the time as they are the visible actions.

For reverting, we might need to consider showing the history in between major versions (tracking block changes). We have some page Activity feed designs which go into more detail on the incremental changes (John edited block 2, Joe reordered blocks). We might need this type of history tracking alongside the page history if you want to revert to an exact change/time. Block items should also eventually have their own history.

@ScopeyNZ @robbieaverill @chillu @maxime-rainville just to clarify my position on this (as outlined in my 4 Dec comment): I definitely think that this is something we should fix, but I'm not confident (without testing) that simply adding Versions to a parent record whenever a child record is changed will actually fix the issue:

  • I can see how it would add lines to a history view 💚
  • But will it match older versions of the blocks with the older version of the parent record? ❓
  • And will it restore the right versions of the blocks when you roll back the parent record? ❓

All 3 of these need to be addressed for the fix to be meaningful. I don't think fixing 1 without the others is of any value; it will just add to the confusion.

A good place to start might be to write some tests that cover these cases, and getting agreement on that?

Here's a starter as some pseudocode. I think this covers the 3 points above in a reasonable set of edge cases (add/modify/remove, child and grandchild).

  • Test structure: TestObject has_many+owns ChildObject many_many+owns GrandchildObject

testHistoryIncludesOwnedObjects

  • Create TestObject $a
  • Assert $a->history() has 1 entry
  • Add ChildObjects $a1 and $a2
  • Assert $a->history() has 3 entries
  • Modify object $a, write.
  • Assert $a->history() has 4 entries
  • Modify object $a1, write.
  • Assert $a->history() has 5 entries
  • Create GrandChildObject $a1i
  • Add $a1i to $a1
  • Assert $a->history() has 6 entries
  • Modify $a1i
  • Assert $a->history() has 7 entries
  • Remove $a1 from $a
  • Assert $a->history() has 8 entries
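
To make the expected counts concrete, here is a minimal Python sketch of the steps above. It is a toy in-memory model, not the Silverstripe API: the `Record` class and its `write`/`add`/`remove`/`history` methods are hypothetical, and it assumes each record has a single owner and that every change is logged on all owners up the chain (roughly Option A).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Record:
    """Hypothetical stand-in for a versioned DataObject (not the Silverstripe API)."""
    name: str
    owner: Optional["Record"] = None
    log: List[str] = field(default_factory=list)

    def _record(self, message: str) -> None:
        # Log the change on this record and on every owner above it.
        # Assumes a single owner per record and no ownership cycles.
        node: Optional["Record"] = self
        while node is not None:
            node.log.append(message)
            node = node.owner

    def write(self) -> None:
        self._record(f"write {self.name}")

    def add(self, child: "Record") -> None:
        child.owner = self
        self._record(f"add {child.name} to {self.name}")

    def remove(self, child: "Record") -> None:
        child.owner = None
        self._record(f"remove {child.name} from {self.name}")

    def history(self) -> List[str]:
        return list(self.log)


def run_scenario() -> List[int]:
    """Replay testHistoryIncludesOwnedObjects; return len(history) after each step."""
    a, a1, a2, a1i = Record("a"), Record("a1"), Record("a2"), Record("a1i")
    counts: List[int] = []
    a.write()                      # create $a
    counts.append(len(a.history()))
    a.add(a1)                      # add $a1 and $a2
    a.add(a2)
    counts.append(len(a.history()))
    a.write()                      # modify $a
    counts.append(len(a.history()))
    a1.write()                     # modify $a1
    counts.append(len(a.history()))
    a1.add(a1i)                    # create $a1i and add it to $a1
    counts.append(len(a.history()))
    a1i.write()                    # modify $a1i
    counts.append(len(a.history()))
    a.remove(a1)                   # remove $a1 from $a
    counts.append(len(a.history()))
    return counts
```

The sketch only covers the "adds lines to a history view" point; it says nothing about which child versions belong to which owner version.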

testVersionsIncludeCorrectVersionsOfRelatedRecords

  • Create TestObject $a
  • $v[1] = $a->Version
  • Add ChildObject $a1
  • $v[2] = $a->Version
  • Add ChildObject $a2
  • $v[3] = $a->Version
  • Modify object $a, write.
  • $v[4] = $a->Version
  • Modify object $a1, write.
  • $v[5] = $a->Version
  • Create GrandChildObject $a1i
  • Add $a1i to $a1
  • $v[6] = $a->Version
  • Modify $a1i
  • $v[7] = $a->Version
  • Remove $a1 from $a
  • $v[8] = $a->Version

(Then assert)

  • Get $a_v1 = $a->Version($v[1]).
    • Assert $a_v1->Children() is empty.
  • Get $a_v2 = $a->Version($v[2]).
    • Assert $a_v2->Children() contains unmodified $a1
    • Assert $a1->Children() is empty
  • Get $a_v3 = $a->Version($v[3]).
    • Assert $a_v3->Children() contains $a1 and $a2
  • Get $a_v4 = $a->Version($v[4]).
    • Assert $a_v4->Children() contains $a1 and $a2 and $a_v4 is the modified version
  • Get $a_v5 = $a->Version($v[5]).
    • Assert $a_v5->Children() contains $a1 and $a2 and $a1 is the modified version
  • Get $a_v6 = $a->Version($v[6]).
    • Assert $a_v6->Children() contains $a1 and $a2 and $a1 is the modified version
    • Assert $a1->Children() contains unmodified $a1i
  • Get $a_v7 = $a->Version($v[7]).
    • Assert $a_v7->Children()->filter([Name=>"A1"])->Children() contains $a1i and it is the modified version
  • Get $a_v8 = $a->Version($v[8]).
    • Assert $a_v8->Children() contains only $a2

testVersionsRollbackToCorrectVersionsOfRelatedRecords

  • The same as above, except instead of getting the different versions, roll back to those versions, flush any DataObject caches, get the stage version, and perform the assertion on that

Having written that test I retract one of my original points. I do think that the Version number of a parent record should be bumped when a child record is modified (or added, or deleted).

This is because not only do we need to have a history() entry for each of these states, we also need to have some kind of version identifier that we can use to look up historical data and/or roll-back to it.

If we don't bump the Version number, then we're going to end up with a second number that represents the version of the object-cluster, and although that could be a path we go down, it seems, on the face of it, unnecessarily complicated.

There's still some ambiguity over exactly what data model changes would be needed to make my test above pass.

One approach would be to have both an ID and a Version for each has_one stored in the _Versions tables. That way you could look up a has_one record easily.

To look up the has_many items for a specific version, you'd need to filter the child records where the has_one matches both the ID and the version. This would also mean that each time you bump the parent object's Version number, all the child versions need to be duplicated. This could get messy; so perhaps the has_one stores an ID and a list of Version numbers for the related object.

The alternative approach would be to pull these related versions into a whole separate data structure. For example, each Version of the parent object could list, in a separate table, the records and versions of its graph of owned data. This would minimise the chance of accidentally breaking other uses of this data.

There's a PR already started with tests FYI: #196

I'm happy to add some more tests outlined above but it will probably take me at least a week to find time 😅

It's option time again, incl. some crazy ones :D I find it easier when you can reference them directly. I've started this post a few hours ago, and events continued here since then hah.

  • Option A: Recursively tickle the version number of every owner object when a versioned object is written (Guy's solution)
  • Option B: Recursively traverse the owned objects to look for changes, whenever building a history list.
  • Option C: A middle approach: When a versioned object is written, a related-change record is written linking the recursive owner objects to this. When a history list is built, source it from both direct changes and related-change records (see Sam's comment)
  • Option D: Only allow author interactions with history (viewing, rollback) on ChangeSet level. A "cancel draft change" would only be possible in context of a ChangeSet, not any arbitrary point in the ownership chain. Will likely still require the "related change record" approach from Option C in order to mark owners as changed
  • Crazy Option E: Change our datamodel to inline owned relationships into nested data on a single database column in the "root" owner record (see the MySQL JSON data type and phptek/silverstripe-jsontext). This is similar to how Wordpress Gutenberg does blocks, although I haven't played with their versioning capabilities (and naturally assume they're not great heh). I think this would be a fairly extreme change to SilverStripe, even in 5.x - but mentioning it here for completeness.
  • Crazy Option F: Delegate content persistence to a JCR content repository with those abilities baked in, e.g. like ezPublish. See https://phpcr.readthedocs.io/en/latest/book/versioning.html and the concept of workspaces on https://phpcr.readthedocs.io/en/latest/book/introduction.html. The Typo3 fork NEOS has workspaces as well: https://www.neos.io/features/workspaces.html

I've found this Stackoverflow post useful in terms of the data modeling aspects.

I just had a chat to @ScopeyNZ to clarify the differences and it sounds like opt. B/C are the closest to how I imagine it working UX wise. Also I don't see that we need to show the version numbers of the owned items within the context of the owner.

I think we can park Options E and F as "maybe for SS5" and recognise that we need a v4.x (ideally v4.4) solution to this substantial gap in our current featureset when making block-based CMSes.

  • Option A isn't a complete solution, but could be the start of one. Specifically, it fails to clarify how the correct child records would be associated with each parent-version. My previous comment makes some suggestions.

  • Option B probably makes that problem easier, but doesn't clarify what you would use to identify each of these versions and look them up later - you would need some kind of separate list of versions, or a composite version identifier.

  • Option C feels a bit more complete, although I think that you'd still probably want to bump the version numbers of the parent records as in Option A to have a coherent system.

  • It seems like Option D would force the creation of changesets for most change activities and not merely publication, so that you could view and roll back to previous draft versions that were never published. Feels like a can of worms.

I think any of option A/B/C could produce similar UXes, but not option D.

I think that Option C + Option A in combination is probably the best road to a complete solution that doesn't have massive side effects.

Specifically:

  • Bump the owner's version every time owned objects are added/removed/modified
  • For each owner version, write to a VersionSnapshot table the IDs and version numbers of all related objects in its owned descendants.
  • When you retrieve an object of a specific version, use the VersionSnapshot table to fetch the right versions of any relations.

VersionSnapshot might have a structure like this:

  • OwnerClass (enum/varchar) - TestObject
  • OwnerID (int)
  • OwnerVersion (int)
  • OwnedClass (enum/varchar) - ChildObject, GrandChildObject
  • OwnedID (int)
  • OwnedVersion (int)

So when you're fetching relations of TestObject #5 version #6, you could add this to, for example, a query on ChildObject:

INNER JOIN VersionSnapshot vs_ChildObject ON vs_ChildObject.OwnerClass = 'TestObject' AND vs_ChildObject.OwnerID = 5 AND vs_ChildObject.OwnerVersion = 6
AND vs_ChildObject.OwnedClass = 'ChildObject' AND vs_ChildObject.OwnedID = ChildObject_versions.RecordID AND vs_ChildObject.OwnedVersion = ChildObject_versions.Version
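
As a rough, runnable illustration of that join, here is a Python sketch using an in-memory SQLite database. The table and column names follow the structure listed above; the sample rows, the `Title` column, the index and the helper names `build_db` and `children_at` are assumptions made up for the example.

```python
import sqlite3


def build_db() -> sqlite3.Connection:
    """Create toy tables with two versions of TestObject #5 (sample data is invented)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE ChildObject_versions (RecordID INTEGER, Version INTEGER, Title TEXT);
        CREATE TABLE VersionSnapshot (
            OwnerClass TEXT, OwnerID INTEGER, OwnerVersion INTEGER,
            OwnedClass TEXT, OwnedID INTEGER, OwnedVersion INTEGER
        );
        CREATE INDEX VersionSnapshot_Owner
            ON VersionSnapshot (OwnerClass, OwnerID, OwnerVersion);
    """)
    db.executemany("INSERT INTO ChildObject_versions VALUES (?, ?, ?)", [
        (1, 1, "child 1 original"),
        (1, 2, "child 1 edited"),
        (2, 1, "child 2 original"),
    ])
    # One snapshot row per owned record for EACH owner version:
    # child 2 is unchanged but still has to be written again for owner version 6.
    db.executemany("INSERT INTO VersionSnapshot VALUES (?, ?, ?, ?, ?, ?)", [
        ("TestObject", 5, 5, "ChildObject", 1, 1),
        ("TestObject", 5, 5, "ChildObject", 2, 1),
        ("TestObject", 5, 6, "ChildObject", 1, 2),
        ("TestObject", 5, 6, "ChildObject", 2, 1),
    ])
    return db


def children_at(db: sqlite3.Connection, owner_id: int, owner_version: int):
    """Fetch the child versions that belong to one specific owner version."""
    return db.execute("""
        SELECT ChildObject_versions.RecordID,
               ChildObject_versions.Version,
               ChildObject_versions.Title
        FROM ChildObject_versions
        INNER JOIN VersionSnapshot vs_ChildObject
            ON vs_ChildObject.OwnerClass = 'TestObject'
           AND vs_ChildObject.OwnerID = ?
           AND vs_ChildObject.OwnerVersion = ?
           AND vs_ChildObject.OwnedClass = 'ChildObject'
           AND vs_ChildObject.OwnedID = ChildObject_versions.RecordID
           AND vs_ChildObject.OwnedVersion = ChildObject_versions.Version
        ORDER BY ChildObject_versions.RecordID
    """, (owner_id, owner_version)).fetchall()
```

Note how the unchanged sibling is duplicated for every owner version in this form of the table.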

For each owner version, write to a VersionSnapshot table the IDs and version numbers of all related objects in its owned descendants.

That can amount to hundreds of reads and writes on each record save (our key project has pages with 100+ blocks, each of which can have more owned object structures). And unfortunately we can't make this an async task, since it's the versions at this point in time. They might've changed when a queued job kicks off a minute later.

Looking at your example, you're cascading this owner version bump up the ownership graph (from $a1i to $a1 to $a). Would you also have to go "sideways" to all the siblings in the same ownership graph? That's a lot of writes, but I can't see how your query approach would work without it?

Maybe we could separate a fast "save" operation from a slow "record history state" operation, in terms of synchronous requests from the client? It'll either require some form of semaphore locking, or we live with the possibility of getting inconsistent records. Probably a shit idea...

I guess there's no way to model this as a nested set in relational databases, because it'll be way too expensive to modify over time. We might have some fun with hierarchical queries though.

In the end, I think before diving too deep in the code here, we should come up with a realistic data set and think through how many read/write queries would need to run for both the "save history" and "view history" cases.

Yeah that could potentially be a lot of writes, although it would be an append-only dataset so perhaps there are ways of keeping that reasonably efficient?

The reading, on the other hand, would be a join, which we could hopefully keep efficient with indexes ("100s of records" isn't scary to a RDBMS, and we're only talking about history-view)

Option A+C mk 2

Let's assume this volume of writes is a non-starter. An alternative would be to use some kind of "first-version" / "last-version" marker so that a snapshot can apply to a range of owner-versions rather than a single version, so that it's only when a child record is added, removed, or modified that the index needs to be rewritten.

So you have more like MinOwnerVersion and MaxOwnerVersion, and either of them can be null to mean unlimited, and your query would then be:

INNER JOIN VersionSnapshot vs_ChildObject ON vs_ChildObject.OwnerClass = 'TestObject' AND vs_ChildObject.OwnerID = 5
AND (vs_ChildObject.MinOwnerVersion IS NULL OR vs_ChildObject.MinOwnerVersion <= 6)
AND (vs_ChildObject.MaxOwnerVersion IS NULL OR vs_ChildObject.MaxOwnerVersion >= 6)
AND vs_ChildObject.OwnedClass = 'ChildObject' AND vs_ChildObject.OwnedID = ChildObject_versions.RecordID AND vs_ChildObject.OwnedVersion = ChildObject_versions.Version

That way:

  • If you add a child record, set its MinOwnerVersion to owner's new version
  • If you delete a child record, set its MaxOwnerVersion to owner's previous version
  • If you modify a record, then
    • set the MaxOwnerVersion to the owner's previous version
    • create another VersionSnapshot record with MinOwnerVersion = owner's new version

And you shouldn't need to manipulate the VersionSnapshot table for data you aren't modifying.
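
The add/delete/modify rules can be sketched in Python against an in-memory SQLite table. This is a simplified model under stated assumptions: the class columns are dropped for brevity, and the helper names (`add_child`, `remove_child`, `modify_child`, `children_at`, `demo`) are hypothetical.

```python
import sqlite3


def connect() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute("""
        CREATE TABLE VersionSnapshot (
            OwnerID INTEGER,
            MinOwnerVersion INTEGER,   -- NULL means unbounded
            MaxOwnerVersion INTEGER,   -- NULL means "still current"
            OwnedID INTEGER,
            OwnedVersion INTEGER
        )
    """)
    return db


def add_child(db, owner_id, new_owner_version, owned_id, owned_version):
    # Open a new range starting at the owner's new version.
    db.execute("INSERT INTO VersionSnapshot VALUES (?, ?, NULL, ?, ?)",
               (owner_id, new_owner_version, owned_id, owned_version))


def remove_child(db, owner_id, new_owner_version, owned_id):
    # Close the open range at the owner's previous version.
    db.execute("""
        UPDATE VersionSnapshot SET MaxOwnerVersion = ?
        WHERE OwnerID = ? AND OwnedID = ? AND MaxOwnerVersion IS NULL
    """, (new_owner_version - 1, owner_id, owned_id))


def modify_child(db, owner_id, new_owner_version, owned_id, new_owned_version):
    # Close the old range, then open a new one for the new child version.
    remove_child(db, owner_id, new_owner_version, owned_id)
    add_child(db, owner_id, new_owner_version, owned_id, new_owned_version)


def children_at(db, owner_id, owner_version):
    """Return (OwnedID, OwnedVersion) pairs valid for the given owner version."""
    return db.execute("""
        SELECT OwnedID, OwnedVersion FROM VersionSnapshot
        WHERE OwnerID = ?
          AND (MinOwnerVersion IS NULL OR MinOwnerVersion <= ?)
          AND (MaxOwnerVersion IS NULL OR MaxOwnerVersion >= ?)
        ORDER BY OwnedID
    """, (owner_id, owner_version, owner_version)).fetchall()


def demo() -> sqlite3.Connection:
    """Owner 5: v2 adds child 1, v3 adds child 2, v4 edits child 1, v5 removes child 2."""
    db = connect()
    add_child(db, 5, 2, owned_id=1, owned_version=1)
    add_child(db, 5, 3, owned_id=2, owned_version=1)
    modify_child(db, 5, 4, owned_id=1, new_owned_version=2)
    remove_child(db, 5, 5, owned_id=2)
    return db
```

In this toy run, five owner versions are described by three rows, and the row for child 2 is never touched while child 1 is being edited.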

I think there's some merit in the idea of option D somewhere here. ChangeSets are under-utilised as far as I know, since you have to explicitly opt in to using them in your code - simply opting into using Versioned on a DataObject doesn't use ChangeSets out of the box. If we could update so that a ChangeSet represents a set of draft changes, and when published it adds the published state and the ChangeSet is published/finalised, that would at least encapsulate groups of changes. I don't think it would directly help this issue though.

In terms of the performance impact of having extra joins etc, I agree with @sminnee that I don't think it's a big concern, provided we do it in a single query with efficient joins (note that the existing joins on archive date are not performant in large data sets). We run into performance issues when we're doing a large number of queries rather than querying a large number of records (with the linked exception above), provided we use indexes effectively which I think we do for the most part.


In terms of the data structure - I'm using elemental attached to a Page as an example and just looking at the DB table for Page (ignore StringTags):

mysql> select * from Page;
+----+------------+-----------------+
| ID | StringTags | ElementalAreaID |
+----+------------+-----------------+
|  1 | one,two    |              12 |
|  2 | NULL       |              24 |

The relationship here is Page has_one ElementalArea, and ElementalArea has_many Elements.

In my view, the problem is that we're tracking the ElementalAreaID but not the version for it (note that the Page_Versions table is the same as this). I can only assume that the only way we can work out which version of ElementalAreaID 12 or 24 should have been attached to a specific record in Page_Versions is by the archive date join, which is an unstructured approach which doesn't scale well (while it does work well for page previews to give wider context).

If we were able to augment the Page_Versions table to include an ElementalAreaID_Version column (foreign key to ElementalArea_Versions.ID), we immediately have structured data. In terms of Option A this would now be easier to achieve - you create a new modified version of an Element, we add logic that says "update the associated version of this record for anything that owns it", and you automatically get a new modified version of that owner as well - this would be recursive.

Potentially the Element_Versions table (a has_many object owned by ElementalArea) would need to track the ParentVersion as well as the ParentID too:

mysql> select * from Element_Versions limit 1\G
*************************** 1. row ***************************
          ID: 1
    RecordID: 1
     Version: 1
WasPublished: 0
  WasDeleted: 0
    WasDraft: 1
    AuthorID: 1
 PublisherID: 0
   ClassName: SilverStripe\ElementalBannerBlock\Block\BannerBlock
  LastEdited: 2018-10-18 02:53:11
     Created: 2018-10-18 02:53:11
       Title: NULL
   ShowTitle: 0
        Sort: 1
  ExtraClass: NULL
       Style: NULL
    ParentID: 2

I kind of just smashed my thoughts down into this comment, but what do we think about this?

Tracking the ID & version of a has_one in the _Versions tables makes sense on the face of it, but I shied away from this because it would lead to a lot of extra records, and risk creating an infinite loop.

Assume the data structure Page has_many Element, Element has_one Page.

  • If Element_versions has PageID and PageVersion then you need to create a new Element version whenever Page version is changed.
  • The Page version also needs to be bumped when
  • So, you get a situation:
    • Element is changed
    • Page version is bumped
    • All other Elements need to have their version bumped to attach to the new Page version.
    • ...which means that the Page version needs to be bumped? :trollface:

Now we could protect against that infinite loop with some careful coding, but it does highlight the fragility of the model. Also that's a lot of writing of Element_versions (a change to any Element requires the change of all other elements). Even with DB optimisation it seems like this would easily get quite slow.
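
For what the "careful coding" guard might look like, here is a minimal Python sketch: a breadth-first walk over owners with a visited set, so each owner is bumped at most once even if the ownership structure is recursive. The `owners` mapping and the function name `owners_to_bump` are hypothetical, and the sketch only addresses termination, not the volume of writes.

```python
from collections import deque
from typing import Dict, List, Set


def owners_to_bump(changed: str, owners: Dict[str, List[str]]) -> List[str]:
    """Return every direct and indirect owner of `changed`, each exactly once.

    `owners` maps an object key to the keys of its direct owners (an assumption
    for this sketch; a real implementation would read ownership config).
    """
    seen: Set[str] = {changed}
    queue = deque([changed])
    result: List[str] = []
    while queue:
        node = queue.popleft()
        for owner in owners.get(node, []):
            if owner not in seen:   # the guard that breaks cycles
                seen.add(owner)
                result.append(owner)
                queue.append(owner)
    return result
```

With a plain chain such as C1 owned by B1 owned by A1 this returns the two owners in order; with two records that own each other it still terminates after visiting each once.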

I think that a has_one shouldn't be tied to a single version but to a range of versions. So, like my example in my previous comment, you could have:

  • Element.PageID
  • Element.PageVersionMin (can be null/0 to indicate "unbounded")
  • Element.PageVersionMax (can be null/0 to indicate "unbounded")

This wouldn't require a rewrite of sibling records (as they would all have Element.PageVersionMax = null), and so keeps the writes in check.

The general concept of the model is that the most recent version of an object would have PageVersionMax = null, and PageVersionMax would be set whenever a record was modified or deleted.

This saves the need for making a separate VersionSnapshot table, so might be an improvement.

I like @sminnee's idea. In the case of a many_many we can introduce the min/max to the pivot table as well.

  • This allows us to not have new version numbers for the page.
  • We can report on Element changes between page versions: WHERE Element.PageVersionMin = Page.Version OR Element.PageVersionMax = Page.Version
  • Getting all changes in a list can be done with a UNION I guess?: SELECT Version, "Page" as ChangeType FROM Page_Version WHERE PageID = 1 UNION SELECT Version, "Element" as ChangeType FROM Element_Version WHERE PageID = 1 (or just two separate queries 😂 )
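
A quick way to try the UNION idea is an in-memory SQLite database driven from Python. The table names follow the query above; the `RecordID` column, the sample rows and the helper names are assumptions for illustration only.

```python
import sqlite3


def build_db() -> sqlite3.Connection:
    """Toy data: two page versions, and three element versions across two elements."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE Page_Version (PageID INTEGER, Version INTEGER);
        CREATE TABLE Element_Version (RecordID INTEGER, PageID INTEGER, Version INTEGER);
        INSERT INTO Page_Version VALUES (1, 1), (1, 2);
        INSERT INTO Element_Version VALUES (10, 1, 1), (11, 1, 1), (10, 1, 2);
    """)
    return db


def changes_for_page(db: sqlite3.Connection, page_id: int):
    """Combine page and element changes for one page into a single list."""
    return db.execute("""
        SELECT Version, 'Page' AS ChangeType FROM Page_Version WHERE PageID = ?
        UNION
        SELECT Version, 'Element' AS ChangeType FROM Element_Version WHERE PageID = ?
        ORDER BY ChangeType, Version
    """, (page_id, page_id)).fetchall()
```

One thing this surfaces: plain UNION collapses identical rows, so the two elements that both sit at version 1 show up as a single "Element" entry. A real change list would probably need the record ID in the select list (or UNION ALL) plus a shared ordering column such as a date, since page and element version numbers are not comparable.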

What other "acceptance criteria" should we be looking to meet with this solution?

In the case of a many_many we can introduce the min/max to the pivot table as well.

Currently, many_many is incompatible with versioning and "many many through" should be used. So we could just skip many_many support tbh.

This allows us to not have new version numbers for the page.

Does it? How do you request the intermediate historical versions? It feels like it could get messy to think about how you would identify each version based on owned-record-changes, esp if you think about grandchild records, etc.

I think we still need to nudge the Page version each time an element changes, but we don't have to rewrite the Elements when we do this.

How do you request the intermediate historical versions?

I'm very confused as to how this works right now. There's no current record of which versions of elements belong to a page version, but it does seem to rollback correctly if you manually make a page version to roll back to?

I thought that rolling back and previewing versions was based on timestamps and not necessarily the version number but I can't seem to find that in the code. I don't really know how it works 😖 .

If Element_versions has PageID and PageVersion then you need to create a new Element version whenever Page version is changed.

Good point. I guess a pivot table would be the least disruptive idea to get around this, rather than adding PageVersion to the Element_Versions table

I thought that rolling back and previewing versions was based on timestamps and not necessarily the version number but I can't seem to find that in the code.

See https://github.com/silverstripe/silverstripe-versioned/blob/1/src/Versioned.php#L638. It took me about ten minutes of digging the other day to track that down as well. I'm hoping that we don't end up with a solution which supports both date-based and a new versioned-state-based viewing and rollback. I'm not sure how realistic that is, because date-based viewing on previews also includes versioned items which aren't in an ownership relationship. If we keep both approaches, there's always the potential that your rollback actually looks different from what you're previewing.

Just to spell out the obvious: We're discussing a fourth versioning approach here, some of which overlap in purpose: *_versions, ChangeSets, archive-date query augmentation, and whatever we come up with here.

I think any approach we create should be validated against the dozen use cases I've listed earlier (they're numbered, so easy to reference). In particular, I want to avoid designing a solution which prevents us from doing secondary goals like an activity feed, or efficient marking of modified records in batch requests like generating a page tree.

But also, just to ensure we're not getting too sidetracked with complexity here, let's remind everyone what the main driver in our key project is for this discussion (as far as I understand): The need to identify and cancel draft changes on a page level incl. changes to content blocks. This button is currently not showing because content blocks don't trigger the change status on their owners (elemental area and page).

Conceptually, I'm finding it helpful to think about this as a "version graph snapshot" whenever anything changes in the ownership graph. This might be a partial snapshot if we can find ways to create a full snapshot by assembling these partials. And we need to find efficient ways to both create this snapshot, and identify the right snapshot for an object context based on arbitrary entry points in this ownership tree (e.g. page vs. element). This snapshot is called a generation in the SO post I've linked to before. You might also call it a "global version number", in addition to the "owner version". This would allow for diffs on the entire object graph (e.g. find all changes between Page version=1 generation=def23a and Page version=4 generation=abc34b, which could include Element version 5 generation=ffd44a) in a single query. By making the generation a UUID, we could avoid global locking issues around concurrent writes in a multi-user CMS. Creation dates would be used to order generations (e.g. for an activity feed). Any writes related to a generation would need to live in a database transaction to ensure consistent date ordering on generations.

Generations aren't solving all problems listed here, but for Sam's Option A+C mk 2 I can't see a performant way to determine if an ownership graph has draft changes relative to a specific owner version, without doing expensive traversing of the whole graph - a non-starter for use cases like marking changed objects in a tree view.

EDIT: Since the idea of generations relies on associating the object graph to a specific generation on each write, it's actually no different from just checking if the owner you're interested in has a new entry in the proposed VersionSnapshot table.

(Dead end, but recording this in case anyone else goes down this path): I was wondering if we can achieve some efficiency gains by constraining the solution to assume a "root owner", which would by default be pages. So every owned child would also create version entries relative to this root owner, not its parent owner. It'd make activity feeds, modification status and archive view query augmentation a lot easier as long as you're doing those in context of the root owner. It would save us from bumping versions all the way through the ownership chain, and speed up save/publish operations. But without these intermediary version bumps, you don't get consistent rollback and preview behavior (beyond the current archive-date-based approach), so it's a non-starter. Also, each object can have more than one root owner, derp derp.

There's a few requirements on any solution here:

  • Needs to support deep and wide ownership graphs (5+ ownership levels with 100+ objects in graph)
  • Save and publish needs to be reasonably fast at every point in a deep ownership graph
  • State shown in preview needs to be what you can actually roll back to (either through relational integrity or date-based)
  • State shown in preview needs to account for non-owned objects
  • Modification status needs to be a fast aggregation across dozens of ownership graphs
  • Preview and CMS viewing of draft changes can't be much slower than actually viewing published content
  • Preview and CMS viewing of archived versions can be a bit slower (less used), but not time out
  • The database size should not grow exponentially relative to the amount of modifications and new records

There's a few expensive things to optimise:

  • On insert/delete/write of owned objects
    • get the whole ownership graph at this point in time
    • record changes on the direct owner of a modified object
    • record changes for every object in an ownership graph
  • On listing of owner objects (e.g. in a tree)
    • get modification status based on ownership graph of each object
  • On listing of history on a single owner object
    • get the list of changes across the ownership graph
  • On preview of drafts and archive viewing
    • get the correct versions for all viewed objects (regardless of ownership graph) based on the currently selected version (or date)
  • On rollback
    • get the correct versions for all objects in the ownership graph based on the currently selected version (or date)
    • write new versions for all affected objects in the ownership graph

Had a good whiteboard session with @ScopeyNZ. I think we should try to stick with the date-based preview, archive and rollback behaviour, and not try to achieve full relational integrity on versions of an ownership graph. This is assuming that while there might be some bugs in the current date-based rollback behaviour (e.g. around deleted records), they can be fixed and aren't structural flaws.

But in order to allow for fast aggregate change tracking, I think we need to record the path from the originating change up the ownership graph, excluding siblings. I can't see a scenario where we can allow for fast aggregation without this level of denormalisation. Hence the need for a VersionSnapshots table.

I'm suggesting a VersionSnapshot table with the following structure:

  • UUID (Varchar)
  • Created (Datetime)
  • ObjectClass (String)
  • ObjectID (Int)
  • ObjectVersion (Int)
  • IsPublished (Bool)

Every save and/or publish operation to an object in the ownership graph would follow up the chain of its owners, and add an entry for itself and each of its owners. This operation would be wrapped in a transaction, and receive a UUID.
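
A minimal Python sketch of that write path, assuming the ownership graph is available as a mapping from each object to its direct owners: it walks up from the changed object, skips siblings, and stamps every row with one shared UUID and creation time. The names `SnapshotRow`, `record_snapshot`, `owners` and `state` are hypothetical, and the database transaction is left out.

```python
import uuid
from collections import deque
from datetime import datetime, timezone
from typing import Dict, List, NamedTuple, Tuple


class SnapshotRow(NamedTuple):
    snapshot_uuid: str
    created: datetime
    obj: str
    version: int
    is_published: bool


def record_snapshot(
    changed: str,
    owners: Dict[str, List[str]],
    state: Dict[str, Tuple[int, bool]],
) -> List[SnapshotRow]:
    """Build snapshot rows for `changed` and every owner above it (no siblings).

    `owners` maps an object to its direct owners; `state` maps an object to its
    current (version, is_published) pair. Both are assumptions for this sketch.
    """
    snapshot_uuid = str(uuid.uuid4())      # one UUID per save/publish operation
    created = datetime.now(timezone.utc)   # one timestamp for the whole group
    seen = {changed}
    queue = deque([changed])
    rows: List[SnapshotRow] = []
    while queue:
        obj = queue.popleft()
        version, is_published = state[obj]
        rows.append(SnapshotRow(snapshot_uuid, created, obj, version, is_published))
        for owner in owners.get(obj, []):
            if owner not in seen:          # also guards against recursive ownership
                seen.add(owner)
                queue.append(owner)
    return rows
```

For a graph where C1 is owned by B1, C2 by B2, and both B1 and B2 by A1, saving C1 yields rows for C1, B1 and A1 only.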

Then you can use the following queries to identify the modification state of the whole object graph:

  • $lastPublishDate = SELECT Created FROM TestObject_Live WHERE ObjectClass = 'TestObject' AND ObjectID = 5 ORDER BY Created DESC LIMIT 1
  • $uuids = SELECT UUID FROM VersionSnapshot WHERE ObjectClass = 'TestObject' AND ObjectID = 5 AND Created > $lastPublishDate
  • $isModified = SELECT COUNT(*) FROM VersionSnapshot WHERE ObjectClass = 'TestObject' AND ObjectID = 5 AND UUID IN ($uuids) AND IsPublished = 0 GROUP BY ObjectClass, ObjectID

There's probably a way to do this in fewer than three queries, but it's a hell of a lot more efficient than traversing the whole object graph and comparing PageVersionMax etc. A similar approach can be used to roll up all changes in an object graph for an activity feed. But it would not be used for rollbacks, since it only contains partial data of the object graph.

Scenarios

In the following scenarios, we're assuming ownership relationships between records of type A, B and C. I've staggered the version numbers for clarity.

Scenario 1: Bump page versions

Step 1: C1 is saved (not published), creating a new draft version v31. This creates new versions up the ownership chain, but not to siblings (B1 and A1, not B2). B1 and A1 are marked as modified due to this.

Step 2: C1 is published, creating new published version v32, and new versions up the ownership chain. B1 and A1 are no longer marked as modified.

Scenario 2: Version Snapshots

Step 1: C1 is saved, creating a new draft version v31. No other versions are created up the ownership chain. B1 and A1 are marked as modified due to the VersionSnapshot data being written below, not because their own version has increased.

New VersionSnapshot entries are created for each record

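+--------+--------+---------+-------------+----------+
| UUID   | Object | Version | IsPublished | Created  |
+--------+--------+---------+-------------+----------+
| def123 | C1     | v31     | 0           | 12:30:00 |
| def123 | B1     | v20     | 1           | 12:30:00 |
| def123 | A1     | v10     | 1           | 12:30:00 |
+--------+--------+---------+-------------+----------+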

Step 2: C2 (a separate child under a sibling) is published (not saved). No new version of C1 is created.

New VersionSnapshot entries are created for each record, one set for the "save", and one for the "publish" (existing SilverStripe behaviour). C1 is still unpublished, meaning B1 and A1 are still marked as modified.

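+--------+--------+---------+-------------+----------+
| UUID   | Object | Version | IsPublished | Created  |
+--------+--------+---------+-------------+----------+
| def123 | C1     | v31     | 0           | 12:30:00 |
| def123 | B1     | v20     | 1           | 12:30:00 |
| def123 | A1     | v10     | 1           | 12:30:00 |
| abc456 | C2     | v51     | 0           | 12:40:00 |
| abc456 | B2     | v20     | 1           | 12:40:00 |
| abc456 | A1     | v10     | 1           | 12:40:00 |
| fff678 | C2     | v52     | 1           | 12:40:01 |
| fff678 | B2     | v20     | 1           | 12:40:01 |
| fff678 | A1     | v10     | 1           | 12:40:01 |
+--------+--------+---------+-------------+----------+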

With the query examples above, we can effectively ask "give me all the snapshots where A1 is involved after it has last been published". Then "give me the latest versions of each object involved in those snapshots, and check if any of them haven't been published". I think that deletions would be treated the same way as modifications (marked as IsPublished=1).
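
Those two questions can be tried out against the Scenario 2 rows with a small Python and SQLite sketch. The rows are the ones from the table above; the last-published time passed in, the single `Object` column (instead of ObjectClass/ObjectID), and the helper names `build_db` and `is_modified` are assumptions for illustration.

```python
import sqlite3
from typing import Dict, Tuple

# Snapshot rows from Scenario 2, step 2.
ROWS = [
    ("def123", "C1", 31, 0, "12:30:00"),
    ("def123", "B1", 20, 1, "12:30:00"),
    ("def123", "A1", 10, 1, "12:30:00"),
    ("abc456", "C2", 51, 0, "12:40:00"),
    ("abc456", "B2", 20, 1, "12:40:00"),
    ("abc456", "A1", 10, 1, "12:40:00"),
    ("fff678", "C2", 52, 1, "12:40:01"),
    ("fff678", "B2", 20, 1, "12:40:01"),
    ("fff678", "A1", 10, 1, "12:40:01"),
]


def build_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE VersionSnapshot
                  (UUID TEXT, Object TEXT, Version INTEGER, IsPublished INTEGER, Created TEXT)""")
    db.executemany("INSERT INTO VersionSnapshot VALUES (?, ?, ?, ?, ?)", ROWS)
    return db


def is_modified(db: sqlite3.Connection, obj: str, last_published: str) -> bool:
    """True if any object in snapshots involving `obj` is unpublished in its latest version."""
    # 1. All snapshots involving the owner since it was last published.
    uuids = [row[0] for row in db.execute(
        "SELECT DISTINCT UUID FROM VersionSnapshot WHERE Object = ? AND Created > ?",
        (obj, last_published))]
    if not uuids:
        return False
    # 2. Latest version of each object in those snapshots (highest Version wins).
    marks = ", ".join("?" for _ in uuids)
    latest: Dict[str, Tuple[int, int]] = {}
    for name, version, published in db.execute(
            f"SELECT Object, Version, IsPublished FROM VersionSnapshot "
            f"WHERE UUID IN ({marks}) ORDER BY Version", uuids):
        latest[name] = (version, published)
    # 3. Modified if any of those latest versions is still unpublished.
    return any(published == 0 for _, published in latest.values())
```

Assuming A1, B1 and B2 were all last published at 12:00:00, A1 and B1 come out as modified (because of C1 v31) while the sibling B2 does not, since C2's latest version v52 is published.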

I'm further suggesting that we try to build this VersionSnapshot behaviour as a module, rather than committing to this in core while there's so many unknowns.

The tradeoff here is that concurrent writes involving the same object graph path by different authors could end up causing incorrect modification status information in that chain, e.g. if the same object is saved by one author, and published by another author, in the same second. My feeling is that's rare enough to be acceptable, particularly since it's not exactly data corruption (the _versions tables stay as-is), but rather inaccurate auxiliary info that's not used to modify any data. The only usage of VersionSnapshot tables is to efficiently roll up changes in order to mark owners as modified, and provide an activity feed.

I'm suggesting a VersionSnapshot table with the following structure

A few things:

  • I think it would be good to have a table where there was 1 record per UUID. So a pair of tables, say, VersionSnapshot and VersionSnapshotItem. It just seems like it would be a more coherent data-model.

  • You can use BINARY(16) to store a UUID in binary, which I think would be valuable for performance, for something used as an index. http://php.net/manual/en/function.bin2hex.php might help with presenting them in a friendly way; if you wanted to get fancy you could do a custom DBField. Or, if you had a separate VersionSnapshot / VersionSnapshotItem table, you could just use a regular silverstripe PK (which has been suggested we shift to allowing for UUIDs in another ticket).

  • We'd probably also want to clarify the relationship between these snapshots and changesets, but maybe that can wait until this behaviour hits core. In general, I would see a changeset as a less granular data structure, and so keeping both "snapshots" and "changesets" seems viable, but perhaps "changesets" could be refactored to reference the snapshots that they are destined to publish. To use git as an analogy: snapshot = commit and changeset = branch.

  • I'm still holding a candle for the VersionMin / VersionMax approach, but it can be a subsequent PR or something ;-)
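
The two-table split and the binary UUID storage could look roughly like this, sketched in Python against in-memory SQLite. Column names beyond those already proposed (`ID`, `SnapshotID`, `AuthorID`) and the helpers `create_snapshot` and `items_for` are assumptions; SQLite's BLOB stands in for a 16-byte binary column.

```python
import sqlite3
import uuid
from typing import List, Tuple

Item = Tuple[str, int, int, int]  # (ObjectClass, ObjectID, ObjectVersion, IsPublished)


def connect() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE VersionSnapshot (
            ID INTEGER PRIMARY KEY,
            UUID BLOB NOT NULL UNIQUE,          -- 16 raw bytes
            Created TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
            AuthorID INTEGER NOT NULL
        );
        CREATE TABLE VersionSnapshotItem (
            SnapshotID INTEGER NOT NULL REFERENCES VersionSnapshot(ID),
            ObjectClass TEXT NOT NULL,
            ObjectID INTEGER NOT NULL,
            ObjectVersion INTEGER NOT NULL,
            IsPublished INTEGER NOT NULL
        );
    """)
    return db


def create_snapshot(db: sqlite3.Connection, author_id: int, items: List[Item]) -> uuid.UUID:
    """Write one snapshot row plus its items in a single transaction."""
    snapshot_uuid = uuid.uuid4()
    with db:  # commits on success, rolls back on error
        cur = db.execute(
            "INSERT INTO VersionSnapshot (UUID, AuthorID) VALUES (?, ?)",
            (snapshot_uuid.bytes, author_id))
        db.executemany(
            "INSERT INTO VersionSnapshotItem VALUES (?, ?, ?, ?, ?)",
            [(cur.lastrowid, *item) for item in items])
    return snapshot_uuid


def items_for(db: sqlite3.Connection, snapshot_uuid: uuid.UUID) -> List[Item]:
    """Look up the items of a snapshot by its binary UUID."""
    return db.execute("""
        SELECT i.ObjectClass, i.ObjectID, i.ObjectVersion, i.IsPublished
        FROM VersionSnapshotItem i
        JOIN VersionSnapshot s ON s.ID = i.SnapshotID
        WHERE s.UUID = ?
        ORDER BY i.rowid
    """, (snapshot_uuid.bytes,)).fetchall()
```

The friendly form is then just `snapshot_uuid.hex` or `str(snapshot_uuid)`, and author and date live once per snapshot rather than once per item.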

Scenario 1 Step 2: C1 is published, creating new published version v32, and new versions up the ownership chain. B1 and A1 are no longer marked as modified.

My understanding is that publication shouldn't create a new version, but rather merely copy the existing version to the Live stage?

I'm further suggesting that we try to build this VersionSnapshot behaviour as a module, rather than committing to this in core while there's so many unknowns.

I've got mixed feelings about this, but I'd be okay with it if we planned from the get-go for this to be a temporary module and, say, in 4.5 we'd expect to refactor the code into core. I think that retaining this as a module in the longer term would make it harder to use and maintain.


But TL;DR – go forth and build! 🎉

VersionSnapshot and VersionSnapshotItem table separation is fine, it has the added benefit of tracking separate creation dates for the entire snapshot, rather than multi-second windows for different objects.

Regarding UUIDs, I'm starting to wonder if that's overcooking it. You're just as likely to create table locks on concurrent writes as with the existing SiteTree table structures. In other words, if you're writing versioned objects, it's quite likely writing to SiteTree_* somewhere. Now it's also writing to VersionSnapshot. Maybe we'll keep the UUID aspects to the broader discussion at silverstripe/silverstripe-framework#8411.

Thinking about VersionSnapshot a bit more, I think we also need a deleted boolean. Actually, we need the full surface available on versioned tables: WasPublished, WasDraft, WasDeleted. If we ignore many-many relationships (and stick to many-many-through), I believe this would allow us to model the complete state of an ownership graph at any point in time. I've done a bit of data modeling based on the examples above in actual SQL, and it looks promising: https://gist.github.com/chillu/f98f75fc98d461dfe23574f5e6686198
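As a rough illustration of the two-table idea with the three flags, here's a toy in-memory model in Python. It is not the SQL from the gist and not the module's actual schema: the field names, the `is_modified` rule, and the version numbers for B1 and A1 are all assumptions made up for the example. The rule used here is deliberately simplistic: find the latest snapshot that includes the object, and call it modified if that snapshot contains a draft-only (not published, not deleted) item.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class SnapshotItem:
    """One row per object version captured in a snapshot (hypothetical shape)."""
    object_class: str
    object_id: int
    version: int
    was_draft: bool = False
    was_published: bool = False
    was_deleted: bool = False


@dataclass
class Snapshot:
    """Groups the items written by one original action, with author and date."""
    author: str
    created: str
    items: List[SnapshotItem] = field(default_factory=list)


def is_modified(history: List[Snapshot], object_class: str, object_id: int) -> bool:
    """Toy rule: look at the latest snapshot that includes the object."""
    key = (object_class, object_id)
    for snapshot in reversed(history):
        if any((i.object_class, i.object_id) == key for i in snapshot.items):
            return any(
                i.was_draft and not i.was_published and not i.was_deleted
                for i in snapshot.items
            )
    return False  # never captured in any snapshot


# C1 is saved to draft; its owners B1 and A1 are recorded alongside it.
saved = Snapshot("author-1", "2018-10-01 10:00:00", [
    SnapshotItem("C", 1, 31, was_draft=True),
    SnapshotItem("B", 1, 5),
    SnapshotItem("A", 1, 3),
])
# C1 is then published as v32.
published = Snapshot("author-2", "2018-10-01 10:05:00", [
    SnapshotItem("C", 1, 32, was_draft=True, was_published=True),
    SnapshotItem("B", 1, 5),
    SnapshotItem("A", 1, 3),
])

print(is_modified([saved], "A", 1))             # owner flagged after the draft save
print(is_modified([saved, published], "A", 1))  # cleared after the publish
```

A real implementation would need to handle unpublish, multiple owners, and items spread across overlapping snapshots, none of which this toy rule covers.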

Two nested sub selects aren't ideal, but even if we don't find a way to flatten this into joins, it's a max of three queries per item you want to check (e.g. when listing a page tree), as opposed to ~2x the amount of queries as you have owned object connections in the graph.
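To put rough numbers on that comparison, here's a back-of-envelope sketch in Python. The "3 per item" and "about 2 per owned connection" figures come from the paragraph above; the 50-page tree with 40 owned connections per page is an invented example, not a measurement.

```python
def snapshot_queries(items: int) -> int:
    """Upper bound with snapshots: at most three queries per listed item."""
    return 3 * items


def traversal_queries(items: int, owned_connections_per_item: int) -> int:
    """Rough cost without snapshots: about two queries per owned connection."""
    return 2 * owned_connections_per_item * items


# Hypothetical page tree: 50 listed pages, each with 40 owned connections.
pages, connections = 50, 40
print(snapshot_queries(pages))                # 150
print(traversal_queries(pages, connections))  # 4000
```

The snapshot cost stays flat per item regardless of how deep or wide the ownership graph is, which is the point of the approach.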

If I'm right and we can model the whole ownership graph, we could in theory use this for rollbacks. I can't see us using it for previews, since they need to preview un-owned versioned objects as well. Given we have already implemented a date-based rollback on ownership graphs which models what the preview does, I'm not sure how much value there is here.

Another comment on Sam's Element.PageVersionMin idea: Each object can have multiple owners, even owners of the same type. I don't see how we can normalise that into the database row of the owned object itself. We have discussed a tighter ownership ($owns + $cascade_deletes), but even that isn't an exclusive ownership. From my understanding, the system doesn't throw an exception if you configure this on two different types of owners for the same owned object.

I've added some ACs, can you all please check my understanding? @unclecheese has actually been battle testing my approach for a few days, has written out scenarios as "SQL unit tests", and performance testing with 1m+ rows. Looking good so far.

Does not mark owners as modified if an owned relation has been deleted from draft or live

Is this right? In my mind, even with cascade_delete, that would only unpublish or delete owned relationships, but it doesn't delete an owned relationship from live if it was previously deleted from draft. So since there's no way for a publication of the owner to change the deletion status of an owned relationship, it shouldn't show up as modified.

Is resilient to partial data sets (can't retroactively create this data)

We could create a version snapshot for the current ownership graph on any changed owned object, but there's no way to recreate this for historical versions. If we don't make a migration task for this, you'll have websites which inconsistently mark records as changed (or not) based on the last time they received a draft change in their ownership structure. This seems acceptable as a first cut, but we might be forced to write this task eventually.

Criterion "Does not significantly slow down save operations"

Performance testing was done with the following setup.

Page -> Block -> Image

One page has many blocks, each one of the blocks owns the image.
As such the image has multiple owners.

Here are some test results

Without snapshots:

  • Page save: 2s
  • Create a new block: 5s
  • Save the image: 3s (1 owner) -> ?? (20 owners) -> 13s (40 owners) -> 18s (60 owners)

Snapshots installed:

  • Page save: 5s (40 blocks) -> 3s (60 blocks)
  • Page publish: 20s (40 blocks) -> 16s (60 blocks)
  • Create a new block: 8s
  • Save the image: 4s (1 owner) -> 16s (20 owners) -> 28s (40 owners) -> 44s (60 owners)

Notes

Without snapshots; Save the image: 3s (with 1 owner) -> 13s (with 40 owners) -> 18s (60 owners)

This may indicate that the tests measure more than just the snapshot delta, with other factors contributing as well.

Result

Save operation (for the image asset). Owners are the blocks that own the image.

| Number of owners | No snapshots | With snapshots |
|------------------|--------------|----------------|
| 1                | 3s           | 4s             |
| 20               | ??           | 16s            |
| 40               | 13s          | 28s            |
| 60               | 18s          | 44s            |
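For a quick read on the relative overhead, the slowdown factor can be worked out from the figures above. This is just arithmetic on the reported timings (Python used for illustration); the 20-owner baseline was not recorded, so it is left as unknown rather than guessed.

```python
# (seconds without snapshots, seconds with snapshots) per number of owners.
# None marks the baseline that was reported as "??".
timings = {1: (3, 4), 20: (None, 16), 40: (13, 28), 60: (18, 44)}


def slowdown(without, with_snapshots):
    """Ratio of save time with snapshots to save time without, or None if unknown."""
    if without is None:
        return None
    return round(with_snapshots / without, 2)


factors = {owners: slowdown(*pair) for owners, pair in timings.items()}

for owners, factor in factors.items():
    print(owners, "owners:", "unknown" if factor is None else f"{factor}x")
```

On these numbers the overhead grows with the number of owners, from about 1.33x at 1 owner to about 2.44x at 60.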

Criterion "Does not significantly slow down save operations"

OK, I'd say that's good enough. The "image has 40 owners" case is a bit of an edge case anyway, so the fact that save operations on this already fairly complex codebase slow down from 13s to 28s isn't great, but it's acceptable for an alpha-level module. I think @unclecheese took the optimisations as far as he could. The next step would be to calculate snapshots asynchronously, which would then cause all kinds of fun edge cases, but might be a necessary evil for more complex projects. Let's get a bit more feedback in the wild before sinking more time into it.

It's obvious to developers reading up about versioned that this option exists (as an opt-in for now)

We still need to add some docs to versioned, right? It should point out the current limitations to developers, and then point to this module as an early stage approach.

For anyone looking back on this issue: it has been closed with the introduction of new experimental modules that relate versions to one another.

See the docs: https://docs.silverstripe.org/en/4/developer_guides/model/versioning/
Or the base module: https://github.com/silverstripe/silverstripe-versioned-snapshots