dolthub / dolt

Dolt – Git for Data

Custom/Empty/Manual Merge

timsehn opened this issue

A customer we talked to overrides Dolt's merge logic because they want to implement custom rules on merges that Dolt does not give visibility into. For instance, for their use case, a data change and a schema change on the same table is "a conflict" and they prompt their users to resolve. The example they gave was a row addition and an added column where the user must fill in the new column for their new row in order to complete the merge.

To resolve this use case, they now make the schemas match in one commit and then do a data merge, leaving two commits in their log, which is not ideal.

This brought up a number of interesting feature requests, but for this issue I will focus on the "custom merge".

  1. The user would love to be able to open an "empty merge" and pick the working set (left, right, or "dolt merged") to be the edit base. Then they would like to make edits and commit the merge. --squash already provides the "dolt merged" version; this feature request is to let the user choose the edit base.
  2. They would like appropriate status functions like dolt_is_merging() to manage the state of their merge.
  3. When the working set is ready, they would like to then make a commit and have this commit show up in their logs. This is already what happens with --squash.

I like this. This feels reminiscent of several other git/sql features, while also being distinct.

Specifically, it feels like:

  • Git's merge drivers feature
  • Dolt's conflict system tables
  • SQL's "ON DUPLICATE KEY" with row aliases
  • SQL migration scripts used for migrating application data to a new application version with a new table schema

Some of this behavior is already implemented by the above. The proposed dolt_is_merging() function is akin to the dolt_merge_status system table.

The purpose behind the custom merge workflow would be, like Git's merge drivers, to produce a commit that has multiple parents, while giving the user more control over the result of the merge. Additional hooks allow the operation to automatically resolve conflicts in a custom way, or flag a change as a conflict even if it normally wouldn't be.

Instead of having to choose the "edit base" for the custom merge though, I think we can approach this in a more SQL-expression-like way by adding functionality to Dolt's conflict system tables.

For example, imagine a user script that looks like this:

CALL DOLT_MERGE('otherBranch', '--interactive');
-- If there are schema conflicts, handle them here. Example:
UPDATE dolt_schema_conflicts SET merged_schema = our_schema WHERE table_name = 'table1';
UPDATE dolt_schema_conflicts SET merged_schema = their_schema WHERE table_name != 'table1';
CALL DOLT_MERGE('--resolve-schema');
-- Next, we resolve the data conflicts by updating the dolt_conflicts_$tablename tables (each contains a row for every changed row).
UPDATE dolt_conflicts_table1 SET b = merged_b;
UPDATE dolt_conflicts_table2 SET b = their_b;
-- For table 3, we want to report a conflict if both sides changed
UPDATE dolt_conflicts_table3 SET b = CONFLICT() WHERE our_diff_type IS NOT NULL AND their_diff_type IS NOT NULL;
CALL DOLT_MERGE('--resolve-data');

This is normal SQL that the engine can analyze and optimize.

Basically, the --interactive flag does two things:

  1. It pauses the merge before merging schemas and before merging data, regardless of whether there are unresolvable conflicts. This gives clients an opportunity to inspect and alter the data, abort even if there are no unresolvable conflicts, manually handle resolvable conflicts, and avoid expensive merge operations if the client won't actually use the result. Upon continuing, it will merge the remaining schemas (for the first continue) or the remaining data (for the second continue).
  2. It creates system tables (that exist for the duration of the merge) that surface information about the three-way diff and allow the user to make the necessary decisions. These can be an altered version of the conflict system tables, or different system tables.

Prior to the call to DOLT_MERGE("--resolve-schema"), dolt_schema_conflicts contains a row for every modified table schema. During DOLT_MERGE("--resolve-schema"), the engine will attempt to resolve conflicts, but will skip any rows that were assigned to by the previous update statements.

Prior to the call to DOLT_MERGE("--resolve-data"), dolt_conflicts_$tablename contains a row for every modified row in the table. During DOLT_MERGE("--resolve-data"), the engine will attempt to resolve conflicts, but will skip any rows that were assigned to by the previous update statements. Exception: setting a value to the special CONFLICT() function means that when we attempt to resolve, that row will report a conflict that must be manually resolved, even if the engine could automatically resolve it.
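To make the skip/flag semantics concrete, here is a small Python model of the proposed resolve step. It is only a sketch of the behavior described above, not Dolt code: `DiffRow`, `auto_merge`, `resolve_data`, and the `UNSET` marker are hypothetical names, the `CONFLICT` sentinel stands in for the proposed CONFLICT() function, and each row is reduced to a single value.

```python
from dataclasses import dataclass
from typing import Any, Optional

CONFLICT = object()  # stands in for the proposed CONFLICT() function (hypothetical)
UNSET = object()     # the user's UPDATE statements did not assign to this row


@dataclass
class DiffRow:
    """One row of a hypothetical dolt_conflicts_$tablename view, reduced to one value."""
    key: Any
    base: Optional[int]
    ours: Optional[int]
    theirs: Optional[int]
    assigned: Any = UNSET  # set by the user's UPDATE statements, if any


def auto_merge(row: DiffRow) -> Any:
    """Plain three-way merge of one value; CONFLICT if both sides diverge."""
    if row.ours == row.theirs:
        return row.ours
    if row.ours == row.base:
        return row.theirs
    if row.theirs == row.base:
        return row.ours
    return CONFLICT


def resolve_data(rows):
    """Skip rows the user assigned to; auto-merge the rest; collect conflicts."""
    merged, conflicts = {}, []
    for row in rows:
        value = auto_merge(row) if row.assigned is UNSET else row.assigned
        if value is CONFLICT:
            conflicts.append(row.key)
        else:
            merged[row.key] = value
    return merged, conflicts


rows = [
    DiffRow(key=1, base=10, ours=10, theirs=20),                     # untouched: engine merges it
    DiffRow(key=2, base=10, ours=11, theirs=12, assigned=12),        # user chose a value
    DiffRow(key=3, base=10, ours=10, theirs=30, assigned=CONFLICT),  # user forced a conflict
]
merged, conflicts = resolve_data(rows)
print(merged, conflicts)
```

Row 3 would merge cleanly on its own, but the explicit `CONFLICT` assignment makes it surface for manual resolution, which is the exception described above.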

Note that this actually fixes multiple current shortcomings in schema merges, where we currently only support resolving conflicts by choosing --ours or --theirs, which applies to every conflicting table and also makes that same choice for the data, not just the schema.

If we want this behavior to be automatic whenever the user runs dolt merge, we could wrap it in a special stored procedure that gets used as a hook:

CREATE PROCEDURE DOLT_MERGE_DRIVER()
BEGIN
    -- If there are schema conflicts, handle them here. Example:
    UPDATE dolt_schema_conflicts SET merged_schema = our_schema WHERE table_name = 'table1';
    UPDATE dolt_schema_conflicts SET merged_schema = their_schema WHERE table_name != 'table1';
    CALL DOLT_MERGE('--resolve-schema');
    -- Next, we resolve the data conflicts by updating the dolt_conflicts_$tablename tables (each contains a row for every changed row).
    UPDATE dolt_conflicts_table1 SET b = merged_b;
    UPDATE dolt_conflicts_table2 SET b = their_b;
    -- For table 3, we want to report a conflict if both sides changed
    UPDATE dolt_conflicts_table3 SET b = CONFLICT() WHERE our_diff_type IS NOT NULL AND their_diff_type IS NOT NULL;
    CALL DOLT_MERGE('--resolve-data');
END

What do we think?

Elaborating a bit more on the stored procedure concept, and what it means conceptually.

Right now we can imagine an ongoing merge being in one of four states:

  • 0: not started
  • 1: resolving schema conflicts
  • 2: resolving data conflicts
  • 3: finished

If there are conflicts that the engine can't resolve, the merge pauses and waits for the user to call merge --continue.

Essentially, merge --interactive adds two new possible states:

  • 0: not started
  • 1: inspecting schema diff (skipped unless interactive)
  • 2: resolving schema conflicts
  • 3: inspecting data diff (skipped unless interactive)
  • 4: resolving data conflicts
  • 5: finished

During (1) and (3), the user (or a stored procedure) is inspecting the schema diff or the data diff, respectively, and for each cell either flagging a conflict, setting a new value, or ignoring it. If a cell is flagged, the row will need to be manually resolved in the subsequent step, or the merge is aborted. If it is ignored, the engine will attempt to resolve it automatically in the next step.
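The two state lists can be read as one state machine in which the inspection states are skipped unless the merge is interactive. A minimal Python sketch of that reading follows; the `MergeState` names and the `next_state` and `walk` helpers are hypothetical, and the model ignores the pause that happens inside a resolving state when the engine hits an unresolvable conflict.

```python
from enum import IntEnum


class MergeState(IntEnum):
    NOT_STARTED = 0
    INSPECTING_SCHEMA_DIFF = 1       # skipped unless interactive
    RESOLVING_SCHEMA_CONFLICTS = 2
    INSPECTING_DATA_DIFF = 3         # skipped unless interactive
    RESOLVING_DATA_CONFLICTS = 4
    FINISHED = 5


INTERACTIVE_ONLY = {
    MergeState.INSPECTING_SCHEMA_DIFF,
    MergeState.INSPECTING_DATA_DIFF,
}


def next_state(state: MergeState, interactive: bool) -> MergeState:
    """Advance one step, skipping the inspection states for a non-interactive merge."""
    if state is MergeState.FINISHED:
        raise ValueError("merge already finished")
    candidate = MergeState(state + 1)
    while not interactive and candidate in INTERACTIVE_ONLY:
        candidate = MergeState(candidate + 1)
    return candidate


def walk(interactive: bool) -> list:
    """Return the numeric states visited from start to finish."""
    state = MergeState.NOT_STARTED
    path = [state.value]
    while state is not MergeState.FINISHED:
        state = next_state(state, interactive)
        path.append(state.value)
    return path


print(walk(interactive=False))
print(walk(interactive=True))
```

Using the six-state numbering, a plain merge visits states 0, 2, 4, and 5, which corresponds to the original four-state list.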

In that case, it probably makes more sense to have two stored procedures: DOLT_MERGE_SCHEMA_DRIVER and DOLT_MERGE_DATA_DRIVER. If these stored procedures exist, then a non-interactive merge will call them in place of steps (1) and (3).
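As a rough illustration of how a non-interactive merge might consult the two drivers, the Python sketch below looks each procedure up by name and runs it just before the matching engine step. The `run_non_interactive_merge` function and the dictionary of callables are assumptions made for this example; only the two procedure names come from the proposal.

```python
from typing import Callable, Dict, List


def run_non_interactive_merge(procedures: Dict[str, Callable[[], None]]) -> List[str]:
    """Return the ordered steps of a merge, calling any driver that is defined."""
    steps: List[str] = []
    for driver, engine_step in (
        ("DOLT_MERGE_SCHEMA_DRIVER", "resolve schema conflicts"),
        ("DOLT_MERGE_DATA_DRIVER", "resolve data conflicts"),
    ):
        hook = procedures.get(driver)
        if hook is not None:
            hook()  # takes the place of the interactive inspection step
            steps.append(f"ran {driver}")
        steps.append(engine_step)
    return steps


calls = []
defined = {"DOLT_MERGE_DATA_DRIVER": lambda: calls.append("data driver")}
for step in run_non_interactive_merge(defined):
    print(step)
```

With no drivers defined the merge degenerates to the two engine steps, so existing behavior would be unchanged.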

As for the new system tables needed, they're basically views into the 3-way diff. They wouldn't be stored in memory, and executing the UPDATE statement coincides with iterating over the diff.
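The "view over the diff" idea can be sketched with Python generators: the diff is produced lazily, and the update is applied as each entry streams past, so no conflict table is ever built up front. This is an illustration only. `three_way_diff` and `apply_update` are hypothetical helpers, and the plain dicts standing in for the three table versions are in memory purely to keep the example short.

```python
from typing import Callable, Dict, Iterator, Optional, Tuple

Row = Optional[dict]
DiffEntry = Tuple[int, Row, Row, Row]  # (key, base, ours, theirs)


def three_way_diff(base: Dict[int, dict], ours: Dict[int, dict],
                   theirs: Dict[int, dict]) -> Iterator[DiffEntry]:
    """Lazily yield one entry per key that changed on either side."""
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o != b or t != b:
            yield key, b, o, t


def apply_update(diff: Iterator[DiffEntry],
                 assign: Callable[[Row, Row, Row], Row]) -> Iterator[Tuple[int, Row]]:
    """Model of an UPDATE over the view: each diff entry is visited exactly once."""
    for key, b, o, t in diff:
        yield key, assign(b, o, t)


base = {1: {"b": 1}, 2: {"b": 2}}
ours = {1: {"b": 1}, 2: {"b": 5}}
theirs = {1: {"b": 1}, 2: {"b": 2}, 3: {"b": 9}}

# Comparable in spirit to: UPDATE dolt_conflicts_table2 SET b = their_b;
result = list(apply_update(three_way_diff(base, ours, theirs), lambda b, o, t: t))
print(result)
```

Key 1 never appears because neither side changed it, which matches the view containing only modified rows.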

I suspect that this can be used to achieve most/all of the modes described in #7681, although we might still want to support those since they're simpler to use.

Another possible approach might be to allow users to define a stored function for each table that takes the three rows and returns the merged row (or CONFLICT()), and to call that function for each modified row during merging. That might be simpler while keeping most of the same benefits.
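A per-table merge function could look roughly like the Python sketch below, which reuses the "table 3" rule from the earlier script (report a conflict whenever both sides changed the row). The `MERGE_FUNCTIONS` registry, `merge_row`, and `default_merge` are hypothetical names, and `CONFLICT` again stands in for the proposed CONFLICT() function.

```python
from typing import Any, Callable, Dict, Optional

CONFLICT = object()  # stands in for the proposed CONFLICT() function (hypothetical)
Row = Optional[dict]


def default_merge(base: Row, ours: Row, theirs: Row) -> Any:
    """Plain three-way merge used when a table has no custom function."""
    if ours == theirs:
        return ours
    if ours == base:
        return theirs
    if theirs == base:
        return ours
    return CONFLICT


def merge_table3(base: Row, ours: Row, theirs: Row) -> Any:
    """Custom rule: any row changed on both sides is a conflict, even if the edits agree."""
    if ours != base and theirs != base:
        return CONFLICT
    return default_merge(base, ours, theirs)


# Hypothetical registry mapping a table name to its user-defined merge function.
MERGE_FUNCTIONS: Dict[str, Callable[[Row, Row, Row], Any]] = {"table3": merge_table3}


def merge_row(table: str, base: Row, ours: Row, theirs: Row) -> Any:
    """Called once per modified row during the merge."""
    return MERGE_FUNCTIONS.get(table, default_merge)(base, ours, theirs)


same_edit = ({"b": 1}, {"b": 2}, {"b": 2})  # both sides made the same change
print(merge_row("table1", *same_edit))
print(merge_row("table3", *same_edit) is CONFLICT)
```

The same identical edit merges cleanly for `table1` but is flagged for `table3`, which shows how a function can mark a change as a conflict even when the engine would normally accept it.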