arxanas / git-branchless

High-velocity, monorepo-scale workflow for Git

Support undoing staged changes

arxanas opened this issue

Motivation

git undo can't undo everything, but we can get close. As of c3f4f3c, we can run arbitrary code after a git command (assuming that the user has set up an alias). So it should be possible to track staged changes after each git command, and allow them to be undone as well.

The end goal is that we should be able to drop back into a merge conflict resolution (as part of a rebase or merge) and continue from there, if we decided that we did it wrong previously. That is, you should be able to run git undo, and then run git status and see a file marked as "both modified" which needs to be resolved. Then you should be able to run e.g. git rebase --continue to resolve the conflict and continue rebasing.

Background

Git's object database is a content-addressed key-value store. The key is the hash of the content, and the value can be one of these types of objects:

  • Blob: a chunk of binary data (e.g. a file).
  • Tree: a mapping from path to object, representing a hierarchical structure. The objects can be blobs, trees, or sometimes commits (in the case of Git submodules). The objects may also be annotated with some restricted metadata, like the Unix file permissions.
  • Commit: the aggregate combination of a tree, commit message, and list of parent commit objects.
  • Tag: not relevant here.
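As a small illustration of "the key is the hash of the content", the sketch below computes the object ID of a blob the way Git does for SHA-1 repositories: it hashes a short type-and-size header followed by the raw bytes. The function name git_object_id is made up for this example, and repositories using the SHA-256 object format are assumed to be out of scope.

```python
import hashlib


def git_object_id(kind: str, content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to `content` stored as `kind`."""
    # Git hashes a header of the form "<type> <size>\0" followed by the raw bytes.
    header = f"{kind} {len(content)}\0".encode("ascii")
    return hashlib.sha1(header + content).hexdigest()


# The empty blob has a well-known ID in every SHA-1 repository.
empty_blob = git_object_id("blob", b"")
print(empty_blob)  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the key depends only on the content, storing the same blob twice costs nothing extra, which is part of why reusing the object database for snapshots is attractive.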

The index is an on-disk structure which is used to liaise between the working copy and the repository contents. It contains a sorted list of files (but no directories!), each of which has the following interesting attributes for our purposes:

  • The file's last-modified time, used to optimize index refreshes.
  • Object ID: the OID for the file (a blob).
  • Stage: a value from 0-3. The 0 value corresponds to a regular file, and values 1-3 correspond to the file in a state of merge conflict. See man git-merge for more details. It's possible for the same file path to exist multiple times in the index with different stage values.
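To make the stage attribute concrete, here is a minimal sketch that parses text in the format printed by git ls-files --stage (mode, object ID, stage, then a tab and the path) and reports which paths are in conflict. The sample text and its object IDs are placeholders, not output from a real repository, and the parser assumes unquoted paths (Git quotes unusual paths unless -z is used).

```python
from typing import List, NamedTuple


class IndexEntry(NamedTuple):
    mode: str
    oid: str
    stage: int  # 0 = regular entry; 1 = common ancestor, 2 = ours, 3 = theirs
    path: str


def parse_ls_files_stage(output: str) -> List[IndexEntry]:
    """Parse lines of the form '<mode> <oid> <stage>\\t<path>'."""
    entries = []
    for line in output.splitlines():
        if not line:
            continue
        meta, path = line.split("\t", 1)
        mode, oid, stage = meta.split(" ")
        entries.append(IndexEntry(mode, oid, int(stage), path))
    return entries


def conflicted_paths(entries: List[IndexEntry]) -> List[str]:
    """Paths that have at least one entry with a non-zero stage."""
    return sorted({entry.path for entry in entries if entry.stage != 0})


# Placeholder data: one clean file, and one file present at stages 1, 2 and 3.
SAMPLE_OUTPUT = "\n".join([
    "100644 " + "1" * 40 + " 0\tREADME.md",
    "100644 " + "2" * 40 + " 1\tsrc/main.rs",
    "100644 " + "3" * 40 + " 2\tsrc/main.rs",
    "100644 " + "4" * 40 + " 3\tsrc/main.rs",
])

entries = parse_ls_files_stage(SAMPLE_OUTPUT)
print(conflicted_paths(entries))
```

Note that src/main.rs appears three times with different stage values, which is exactly the situation a snapshot format has to be able to represent.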

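Since the index stores the merge stage, it contains more information than can be represented straightforwardly in a tree object: a tree maps each path to exactly one object, whereas a conflicted path can appear in the index up to three times. Storing snapshots of the index will therefore have to be more clever than writing the index out as a tree.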

Design

Data

The idea is to store staging events in the event-log database. The important information is Git's index file, plus any auxiliary status files under the .git directory (like the rebase plan) that may be present. Collectively, let us call these files a "stage".

We can reuse Git's object database to store snapshots of the relevant files in the stage. None of these files are tracked by the repository normally.

However, the index file may be too big to copy and store into the object database every time we run a command (tens of megabytes in practice). So instead of storing the entire index, we'll store the parent commit ID, plus the differences from that commit's tree to the current index. Experiments indicate that calculating the diff doesn't take too long (<1s on my maxed-out 2019 MacBook), and we can also skip calculating the diff altogether if the index doesn't appear to have been modified since the most recent stage.
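A rough sketch of what "the differences from that commit's tree to the current index" could look like, using plain dictionaries in place of real Git structures. The representation is an assumption made for illustration: the parent tree is flattened to a path-to-OID mapping, the index is keyed by (path, stage), and the names IndexDiff and diff_index are hypothetical.

```python
from typing import Dict, NamedTuple, Set, Tuple

IndexKey = Tuple[str, int]  # (path, stage)


class IndexDiff(NamedTuple):
    changed: Dict[IndexKey, str]  # entries added or altered relative to the parent tree
    removed: Set[IndexKey]        # parent-tree entries that are absent from the index


def diff_index(parent_tree: Dict[str, str], index: Dict[IndexKey, str]) -> IndexDiff:
    """Compute what must be stored to rebuild `index` from `parent_tree`."""
    # A tree can only describe stage-0 entries, so that is the baseline.
    base = {(path, 0): oid for path, oid in parent_tree.items()}
    changed = {key: oid for key, oid in index.items() if base.get(key) != oid}
    removed = {key for key in base if key not in index}
    return IndexDiff(changed, removed)


# Illustrative values; the short strings stand in for real object IDs.
parent_tree = {"a.txt": "A1", "b.txt": "B1", "c.txt": "C1"}
index = {
    ("a.txt", 0): "A1",  # untouched
    ("b.txt", 1): "B0",  # conflict: common ancestor
    ("b.txt", 2): "B1",  # conflict: ours
    ("b.txt", 3): "B2",  # conflict: theirs
    ("d.txt", 0): "D1",  # newly staged file
}
diff = diff_index(parent_tree, index)
print(sorted(diff.changed), sorted(diff.removed))
```

The untouched a.txt entry contributes nothing to the diff, so in a large repository with a small set of staged changes the stored data stays small.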

There can only be one stage active for a given worktree at a time.

Logically, a stage which we're persisting has these fields:

  • Timestamp (to compare to the index file on disk for optimization).
  • Parent commit OID.
  • A tree containing:
    • An index file which represents a diff from the parent commit.
    • All the auxiliary files.
    • Any blobs referenced by the index file diff (to keep them live for garbage collection).
  • Possibly a handle to the previous stage (temporally-speaking).
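The fields above could be modelled roughly as follows. This is only a sketch: the class and attribute names are hypothetical, the tree is referred to by its object ID, and the timestamp comparison is deliberately conservative, since equal timestamps cannot rule out a modification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StageSnapshot:
    timestamp: float                          # when the stage was recorded
    parent_commit_oid: str                    # logical base commit
    tree_oid: str                             # tree holding the index diff, auxiliary files and blobs
    previous_stage_oid: Optional[str] = None  # optional link to the preceding stage

    def index_may_have_changed(self, index_mtime: float) -> bool:
        """Cheap check used to decide whether the diff needs recomputing.

        Treats an equal timestamp as "maybe changed" to stay on the safe side.
        """
        return index_mtime >= self.timestamp


# Illustrative values only.
snapshot = StageSnapshot(timestamp=1000.0, parent_commit_oid="c0ffee", tree_oid="7ree")
print(snapshot.index_may_have_changed(index_mtime=999.0))   # False: index is older than the stage
print(snapshot.index_may_have_changed(index_mtime=1001.5))  # True: index was touched afterwards
```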

There is technically no requirement that the tree be associated with a commit of its own, but the easiest implementation would create a commit for each stage with the logical base commit as its parent. It would then automatically render in the correct place in the smartlog, and we could ensure the stage is kept live by the existing garbage collection integration.

Events

In the event log, we'll store stage events which contain the above fields. The active stage for a given point in time is determined by the most recent stage event at or before that point. That event carries the parent commit plus the diff from it, which is enough to recreate the stage.

There is no Git hook which triggers when the index changes. For now, the best we can do is wrap the git command, check the stage after each command invocation, and store an event if it appears to have changed. Note that since the stage contains a reference to the parent commit, we will need to emit another stage event whenever the checked-out commit changes (or derive the changed parent commit after the fact).
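The "check after each command, store an event if it changed" step might look something like the sketch below. It assumes a simplified in-memory log containing only stage events; StageEvent, active_stage and record_if_changed are hypothetical names, and how the wrapper obtains the current parent commit and stage tree is left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class StageEvent:
    timestamp: float
    parent_commit_oid: str
    stage_tree_oid: str


def active_stage(event_log: List[StageEvent]) -> Optional[StageEvent]:
    """The active stage is whatever the most recent stage event recorded."""
    return event_log[-1] if event_log else None


def record_if_changed(
    event_log: List[StageEvent],
    timestamp: float,
    parent_commit_oid: str,
    stage_tree_oid: str,
) -> bool:
    """Append a stage event unless nothing changed; return True if one was stored."""
    current = active_stage(event_log)
    if current is not None and (
        current.parent_commit_oid == parent_commit_oid
        and current.stage_tree_oid == stage_tree_oid
    ):
        return False
    event_log.append(StageEvent(timestamp, parent_commit_oid, stage_tree_oid))
    return True


log: List[StageEvent] = []
record_if_changed(log, 1.0, "commit-a", "stage-1")  # first observation is stored
record_if_changed(log, 2.0, "commit-a", "stage-1")  # nothing changed, skipped
record_if_changed(log, 3.0, "commit-b", "stage-1")  # checked-out commit moved, stored
print(len(log))  # 2
```

The third call shows the case called out above: the stage contents are identical, but the parent commit changed, so a new event is still emitted.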

It would also be reasonable to use the same kind of event to track both commits and stages.

Undo

The inverse of a stage event is the stage event which preceded it. Unlike all other events at the time of this writing, this inverse event is not determined entirely by a given event's contents; it is sensitive to the events which have happened before it.
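A sketch of why the inverse depends on history: to invert a stage event, we have to look backwards through the log for the stage event that preceded it. The event types here are simplified stand-ins (OtherEvent represents any non-stage event), and inverse_of_stage_event is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Union


@dataclass(frozen=True)
class StageEvent:
    parent_commit_oid: str
    stage_tree_oid: str


@dataclass(frozen=True)
class OtherEvent:
    description: str  # stand-in for commit, ref-move and similar events


Event = Union[StageEvent, OtherEvent]


def inverse_of_stage_event(event_log: Sequence[Event], position: int) -> Optional[StageEvent]:
    """Return the stage event preceding the one at `position`, if any.

    None means there was no earlier stage, i.e. nothing recorded to restore.
    """
    if not isinstance(event_log[position], StageEvent):
        raise ValueError("the event at this position is not a stage event")
    for earlier in reversed(event_log[:position]):
        if isinstance(earlier, StageEvent):
            return earlier
    return None


log = [
    StageEvent("commit-a", "stage-1"),
    OtherEvent("some unrelated event"),
    StageEvent("commit-a", "stage-2"),
]
print(inverse_of_stage_event(log, 2))  # the first event in the log
```

The same StageEvent value placed after a different history would have a different inverse, which is what sets it apart from events whose inverse can be read off their own contents.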

To undo to a previous stage, we need to check out the corresponding commit, and then apply the diff. This means overwriting files in the .git directory. After doing this, git status will show that the working copy files are different from their staged versions. Unless we also decide to track unstaged changes (!), we can't restore the working copy files to their old versions. We should either leave them untouched, update them to their staged versions, or possibly attempt to merge them together.
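The "check out the commit, then apply the diff" step can be sketched with dictionaries standing in for the real index. Everything here is illustrative: the parent tree is a flattened path-to-OID mapping, the stored diff is assumed to consist of changed (path, stage) entries plus removed entries, and restore_index and unmerged_paths are hypothetical names. Restoring auxiliary files and the working copy is not modelled.

```python
from typing import Dict, Iterable, List, Tuple

IndexKey = Tuple[str, int]  # (path, stage)


def restore_index(
    parent_tree: Dict[str, str],
    changed: Dict[IndexKey, str],
    removed: Iterable[IndexKey],
) -> Dict[IndexKey, str]:
    """Rebuild a previous index from the parent commit's tree plus a stored diff."""
    # Start from what a plain checkout of the parent commit would stage.
    index = {(path, 0): oid for path, oid in parent_tree.items()}
    for key in removed:
        index.pop(key, None)
    index.update(changed)
    return index


def unmerged_paths(index: Dict[IndexKey, str]) -> List[str]:
    """Paths with a non-zero stage, which Git would report as needing resolution."""
    return sorted({path for (path, stage) in index if stage != 0})


# Illustrative values; the short strings stand in for real object IDs.
parent_tree = {"a.txt": "A1", "b.txt": "B1"}
changed = {("b.txt", 1): "B0", ("b.txt", 2): "B1", ("b.txt", 3): "B2"}
removed = [("b.txt", 0)]

index = restore_index(parent_tree, changed, removed)
print(unmerged_paths(index))  # ['b.txt']
```

In this example the restored index again has b.txt at stages 1 to 3, which is the state in which the conflict could be resolved a second time.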