automerge / autosurgeon

Removing outer elements in nested `Vec` produces unexpected results

xlambein opened this issue

I've noticed what I think is a bug in autosurgeon.

In the test that follows, I create a struct holding a Vec of three strings and reconcile it into a document. I then fork the document, remove one element in the original and a different element in the fork, and merge the two back together. The merged document has both elements removed, as expected:

#[test]
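fn remove_vec_of_strings() {
    // Imports the test relies on, scoped to the function so the snippet stands alone.
    use automerge::AutoCommit;
    use autosurgeon::{hydrate, reconcile, Hydrate, Reconcile};
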
    #[derive(Hydrate, Reconcile)]
    struct Data {
        rows: Vec<String>,
    }

    // Create data with 3 rows
    let mut data1 = Data {
        rows: vec!["hello".to_owned(), "world".to_owned(), "foobar".to_owned()],
    };
    let mut doc1 = AutoCommit::new();
    reconcile(&mut doc1, &data1).unwrap();

    // Fork into another document
    let mut doc2 = doc1.fork();
    let mut data2: Data = hydrate(&doc2).unwrap();

    // Remove row 0 in first document, and row 1 in second
    data1.rows.remove(0);
    data2.rows.remove(1);
    reconcile(&mut doc1, data1).unwrap();
    reconcile(&mut doc2, data2).unwrap();

    // Merge documents
    doc1.merge(&mut doc2).unwrap();
    let data_merged: Data = hydrate(&doc1).unwrap();

    // Both rows 0 and 1 have been removed
    assert_eq!(data_merged.rows.len(), 1);
    assert_eq!(&data_merged.rows[0], "foobar");
}

However, in the next test I do exactly the same thing, except that the document contains a Vec of three Vecs of bytes. Here the results are different: the merged document has two elements, and some of the values are scrambled. Note that I'm aware of ByteVec, but I'm deliberately not using it in this example.

#[test]
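fn remove_vec_of_bytes() {
    // Imports the test relies on, scoped to the function so the snippet stands alone.
    use automerge::AutoCommit;
    use autosurgeon::{hydrate, reconcile, Hydrate, Reconcile};
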
    #[derive(Hydrate, Reconcile)]
    struct Data {
        rows: Vec<Vec<u8>>,
    }

    // Same as above
    let mut data1 = Data {
        rows: vec![b"hello".to_vec(), b"world".to_vec(), b"foobar".to_vec()],
    };
    let mut doc1 = AutoCommit::new();
    reconcile(&mut doc1, &data1).unwrap();

    let mut doc2 = doc1.fork();
    let mut data2: Data = hydrate(&doc2).unwrap();

    data1.rows.remove(0);
    data2.rows.remove(1);

    reconcile(&mut doc1, data1).unwrap();
    reconcile(&mut doc2, data2).unwrap();
    doc1.merge(&mut doc2).unwrap();
    let data_merged: Data = hydrate(&doc1).unwrap();

    // There are two rows, with the following values:
    assert_eq!(data_merged.rows.len(), 2);
    assert_eq!(&data_merged.rows[0], b"world");
    assert_eq!(&data_merged.rows[1], b"ffoobaobar");
    // Instead, I'd expect to have the same results as the previous test
}

I tried doing the same thing manually with automerge, i.e. creating a document with nested lists, forking, removing elements, and merging again, and I don't get any weird results there. That leads me to conclude that autosurgeon's implementation must have an issue somewhere, or perhaps that I misunderstood something.

Same test again, using `automerge` directly:
#[test]
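fn remove_vec_of_bytes_automerge() {
    // The mutating calls (`put_object`, `insert`, `delete`) come from `Transactable`.
    // The read calls (`length`, `get`, `list_range`) are assumed to come from `ReadDoc`,
    // as in automerge 0.4; older releases expose them through `Transactable` instead.
    use automerge::{transaction::Transactable, AutoCommit, ReadDoc};
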
    let mut doc1 = AutoCommit::new();
    let rows = doc1
        .put_object(automerge::ROOT, "rows", automerge::ObjType::List)
        .unwrap();

    let row = doc1
        .insert_object(&rows, 0, automerge::ObjType::List)
        .unwrap();
    for (i, b) in b"hello".into_iter().enumerate() {
        doc1.insert(&row, i, *b as u64).unwrap();
    }
    let row = doc1
        .insert_object(&rows, 1, automerge::ObjType::List)
        .unwrap();
    for (i, b) in b"world".into_iter().enumerate() {
        doc1.insert(&row, i, *b as u64).unwrap();
    }
    let row = doc1
        .insert_object(&rows, 2, automerge::ObjType::List)
        .unwrap();
    for (i, b) in b"foobar".into_iter().enumerate() {
        doc1.insert(&row, i, *b as u64).unwrap();
    }

    let mut doc2 = doc1.fork();

    doc1.delete(&rows, 0).unwrap();
    doc2.delete(&rows, 1).unwrap();
    assert_eq!(doc1.length(&rows), 2);
    assert_eq!(doc2.length(&rows), 2);

    doc1.merge(&mut doc2).unwrap();
    assert_eq!(doc1.length(&rows), 1);
    assert_eq!(
        doc1.list_range(doc1.get(&rows, 0).unwrap().unwrap().1, ..)
            .map(|(_, value, _)| value.to_u64().unwrap() as u8)
            .collect::<Vec<_>>(),
        b"foobar"
    );
}

This is due to a possibly surprising feature of the Reconcile implementation for vectors. If the Reconcile implementation of the vector's element type does not specify a key, then we diff structurally. This means that each item in the incoming sequence (i.e. the new data we are putting into the document) is reconciled with the value at the same index in the sequence that is already in the document. If there are more incoming items than existing ones, the extra items are inserted after the existing ones.
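To make the positional pairing concrete, here is a small standalone model of it applied to the rows from the failing test. It uses only the standard library and is not autosurgeon's code: `Step` and `positional_plan` are hypothetical names, and deleting trailing stored rows when the incoming list is shorter is an assumption inferred from the merged length reported in the issue.

```rust
/// One step of a key-less (positional) reconcile of the outer list.
#[derive(Debug, PartialEq)]
enum Step {
    /// Both sides have an element at `index`: the stored one is reconciled in place.
    ReconcileInPlace { index: usize, stored: String, incoming: String },
    /// The incoming list is longer: the extra element is inserted after the existing ones.
    Insert { index: usize, incoming: String },
    /// ASSUMPTION: the incoming list is shorter, so the trailing stored element is deleted.
    Delete { index: usize },
}

/// Pair stored and incoming rows purely by index, ignoring their content.
fn positional_plan(stored: &[&str], incoming: &[&str]) -> Vec<Step> {
    let mut plan = Vec::new();
    for index in 0..stored.len().max(incoming.len()) {
        let step = match (stored.get(index), incoming.get(index)) {
            (Some(s), Some(i)) => Step::ReconcileInPlace {
                index,
                stored: (*s).to_string(),
                incoming: (*i).to_string(),
            },
            (None, Some(i)) => Step::Insert { index, incoming: (*i).to_string() },
            (Some(_), None) => Step::Delete { index },
            (None, None) => unreachable!("index is below the longer length"),
        };
        plan.push(step);
    }
    plan
}

fn main() {
    let stored = ["hello", "world", "foobar"];
    // doc1 removed row 0 and doc2 removed row 1 before reconciling.
    for incoming in [["world", "foobar"], ["hello", "foobar"]] {
        println!("{:?}", positional_plan(&stored, &incoming));
    }
}
```

In this model neither document deletes the row the user removed. Both rewrite row 1 from "world" to "foobar" and both drop row 2, which is consistent with the two-row result in the failing test.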

"Reconcile with the value at the same location" will do different things depending on the Reconcile implementation of the incoming element. In the case of nested sequences as presented here, it will run a Hunt-Szymanski diff between the two inner sequences (the contained elements are scalars, and therefore they do have a Reconcile::Key).
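The scrambled row can be reproduced with a toy model of that inner diff. The sketch below rests on several assumptions. It uses a textbook LCS table instead of Hunt-Szymanski, so the alignment it picks may differ from autosurgeon's. It assumes both forks produce the same edit script for the row that changes from "world" to "foobar". And it assumes that on merge, kept and deleted characters are shared while each fork's inserted run survives as a contiguous block. `Op`, `diff`, `apply` and `merge_two_replicas` are hypothetical names, not library APIs.

```rust
/// One edit in a character-level diff script.
#[derive(Debug, Clone, PartialEq)]
enum Op {
    Keep(char),
    Delete(char),
    Insert(char),
}

/// Diff `old` into `new` using a plain LCS table (a stand-in for Hunt-Szymanski).
fn diff(old: &str, new: &str) -> Vec<Op> {
    let a: Vec<char> = old.chars().collect();
    let b: Vec<char> = new.chars().collect();
    let (n, m) = (a.len(), b.len());
    // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..].
    let mut lcs = vec![vec![0usize; m + 1]; n + 1];
    for i in (0..n).rev() {
        for j in (0..m).rev() {
            lcs[i][j] = if a[i] == b[j] {
                lcs[i + 1][j + 1] + 1
            } else {
                lcs[i + 1][j].max(lcs[i][j + 1])
            };
        }
    }
    // Walk forward through the table, emitting one op per step.
    let (mut i, mut j) = (0, 0);
    let mut script = Vec::new();
    while i < n || j < m {
        if i < n && j < m && a[i] == b[j] {
            script.push(Op::Keep(a[i]));
            i += 1;
            j += 1;
        } else if j == m || (i < n && lcs[i + 1][j] >= lcs[i][j + 1]) {
            script.push(Op::Delete(a[i]));
            i += 1;
        } else {
            script.push(Op::Insert(b[j]));
            j += 1;
        }
    }
    script
}

/// Apply the script on a single replica.
fn apply(script: &[Op]) -> String {
    script
        .iter()
        .filter_map(|op| match op {
            Op::Keep(c) | Op::Insert(c) => Some(*c),
            Op::Delete(_) => None,
        })
        .collect()
}

/// ASSUMED merge model: two replicas ran the same script concurrently.
/// Keeps and deletes coincide, but every run of inserts exists once per replica.
fn merge_two_replicas(script: &[Op]) -> String {
    let mut out = String::new();
    let mut run = String::new();
    for op in script {
        match op {
            Op::Insert(c) => run.push(*c),
            other => {
                // Flush the pending insert run once for each replica.
                out.push_str(&run);
                out.push_str(&run);
                run.clear();
                if let Op::Keep(c) = other {
                    out.push(*c);
                }
            }
        }
    }
    out.push_str(&run);
    out.push_str(&run);
    out
}

fn main() {
    let script = diff("world", "foobar");
    println!("{}", apply(&script)); // prints "foobar"
    println!("{}", merge_two_replicas(&script)); // prints "ffoobaobar"
}
```

Under these assumptions the model yields the same "ffoobaobar" that the failing test reports: the characters "o" and "r" are kept once, while the inserted runs "f" and "oba" each appear twice.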

Now, you can avoid this behavior by adding a key for the nested sequences. This would require a newtype for the nested Vec<u8>, along with an implementation of Reconcile for it.
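As a rough illustration of why a key helps, the following standalone model matches rows by key instead of by index. It is not the autosurgeon API: `keyed_deletions` and `merge_deletions` are hypothetical helpers, the key is assumed to be the row's full content, and the matching is a simple greedy pass that only handles removals.

```rust
use std::collections::BTreeSet;

/// Indices of stored rows that a keyed reconcile would delete.
/// ASSUMPTION: the key of a row is its content, and `incoming` is `stored`
/// with some rows removed (no inserts or reordering).
fn keyed_deletions(stored: &[&str], incoming: &[&str]) -> BTreeSet<usize> {
    let mut deleted = BTreeSet::new();
    let mut next = 0; // next incoming row still waiting for a match
    for (index, row) in stored.iter().enumerate() {
        if next < incoming.len() && *row == incoming[next] {
            next += 1; // keys match: the stored row is left untouched
        } else {
            deleted.insert(index); // no match at this point: the stored row is deleted
        }
    }
    deleted
}

/// Merge two sets of concurrent deletions: a row survives only if neither side deleted it.
fn merge_deletions(stored: &[&str], a: &BTreeSet<usize>, b: &BTreeSet<usize>) -> Vec<String> {
    stored
        .iter()
        .enumerate()
        .filter(|(index, _)| !a.contains(index) && !b.contains(index))
        .map(|(_, row)| (*row).to_string())
        .collect()
}

fn main() {
    let stored = ["hello", "world", "foobar"];
    let doc1 = keyed_deletions(&stored, &["world", "foobar"]); // row 0 removed
    let doc2 = keyed_deletions(&stored, &["hello", "foobar"]); // row 1 removed
    println!("{:?}", merge_deletions(&stored, &doc1, &doc2)); // prints ["foobar"]
}
```

With keys, each fork records a deletion of the row that was actually removed, so the merge leaves only "foobar", which is the outcome the issue expected.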

This is clearly surprising behavior, but I'm not sure what a better default would be. It's possible that in almost all cases you will want a key, and so we should make not providing a key opt-in (I haven't thought through how that would be done in practice, or whether it can be done with the current Reconcile trait).

I see, thanks for the explanation! I better understand the problem now.

As you said, I'm not sure either if there's a better default. Perhaps it would be enough to document this more, although to be fair, re-reading the documentation for Reconcile, it clearly states that a key is necessary for vectors to be reconciled properly. I just hadn't really made the connection to this problem.

By the way, digging more into this, I realized that Uuids in the uuid feature don't have a key (because ByteVec doesn't either), which means that a Vec of them will not reconcile very well. I find that surprising, since all other fixed-size types have one. If you agree, I'll open a PR for that :-)

I agree that Uuid should have a key and would welcome such a contribution 🙂