Reading and saving documents and blobs (re. PR#66)

Question

Rudiksz opened this issue 4 years ago · comments

I'm pretty sure the map/dictionary is enough and probably preferable. If a blob field has the same content_type and digest as the one already stored in the database but no actual data, it's a no-op for the blob. The blob should be preserved. If you also send data, then it's replaced.

Fixing this issue has implications in the way documents are read and saved to/from the database, and would significantly impact the platform code, hence this new discussion.

One of the issues I see currently is that the platform code is doing a lot of work parsing data both when reading and saving documents. Below is a working proof of concept that delegates most of the serialization/deserialization to the library itself, therefore greatly simplifying the plugin's code.

A short breakdown of the important bits:

The code below is entirely self-contained and sufficient to read and save documents, and it produces the exact same output as the old implementations. The Dart code requires no changes (except if we want to also handle the revision property, which is strangely missing now).
The SDK has methods which serialize/deserialize maps and they support all the types that the method channel's codec supports - except the Blobs. Therefore, except for transforming the Blob instances into dictionaries, there's no need to do any kind of custom parsing in the platform code.
When reading a document, the SDK has a 'document.toMap()' method whose result, if it weren't for the blobs, could be passed to the method channel without any changes. The "_documentToMapNew(Document doc)" method replaces all Blob instances with their dictionary counterpart. The Blob class even has a method for it: 'getProperties()'.
When saving a document, the situation is very similar. The map that the method channel passes to the Java code is the same one that the MutableDocument's constructors expect, minus the blobs. Here the situation is a bit more complex, but not by much.

If you set a @blob field with byte data in it, the MutableDocument needs a Blob instance instead of a simple dictionary.

Ex: this is a document that is being saved with a new blob
{age=10, languages=[en, es], active=true, id=person1, avatar={digest=null, @type=blob, length=null, data=[B@e3e7d9b, content_type=image/png}, settings={list=[1, 2, 3, 4, 5], a=b, x=y, map={1=one, 2=two, 3=three}}, doctype=person, name=Person1, height=5.6, birthday=2000-02-02T00:00:00.000}

If there are @blob fields in the dictionary, but their data field is empty, then the MutableDocument can deserialize them by itself and the blobs are preserved as expected. In other words, if there is no data sent, there is no need to convert the metadata into a Blob instance. With the new code that is accessing the blob files directly, this allows an important optimization on the Dart side: only send data when the data has actually changed (like when uploading a new avatar). In every other case only the metadata is ever sent. More importantly, the part that is relevant to PR#66 is that there is no need to ever keep track of or cache the blobs on the platform side. It is the responsibility of the Dart code to ensure that the blobs in a document are preserved, updated or deleted according to the business logic.

Ex: this is a document that was read in Dart using 'database.document()' and saved again, but without reading or writing the 'avatar' field.
{age=10, active=true, languages=[en, es], id=person1, avatar={digest=sha1-K5R8dtFHQYSI4daLqfJwXQgZb8k=, @type=blob, length=120, content_type=image/png}, settings={a=b, list=[1, 2, 3, 4, 5], x=y, map={1=one, 2=two, 3=three}}, doctype=person, height=5.6, name=Person1, birthday=2000-02-02T00:00:00.000}

Hint: The blob was preserved.
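
To make the rule above concrete, here is a minimal sketch in plain Java, with no Couchbase classes involved. The class name BlobSaveRule and the method needsBlobInstance are made up for illustration; only the "@type" and "data" keys come from the dictionaries shown above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the plugin or the SDK.
final class BlobSaveRule {

    /** True only for a blob dictionary that actually carries byte data. */
    static boolean needsBlobInstance(Object value) {
        if (!(value instanceof Map<?, ?>)) {
            return false;
        }
        Map<?, ?> m = (Map<?, ?>) value;
        return "blob".equals(m.get("@type")) && m.get("data") instanceof byte[];
    }

    public static void main(String[] args) {
        // A new avatar: data is present, so a Blob instance would be needed.
        Map<String, Object> newAvatar = new HashMap<>();
        newAvatar.put("@type", "blob");
        newAvatar.put("content_type", "image/png");
        newAvatar.put("data", new byte[] {1, 2, 3});

        // An untouched avatar: metadata only, so the dictionary is left as is.
        Map<String, Object> unchanged = new HashMap<>();
        unchanged.put("@type", "blob");
        unchanged.put("content_type", "image/png");
        unchanged.put("digest", "sha1-K5R8dtFHQYSI4daLqfJwXQgZb8k=");
        unchanged.put("length", 120);

        System.out.println(needsBlobInstance(newAvatar)); // true
        System.out.println(needsBlobInstance(unchanged)); // false
    }
}
```

Anything for which this returns false would be passed to the MutableDocument untouched.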

Currently this code assumes that the blobs are only at the first level of the document, and doesn't handle blobs in nested values. I'm not aware of any way to set blobs deeper in the document, but I'll try to investigate this further to confirm it.

    Map<String, Object> getDocumentWithIdNew(Database database, String _id) {
        HashMap<String, Object> resultMap = new HashMap<>();

        Document document = database.getDocument(_id);

        if (document != null) {
            resultMap.put("doc", _documentToMapNew(document));
            resultMap.put("id", document.getId());
            resultMap.put("rev", document.getRevisionID());
            resultMap.put("sequence", document.getSequence());
        } else {
            resultMap.put("doc", null);
            resultMap.put("id", _id);
        }

        return resultMap;
    }

    Map<String, Object> saveDocumentNew(Database database, Object id, Map<String, Object> data, ConcurrencyControl concurrencyControl) throws CouchbaseLiteException {
        // Blobs need special attention.
        // When sending data we replace the value in the dictionary with a Blob instance.
        // When there is no data sent, we leave the blob's metadata dictionary as is
        // and let Couchbase Core deserialize it. This ensures the blobs are preserved.
        // data may be null (see the constructor selection below), so guard the loop.
        if (data != null) {
            for (Map.Entry<String, Object> entry : data.entrySet()) {
                Object value = entry.getValue();
                if (value instanceof Map<?, ?>) {
                    Map<?, ?> v = (Map<?, ?>) value;
                    if (Objects.equals(v.get("@type"), "blob") && v.get("data") != null) {
                        String contentType = (String) v.get("content_type");
                        byte[] blobData = (byte[]) v.get("data");
                        // setValue replaces the value in place while iterating.
                        entry.setValue(new Blob(contentType, blobData));
                    }
                }
            }
        }

        MutableDocument mutableDoc;
        if (id != null && data != null) {
            mutableDoc = new MutableDocument(id.toString(), data);
        }
        else if (id == null && data == null) {
            mutableDoc = new MutableDocument();
        }
        else if (data == null) {
            mutableDoc = new MutableDocument(id.toString());
        }
        else {
            mutableDoc = new MutableDocument(data);
        }

        boolean success = database.save(mutableDoc, concurrencyControl);
        HashMap<String, Object> resultMap = new HashMap<>();
        resultMap.put("success", success);
        if (success) {
            resultMap.put("id", mutableDoc.getId());
            resultMap.put("sequence", mutableDoc.getSequence());
            resultMap.put("rev", mutableDoc.getRevisionID());
            resultMap.put("doc", _documentToMapNew(mutableDoc));
        }
        return resultMap;
    }


    private Map<String, Object> _documentToMapNew(Document doc) {
        Map<String, Object> map = doc.toMap();

        // Replace all Blob instances with their "json" metadata
        for (String key : map.keySet()) {
            if (map.get(key) instanceof Blob) {
                Map<String,Object> json =  ((Blob) map.get(key)).getProperties();
                json.put("@type", "blob");
                map.put(key,json);
            }
        }
        return map;
    }
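
If nested blobs do turn out to be possible, the top-level loop in _documentToMapNew could be made recursive. Below is a rough sketch of that idea. It assumes nothing about the SDK: BlobStub is a made-up stand-in for the SDK's Blob that only models the metadata, and NestedBlobs/toPlain are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class NestedBlobs {

    /** Made-up stand-in for the SDK's Blob class; only carries metadata. */
    static final class BlobStub {
        final String contentType;
        final String digest;
        final long length;

        BlobStub(String contentType, String digest, long length) {
            this.contentType = contentType;
            this.digest = digest;
            this.length = length;
        }

        Map<String, Object> getProperties() {
            Map<String, Object> props = new HashMap<>();
            props.put("content_type", contentType);
            props.put("digest", digest);
            props.put("length", length);
            return props;
        }
    }

    /** Returns a copy of value with every BlobStub replaced by its metadata map. */
    static Object toPlain(Object value) {
        if (value instanceof BlobStub) {
            Map<String, Object> json = new HashMap<>(((BlobStub) value).getProperties());
            json.put("@type", "blob");
            return json;
        }
        if (value instanceof Map<?, ?>) {
            Map<String, Object> out = new HashMap<>();
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                out.put(String.valueOf(e.getKey()), toPlain(e.getValue()));
            }
            return out;
        }
        if (value instanceof List<?>) {
            List<Object> out = new ArrayList<>();
            for (Object item : (List<?>) value) {
                out.add(toPlain(item));
            }
            return out;
        }
        return value; // strings, numbers, booleans, null
    }

    public static void main(String[] args) {
        Map<String, Object> settings = new HashMap<>();
        settings.put("icon", new BlobStub("image/png", "sha1-K5R8dtFHQYSI4daLqfJwXQgZb8k=", 120));
        Map<String, Object> doc = new HashMap<>();
        doc.put("name", "Person1");
        doc.put("settings", settings);
        System.out.println(toPlain(doc));
    }
}
```

The sketch copies instead of mutating in place, so it does not depend on whether the maps handed back by the SDK are modifiable.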

I have an example app with test code, that I'm cleaning up and I'll try to share in the next few days. I just wanted to put this out there for feedback.

Similar concepts can be used in handling query results too, but that's for another day. Queries return every document in the result set twice, which is not only very inefficient, but also diverges from the API.

Bryan Welter · Answer 1 · Wed Aug 12 2020 09:07:47 GMT+0800 (China Standard Time)

I pushed to the demo branch. I now only create Blob objects if they are new blobs; otherwise it's always a Map/Dictionary object, so you can more easily test the proposed solution. If it doesn't work, then we simply have to keep the cache, but if it does work, then we would just have to find a way to get what database or path the blob is from. Let me know your thoughts. I was also thinking it may be cleaner to just have a method in the blob called getContentFromDatabase; I included the code in the demo branch.

Also in order to set Blob objects deeper into the Documents we need to implement MutableFragments like the swift code has. Currently the platform code does support this so the only coding needed is dart side code.

I will see what I can come up with on the result sets later.

Rudolf Martincsek · Answer 2 · Sat Aug 15 2020 10:17:58 GMT+0800 (China Standard Time)

I checked the changes and they seem to be working fine, except for the minor caching issues.

Nested blobs are way down on my list of priorities. I can imagine a few cases where they could be nice, but it's nothing I can't easily solve otherwise.

I have made a commit on my repository with a couple of changes that I was working on the past few days regarding the json parsing of the documents and Results. It's only Java for now, because the changes are quite extensive (well, really I just removed a bunch of code) and there's no point for me to rewrite the Swift code if I can't test it.

If you want to take a look at the changes I made:
https://github.com/Rudiksz/couchbase_lite/tree/cleanup

The biggest motivation for these changes was to make the plugin as close to the SDK as possible, especially when queries are handled, and make it do as little work as possible.

Currently the plugin returns every document twice, which is quite wasteful because it effectively doubles the amount of data passed through the channels. I understand that it was done to preserve the SDK API's toList method, but that can be implemented in Dart by simply calling the stored map's "values" method.

Old query results:

[
  {
    "keys": [
      "test"
    ],
    "list": [
      {
        "name": "Person1",
        "id": "person1",
        "doctype": "person",
        "avatar": {
          "digest": "sha1-K5R8dtFHQYSI4daLqfJwXQgZb8k=",
          "@type": "blob",
          "length": 120,
          "content_type": "image/png"
        }
      }
    ],
    "map": {
      "test": {
        "name": "Person1",
        "id": "person1",
        "doctype": "person",
        "avatar": {
          "digest": "sha1-K5R8dtFHQYSI4daLqfJwXQgZb8k=",
          "@type": "blob",
          "length": 120,
          "content_type": "image/png"
        }
      }
    }
  },
  {
    "keys": [
      "test"
    ],
    "list": [
      {
        "name": "Person2",
        "id": "person2",
        "doctype": "person"
      }
    ],
    "map": {
      "test": {
        "name": "Person2",
        "id": "person2",
        "doctype": "person"
      }
    }
  }
]

These are the new query results, which is exactly how you would get them in Java. Dart's Result class api is unchanged and shouldn't cause any code breakages.

[
  {
    "test": {
      "name": "Person1",
      "id": "person1",
      "doctype": "person",
      "avatar": {
        "digest": "sha1-K5R8dtFHQYSI4daLqfJwXQgZb8k=",
        "@type": "blob",
        "length": 120,
        "content_type": "image/png"
      }
    }
  },
  {
    "test": {
      "name": "Person2",
      "id": "person2",
      "doctype": "person"
    }
  }
]

There's a small test app to see how the client code would work, that works with both the current beta branch and my cleanup branch.
https://gist.github.com/Rudiksz/66ad80de847b08a3765f8e82044e5cf0

Bryan Welter · Answer 3 · Sat Aug 15 2020 14:14:20 GMT+0800 (China Standard Time)

The reason it was done this way is for cases when you have missing vs null values, which result in fewer keys than the number of select expressions. So the size of the map will be less than the size of the list representing the select expressions.
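
To illustrate the distinction in plain Java (a sketch only; MissingVsNull and describe are made-up names, and the row below is hand-built, not real query output): a key that is absent from the map is "missing", while a key that is present with a null value is "null".

```java
import java.util.HashMap;
import java.util.Map;

final class MissingVsNull {

    /** Classifies a key as "missing", "null" or "value" for the given row. */
    static String describe(Map<String, Object> row, String key) {
        if (!row.containsKey(key)) {
            return "missing";
        }
        return row.get(key) == null ? "null" : "value";
    }

    public static void main(String[] args) {
        // HashMap accepts null values, so both cases can be represented.
        Map<String, Object> row = new HashMap<>();
        row.put("name", "Person1");
        row.put("height", null);

        System.out.println(describe(row, "name"));   // value
        System.out.println(describe(row, "height")); // null
        System.out.println(describe(row, "age"));    // missing
    }
}
```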

Rudolf Martincsek · Answer 4 · Sun Aug 16 2020 03:52:45 GMT+0800 (China Standard Time)

I'm not saving null values in the database, nor am I reading the results as a list. Couchbase is a key-value store, and that's one of the major reasons I chose it.

With that being said I did investigate a bit. As it turns out the issue is again the platform code doing its own parsing.

Doing the following type of query (which is really the only case null values can appear in result sets)

    // Test query
    var query = QueryBuilder.select([
      SelectResult.expression(Expression.property("name")),
      SelectResult.expression(Expression.property("active")),
      SelectResult.expression(Expression.property("height")),
      SelectResult.expression(Expression.property("age")),
    ])
        .from(db.name)
        .where(
            Expression.property("doctype").equalTo(Expression.string("person")))
        .limit(Expression.intValue(10));

The output is below.
On the Java side, the Result object returns the following. Notice that the map representation keeps the keys whose values are null.

(Java) result.toMap(): {name=Person2, height=5.6, active=1, age=10}
(Java) result.toList(): [Person2, 1, 5.6, 10]

(Java) result.toMap(): {name=Person33479500, height=null, active=null, age=76}
(Java) result.toList(): [Person33479500, null, null, 76]

However Dart receives the following:

(Dart) result: {keys: [height, name, active, age], list: [Person2, 1, 5.6, 10], map: {name: Person2, height: 5.6, active: 1, age: 10}}
(Dart) result: {keys: [height, name, active, age], list: [Person33479500, null, null, 76], map: {name: Person33479500, age: 76}}

Somehow the "map" value lost the null values in translation. They probably got removed by the various parsing functions.
Simply returning whatever Java's "result.toMap()" method produces would give Dart the proper dictionary, as seen below with the new code.

Dart result.map: {name: Person2, height: 5.6, active: 1, age: 10}
Dart result.map: {name: Person1647668, height: null, active: null, age: 95}

In Dart, calling "map.values.toList()" on this dictionary produces the expected value.

[Person2, 5.6, 1, 10]
[Person95618260, null, null, 50]
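
For comparison, here is the same idea as a small Java sketch. It assumes the select column names are available in order (ResultValues and valuesInOrder are made-up names), which matters only if the list has to follow the select order exactly, since a map's "values" come out in the map's own iteration order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ResultValues {

    /** Builds the list view of a row by looking up each column in order. */
    static List<Object> valuesInOrder(List<String> columns, Map<String, Object> row) {
        List<Object> out = new ArrayList<>(columns.size());
        for (String column : columns) {
            out.add(row.get(column)); // null for both null and missing values
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> columns = Arrays.asList("name", "active", "height", "age");

        Map<String, Object> row = new HashMap<>();
        row.put("name", "Person33479500");
        row.put("active", null);
        row.put("height", null);
        row.put("age", 76);

        System.out.println(valuesInOrder(columns, row)); // [Person33479500, null, null, 76]
    }
}
```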

Except for parsing the blobs, there's simply no reason to do any kind of parsing in the platform code. And there's certainly no need to send every result twice. On queries that return 200-300 documents, suddenly we are talking about 400-600 documents that need to be constantly serialized/deserialized.

The plugin should return whatever Couchbase returns and allow the developers to deal with parsing the results in whatever way it makes sense for the application.