jamescourtney / FlatSharp

Fast, idiomatic C# implementation of Flatbuffers

Share table

helloghworld opened this issue

This is the schema I used for testing:

table TableType
{
    Id : string;
}

table MyTable (fs_serializer)
{
    InnerTable1 : TableType;
    InnerTable2 : TableType;
}

C# code:

MyTable myTable = new MyTable();
TableType innerTable1 = new TableType
{
    Id = "1"
};

myTable.InnerTable1 = innerTable1;
myTable.InnerTable2 = innerTable1;

int maxSize = MyTable.Serializer.GetMaxSize(myTable);
byte[] buffer = new byte[maxSize];
int bytesWritten = MyTable.Serializer.Write(buffer, myTable);
            
MyTable output = MyTable.Serializer.Parse(buffer);
int hash1 = output.InnerTable1.GetHashCode();
int hash2 = output.InnerTable2.GetHashCode();

Result: bytesWritten is 62, and hash1 is different from hash2.

Some extra info:

  • If I assign null to myTable.InnerTable2, bytesWritten is 34.
  • If I assign a new TableType instance (rather than innerTable1) to myTable.InnerTable2, bytesWritten is still 62.
  • I also tested a vector of tables and got a similar result: repeatedly adding the same table to the vector caused bytesWritten to grow rapidly (a sketch of that test is below).
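
For reference, the vector test looked roughly like this (the VectorTable schema and names here are just for illustration, not my original schema):

// Assumed schema for this sketch:
// table VectorTable (fs_serializer) { Items : [ TableType ]; }
var shared = new TableType { Id = "1" };

VectorTable vectorTable = new VectorTable
{
    // The same instance added four times.
    Items = new List<TableType> { shared, shared, shared, shared }
};

// Each occurrence is serialized as a separate copy, so the size grows with
// the number of entries rather than with the number of distinct tables.
int size = VectorTable.Serializer.GetMaxSize(vectorTable);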

I've performed a similar test with the official FlatBuffers implementation: only an extra 4 bytes were allocated for each additional field pointing at the same table, and the hashes were the same. Is FlatSharp expected to work like this, or am I using it the wrong way?

@jamescourtney

[ Deleted previous response as I misunderstood your question ]

What you're looking for is object deduplication, which is a feature FlatSharp does not currently support. Let me shed some light on why.

Internally, FlatBuffers stores references (tables, vectors, strings, unions) as the uoffset datatype. uoffset is a 32-bit unsigned integer, so an offset can only point forward in the buffer, never backward. The implication here is that if a table needs to be shared, it must be written to the right of all of the references to it.

This naturally happens for Google's FlatBuffer implementation because you build the buffer yourself going from right-to-left, and the Offset<T> object is reusable.
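
In rough code, with flatc-generated C# classes for a plain FlatBuffers version of the schema above (generated names assumed), the reuse looks like this:

// Sketch only: assumes flatc-generated C# code for TableType and MyTable
// (without the fs_serializer attribute) and the Google.FlatBuffers package.
var fbb = new FlatBufferBuilder(64);

var id = fbb.CreateString("1");
var inner = TableType.CreateTableType(fbb, id); // Offset<TableType>

// The same Offset<TableType> is passed for both fields, so the table body is
// written once and both fields point at it.
var root = MyTable.CreateMyTable(fbb, inner, inner);
fbb.Finish(root.Value);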

FlatSharp builds from left-to-right. This was a decision I made very early on. FlatSharp's stated goal is to be idiomatic and to fit in naturally with the rest of the C# ecosystem. It is unusual to see arrays with left padding in C#, so FlatSharp's buffers always start at the left. This allows the API to be pretty natural:

int bytesWritten = SomeType.Serializer.Write(buffer, someObject);

If I had it to do over again today, I would reevaluate the buffer ordering decision for a few reasons:

  • More APIs support Span<T> now, so working with left-padded arrays is more natural and easier.
  • FlatSharp's serialization code can be a bit awkward, since it has to reserve space for the uoffset, finish writing the table, write the next items, and then fill in the uoffset later (see the sketch after this list).
  • String deduplication could be improved; it is kind of hacky today.
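
To illustrate the second point, here is a minimal sketch (not FlatSharp's actual code) of the reserve-then-patch pattern a left-to-right writer has to use:

// Minimal sketch of left-to-right writing: reserve 4 bytes for the uoffset,
// write the referenced object further to the right, then patch the offset
// once the object's final position is known.
Span<byte> buffer = new byte[64];
int cursor = 0;

// Reserve space for the uoffset at the current position.
int uoffsetPosition = cursor;
cursor += sizeof(uint);

// ... the rest of the current table and any siblings get written here ...

// Later, the referenced table lands somewhere to the right.
int tablePosition = cursor;
buffer[cursor++] = 0x2A; // stand-in for the real table bytes

// Patch the uoffset; in FlatBuffers it is relative to where the uoffset
// itself is stored.
uint relativeOffset = (uint)(tablePosition - uoffsetPosition);
System.Buffers.Binary.BinaryPrimitives.WriteUInt32LittleEndian(
    buffer.Slice(uoffsetPosition), relativeOffset);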

Assuming FlatSharp did go right-to-left, I'm still a little bit skeptical that object deduplication would perform particularly well. Basically, when you use FlatBufferBuilder, you know that this particular table is duplicated. FlatSharp is a higher-level abstraction and would need to perform that test for every single table, in the form of a dictionary lookup, which, despite being O(1), would slow FlatSharp down tremendously.
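
For concreteness, here is a hypothetical sketch of the bookkeeping automatic deduplication would need; none of this exists in FlatSharp today:

// Hypothetical only: a reference-equality lookup the serializer would have to
// consult for every table it writes.
// (ReferenceEqualityComparer requires .NET 5 or later.)
var writtenOffsets = new Dictionary<object, int>(ReferenceEqualityComparer.Instance);

int WriteTableWithDedup(object table)
{
    // If this exact instance was already written, reuse its offset instead of
    // serializing a second copy.
    if (writtenOffsets.TryGetValue(table, out int existingOffset))
    {
        return existingOffset;
    }

    int newOffset = 0; // placeholder for "write the table and record where it went"
    writtenOffsets[table] = newOffset;
    return newOffset;
}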

Your example was a little contrived. One way I've seen people work around this is to refer to an index of a well-known vector as a "pointer" to a shared table:

table Root
{
    Items : [ Item ];
}

table Item
{
    Sub_Item_Index : int; // references the items vector from root
} 
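
After parsing, following one of these index "pointers" is just a lookup into the root's vector. A rough sketch (it assumes Root carries the fs_serializer attribute and that the generated property names match the schema):

Root root = Root.Serializer.Parse(buffer);

Item first = root.Items[0];

// Follow the index "pointer" to the shared item.
Item shared = root.Items[first.Sub_Item_Index];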

Thanks for the comprehensive reply. Using an index or an id to remove the duplication was one of the solutions I came up with as well, but it adds complexity; it would be more natural to be able to reference a table directly.