Figure out what to do with `table_column` catalog table and bulk schema loading in general
gruuya opened this issue · comments
Currently we're not really using our `Schema` for anything but the `to_column_names_types` call when persisting the columns to the `table_column` metadata table. So it's possible to remove that `Schema` altogether and just use the underlying `arrow_schema` call (though that could be extracted to a separate function).
On a more general level, we also currently don't use anything from our `table_column` catalog table. When fetching a schema for a given table, such as in `information_schema.columns` or when calling `TableProvider::schema` somewhere in code (which is what DF uses for `information_schema.columns` queries internally as well), we always rely on the Delta table's schema, which is ultimately reconstructed from the logs. `information_schema.columns` in particular will pose a problem at some point; see the comment at lines 285 to 293 in 40b1158.
The solution I outlined in that comment really amounts to adding the ability to bulk-load Delta table schemas (which would involve changes in delta-rs and probably datafusion). A potentially better solution is for us to thinly wrap the Delta table inside our own table, use our own (bulk-loaded) catalog info in `TableProvider::schema`, and only resolve `TableProvider::scan`s using the wrapped Delta table. The main drawback there is the potential mismatch and double tracking of schemas (in our catalog and in the Delta logs), which might not be that bad.
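To make the wrapping idea a bit more concrete, here is a minimal std-only sketch. Everything in it is hypothetical: `ColumnRow`, `CatalogSchema`, `bulk_load_schemas`, `InnerTable`, `WrappedTable` and `FakeDelta` are illustrative stand-ins, not our actual types and not the real DataFusion `TableProvider` or delta-rs APIs. It only shows the shape of the approach: one catalog query feeds the schemas of all tables, `schema()` is answered from that catalog info, and the wrapped table is resolved only when `scan()` is called.

```rust
use std::collections::HashMap;

/// One row of a hypothetical query against the `table_column` catalog table.
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnRow {
    pub table: String,
    pub name: String,
    pub data_type: String,
}

/// Catalog-side schema: ordered (column name, type) pairs. Illustrative only.
pub type CatalogSchema = Vec<(String, String)>;

/// Group all catalog rows by table in a single pass, so one catalog query
/// can serve the schemas of every table (the "bulk-loading" part).
pub fn bulk_load_schemas(rows: Vec<ColumnRow>) -> HashMap<String, CatalogSchema> {
    let mut schemas: HashMap<String, CatalogSchema> = HashMap::new();
    for row in rows {
        schemas
            .entry(row.table)
            .or_default()
            .push((row.name, row.data_type));
    }
    schemas
}

/// Stand-in for the wrapped Delta table; `load` represents replaying the logs.
pub trait InnerTable {
    fn load(&mut self);
    fn scan(&self) -> Vec<String>;
}

/// Thin wrapper: schema comes from the catalog, scans go to the inner table.
pub struct WrappedTable<T: InnerTable> {
    schema: CatalogSchema,
    inner: T,
    loaded: bool,
}

impl<T: InnerTable> WrappedTable<T> {
    pub fn new(schema: CatalogSchema, inner: T) -> Self {
        Self { schema, inner, loaded: false }
    }

    /// Answered purely from catalog info; never touches the inner table.
    pub fn schema(&self) -> &CatalogSchema {
        &self.schema
    }

    /// Resolves the inner table on first use, then delegates the scan to it.
    pub fn scan(&mut self) -> Vec<String> {
        if !self.loaded {
            self.inner.load();
            self.loaded = true;
        }
        self.inner.scan()
    }
}

/// Fake inner table that counts how often it had to be loaded.
pub struct FakeDelta {
    pub loads: usize,
}

impl InnerTable for FakeDelta {
    fn load(&mut self) {
        self.loads += 1;
    }

    fn scan(&self) -> Vec<String> {
        vec!["batch".to_string()]
    }
}
```

With something along these lines, an `information_schema.columns` query would only ever call `schema()` and so would not trigger any log replay. The flip side is the double tracking mentioned above: nothing in this sketch checks that the catalog schema still agrees with what the inner table would report.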
There's also a minor matter of format; currently we store the fields using the unofficial arrow JSON representation, while our storage layer has its own schema/field types. There's also a possibility we'll want to introduce our own field format (to facilitate better compatibility with Postgres?), so wrapping the Delta table in that case would make even more sense.
I've also come to realize that the unofficial JSON representation is probably not robust/forward-compatible enough, and we should probably just migrate to `serde::Serialize`/`Deserialize` for the `Schema`/`Field`, which is not equivalent: apache/arrow-rs#2876