fdw: Add foreign data wrapper for parquet files (read-only)
mfussenegger opened this issue · comments
Problem Statement
Keeping all data forever in CrateDB can get expensive.
Deleting the data isn't an option, because it might still be needed, but it is not queried often.
Having a cheaper storage option at the expense of query performance would be nice.
Possible Solutions
Make it possible to query Parquet files hosted on S3 via a foreign data wrapper/foreign table.
This is similar to: #15718
Downsides:
- Needs some way to export the data to Parquet (#15772)
- Optimizations and query performance may be limited by the file format, since we can't make changes to it. See also https://blog.lancedb.com/lance-v2/
Advantages:
- Stable file format
- File layout is optimized for fetching a subset of columns. S3 Range requests make it possible to take advantage of this.
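To illustrate the Range request point: a Parquet file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`, so a reader can fetch the last 8 bytes, then the footer, and only then the column chunks it needs. The sketch below only computes the `Range` header values for the first two steps. The class and method names are hypothetical and not part of CrateDB, and it assumes an unencrypted footer.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not CrateDB code: derives HTTP Range header values
// for locating the Parquet footer without downloading the whole object.
public final class ParquetRangeSketch {

    // Trailer = 4-byte little-endian footer length + 4-byte magic "PAR1".
    static final int TRAILER_SIZE = 8;
    static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    // Range header value for the inclusive span [offset, offset + length - 1].
    static String rangeHeader(long offset, long length) {
        if (offset < 0 || length <= 0) {
            throw new IllegalArgumentException("offset must be >= 0 and length > 0");
        }
        return "bytes=" + offset + "-" + (offset + length - 1);
    }

    // Range covering the last 8 bytes of a file of the given size.
    static String trailerRange(long fileSize) {
        return rangeHeader(fileSize - TRAILER_SIZE, TRAILER_SIZE);
    }

    // Extracts the footer length from the 8 trailer bytes, checking the magic.
    static int footerLength(byte[] trailer) {
        if (trailer.length != TRAILER_SIZE
            || !Arrays.equals(Arrays.copyOfRange(trailer, 4, 8), MAGIC)) {
            throw new IllegalArgumentException("not a Parquet trailer");
        }
        return ByteBuffer.wrap(trailer, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    // Range covering the footer metadata, which sits directly before the trailer.
    static String footerRange(long fileSize, int footerLength) {
        return rangeHeader(fileSize - TRAILER_SIZE - footerLength, footerLength);
    }
}
```

The footer metadata lists the offset and size of each column chunk, so the same `rangeHeader` helper could then be used to request only the chunks for the selected columns.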
Considered Alternatives
Technical constraints
- Should be implemented outside the server module, and the hardcoded map in ForeignDataWrappers should be replaced with a service loader. This relates to #15720, except that it can be completely undocumented/private, similar to other service loaders such as the one for Functions
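A minimal sketch of what replacing a hardcoded map with a service loader could look like, using the JDK's `java.util.ServiceLoader`. The `FdwProvider` interface and `FdwRegistrySketch` class are hypothetical names for illustration only; the actual interface in CrateDB may look quite different.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.ServiceLoader;

// Hypothetical sketch, not CrateDB code: builds the name -> wrapper map
// from providers discovered at runtime instead of a hardcoded map.
public final class FdwRegistrySketch {

    // Assumed SPI; a module outside the server would ship an implementation
    // and register it via a META-INF/services entry for this interface.
    public interface FdwProvider {
        String name();
    }

    // Indexes the given providers by name and rejects duplicate names.
    static Map<String, FdwProvider> collect(Iterable<? extends FdwProvider> providers) {
        Map<String, FdwProvider> byName = new HashMap<>();
        for (FdwProvider provider : providers) {
            if (byName.putIfAbsent(provider.name(), provider) != null) {
                throw new IllegalStateException("Duplicate wrapper name: " + provider.name());
            }
        }
        return Map.copyOf(byName);
    }

    // Discovers providers on the classpath; yields an empty map if none are registered.
    static Map<String, FdwProvider> discover() {
        return collect(ServiceLoader.load(FdwProvider.class));
    }
}
```

Keeping `collect` separate from `discover` makes the lookup logic testable without registering any providers.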
Open questions
- Distributed execution/reads
Initial Scope (Estimate is only for this part)
- Simple but slow version: no caching of downloaded data; always reads from remote as needed; no attempts at operation push-down or partial result retrieval
Follow up (for later dedicated issues, not included in the first implementation)
- Download only the required data to minimize traffic, and maybe add a cache for the remote data to avoid repeated downloads