crate / crate

CrateDB is a distributed and scalable SQL database for storing and analyzing massive amounts of data in near real-time, even with complex queries. It is PostgreSQL-compatible, and based on Lucene.

Home Page:https://cratedb.com/product

fdw: Add foreign data wrapper for parquet files (read-only)

mfussenegger opened this issue

Problem Statement

Keeping all data forever in CrateDB can get expensive.
Deleting the data isn't an option because it might still be needed, even though it is rarely queried.

Having an option to use cheaper storage at the expense of query performance would be nice.

Possible Solutions

Make it possible to query Parquet files hosted on S3 via a foreign data wrapper/foreign table.

This is similar to: #15718

Downsides:

  • Needs some way to export the data to Parquet (#15772)
  • Optimizations and query performance may be limited by the file format, because we can't make changes to it. See also https://blog.lancedb.com/lance-v2/

Advantages:

  • Stable file format
  • File layout is optimized for fetching a subset of columns. S3 Range requests make it possible to take advantage of this.
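
To illustrate the Range request point: a Parquet file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`, so a reader can fetch the last 8 bytes first and then request only the footer metadata. The sketch below only computes the `Range` header values and does no I/O. The class and method names are made up for illustration and are not part of CrateDB, and it assumes an unencrypted footer.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not CrateDB code: computes HTTP Range header values
// for reading the footer of a Parquet file without downloading the whole file.
final class ParquetFooterRange {

    // The file tail is: 4-byte little-endian footer length + 4-byte magic "PAR1".
    static final int TAIL_LENGTH = 8;

    // Suffix range that asks for the last 8 bytes of the object.
    static String tailRangeHeader() {
        return "bytes=-" + TAIL_LENGTH;
    }

    // Decodes the footer length from the 8-byte tail and validates the magic.
    static int footerLength(byte[] tail) {
        if (tail.length != TAIL_LENGTH) {
            throw new IllegalArgumentException("Expected " + TAIL_LENGTH + " bytes, got " + tail.length);
        }
        String magic = new String(tail, 4, 4, StandardCharsets.US_ASCII);
        if (!"PAR1".equals(magic)) {
            throw new IllegalArgumentException("Not an unencrypted Parquet footer, magic was: " + magic);
        }
        int length = ByteBuffer.wrap(tail, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
        if (length < 0) {
            throw new IllegalArgumentException("Negative footer length: " + length);
        }
        return length;
    }

    // Inclusive byte range covering only the footer metadata.
    static String footerRangeHeader(long fileSize, int footerLength) {
        long end = fileSize - TAIL_LENGTH - 1;
        long start = end - footerLength + 1;
        if (start < 0) {
            throw new IllegalArgumentException("Footer length exceeds file size");
        }
        return "bytes=" + start + "-" + end;
    }
}
```

With the footer decoded, the column chunk offsets it contains could be turned into further range requests in the same way, so only the requested columns are transferred.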

Considered Alternatives

Technical constraints

  • Should be implemented outside the server module. The hardcoded map in ForeignDataWrappers should be replaced with a service loader. This relates to #15720, except that it can be completely undocumented/private, similar to other service loaders such as the one for Functions
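
A rough sketch of what replacing a hardcoded map with `java.util.ServiceLoader` could look like. `FdwProvider` and `FdwRegistry` are hypothetical names chosen for this example; the real interface and registration point in CrateDB may look different.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.ServiceLoader;

// Hypothetical SPI: each module ships an implementation and lists it in
// META-INF/services/<fully qualified interface name>.
interface FdwProvider {
    String name();
}

// Hypothetical registry that replaces a hardcoded name -> wrapper map.
final class FdwRegistry {

    private final Map<String, FdwProvider> byName = new LinkedHashMap<>();

    FdwRegistry(Iterable<? extends FdwProvider> providers) {
        for (FdwProvider provider : providers) {
            FdwProvider previous = byName.putIfAbsent(provider.name(), provider);
            if (previous != null) {
                throw new IllegalStateException("Duplicate foreign data wrapper: " + provider.name());
            }
        }
    }

    // Discovers all providers visible on the class path or module path.
    static FdwRegistry fromServiceLoader() {
        return new FdwRegistry(ServiceLoader.load(FdwProvider.class));
    }

    boolean contains(String name) {
        return byName.containsKey(name);
    }

    FdwProvider get(String name) {
        FdwProvider provider = byName.get(name);
        if (provider == null) {
            throw new IllegalArgumentException("Unknown foreign data wrapper: " + name);
        }
        return provider;
    }
}
```

Taking an `Iterable` in the constructor keeps the registry testable without any `META-INF/services` entries, while `fromServiceLoader()` is the path a server would use at startup.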

Open questions

  • Distributed execution/reads

Initial Scope (Estimate is only for this part)

  • Simple but slow version: no caching of downloaded data; always reads from remote as needed; no attempts at operation push-down or partial result retrieval

Follow up (for later dedicated issues, not included in the first implementation)

  • Download only the required data to minimize traffic, and maybe add a cache for the remote data to avoid repeated downloads
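
One possible shape for such a cache, as a minimal sketch: an in-memory LRU keyed by an opaque string (for example object path plus byte range) that calls a fetcher only on a miss. `RangeCache` and the key format are assumptions for illustration; a real implementation would likely bound the cache by bytes rather than by entry count and might spill to local disk.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical LRU cache for downloaded byte ranges, not CrateDB code.
final class RangeCache {

    private final Function<String, byte[]> fetcher;
    private final LinkedHashMap<String, byte[]> entries;

    RangeCache(int maxEntries, Function<String, byte[]> fetcher) {
        this.fetcher = fetcher;
        // accessOrder=true makes iteration order least-recently-used first,
        // so the "eldest" entry is the one to evict.
        this.entries = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Returns the cached bytes for the key, downloading them on a miss.
    synchronized byte[] get(String key) {
        byte[] cached = entries.get(key);
        if (cached == null) {
            cached = fetcher.apply(key);
            entries.put(key, cached);
        }
        return cached;
    }

    synchronized int size() {
        return entries.size();
    }
}
```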