Query trillion-row datasets in object storage for a few cents using ClickHouse.
We aim to test the cost efficiency and performance of ClickHouse when querying files in object storage.

To this end, this repository contains Pulumi code to deploy a ClickHouse cluster of a specified instance type on a cloud provider, run a configured query against object storage, and shut the cluster down. The objective is to keep the total cost of this workflow as low as possible. In most cases (assuming pricing scales linearly with time), lower cost should also mean faster queries.
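As a rough illustration of the cost model behind this objective, the compute cost of a transient cluster run is approximately spot price × node count × runtime. The helper below and its prices are hypothetical placeholders, not actual AWS quotes or code from this repository:

```python
def estimate_query_cost(spot_price_per_hour: float, node_count: int,
                        runtime_seconds: float) -> float:
    """Estimate the compute cost of one transient cluster run.

    Assumes per-second billing and a uniform spot price across nodes.
    """
    return spot_price_per_hour * node_count * runtime_seconds / 3600

# Hypothetical example: 10 nodes at $0.50/hr, query completes in 5 minutes.
cost = estimate_query_cost(0.50, 10, 300)
print(f"${cost:.2f}")  # → $0.42
```

This also shows why faster queries usually mean cheaper queries: halving the runtime halves the bill, so optimizing for cost and optimizing for speed largely coincide.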
The approach can differ for each cloud provider. Currently supported:

- AWS - configurable spot instances.
Queries should not require data to be loaded into ClickHouse, i.e., they should read data in object storage via functions such as the s3Cluster function. The query is configurable per provider.
The dataset is available at s3://coiled-datasets-rp/1trc (requester pays). It can be queried as shown below:
```sql
SELECT *
FROM s3Cluster(
    'default',
    'https://coiled-datasets-rp.s3.us-east-1.amazonaws.com/1trc/measurements-*.parquet',
    '<AWS_ACCESS_KEY_ID>',
    '<AWS_SECRET_ACCESS_KEY>',
    headers('x-amz-request-payer' = 'requester')
)
```
To avoid data transfer costs, ensure you query from us-east-1.
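A minimal sketch of assembling this query from Python, e.g. to submit it via a ClickHouse client library. The helper function and its parameters are illustrative, not part of this repository; credentials are read from the environment rather than hard-coded:

```python
import os


def build_s3cluster_query(cluster: str, url_glob: str) -> str:
    """Build an s3Cluster query against a requester-pays S3 bucket.

    AWS credentials are taken from the standard environment variables.
    """
    key = os.environ.get("AWS_ACCESS_KEY_ID", "")
    secret = os.environ.get("AWS_SECRET_ACCESS_KEY", "")
    return (
        f"SELECT * FROM s3Cluster('{cluster}', '{url_glob}', "
        f"'{key}', '{secret}', "
        "headers('x-amz-request-payer' = 'requester'))"
    )


query = build_s3cluster_query(
    "default",
    "https://coiled-datasets-rp.s3.us-east-1.amazonaws.com/1trc/measurements-*.parquet",
)
```

The resulting string can then be passed to any ClickHouse client; keeping credentials out of the SQL text makes it safe to log the query template.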
For an example, see ClickHouse and The One Trillion Row Challenge, which queries 1 trillion rows in S3 for $0.56.
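To put that figure in perspective, $0.56 for 10^12 rows works out to well under a tenth of a cent per billion rows:

```python
total_cost = 0.56                    # USD, from the One Trillion Row Challenge post
total_rows = 1_000_000_000_000       # 1 trillion rows

cost_per_billion = total_cost / (total_rows / 1_000_000_000)
print(f"${cost_per_billion:.5f} per billion rows")  # → $0.00056 per billion rows
```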
The original work was inspired by https://github.com/coiled/1trc, which in turn was inspired by Gunnar Morling's One Billion Row Challenge.
Contributions to improve the code for a provider are welcome. This can include making providers more flexible or ensuring resources are deployed and destroyed faster.

For simplicity, we request that all orchestration code be written in Pulumi.