ArroyoSystems / arroyo

Distributed stream processing engine in Rust

Home Page: https://arroyo.dev

Implement Split Distinct Aggregation for COUNT DISTINCT

jacksonrnewhouse opened this issue

Currently, COUNT DISTINCT is computed along a single key, which can become very expensive as the number of distinct elements within that key grows. Flink instead buckets on a hash of the distinct value, so the work of tracking distinct elements is spread across buckets and can scale out: https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/dev/table/tuning/#split-distinct-aggregation. We should implement something similar.
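
To make the idea concrete, here is a minimal in-memory sketch of the two-phase approach in plain Rust. It is not Arroyo code and not how Flink implements it internally: `BUCKET_NUM`, `bucket_of`, `partial_distinct_counts`, and `merge_counts` are all hypothetical names, and the bucket count is arbitrary. It only illustrates why the split is correct: each distinct value hashes to exactly one bucket, so the per-bucket distinct counts can simply be summed.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashMap, HashSet};
use std::hash::{Hash, Hasher};

// Hypothetical bucket count, chosen arbitrarily for this sketch.
const BUCKET_NUM: u64 = 4;

// Assign a distinct value to a bucket based on its hash.
fn bucket_of<V: Hash>(value: &V) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish() % BUCKET_NUM
}

// Phase 1: count distinct values per (group key, bucket).
// In a distributed engine each (key, bucket) pair could be handled
// by a different worker, so no single task holds all distinct values.
fn partial_distinct_counts<K, V>(rows: &[(K, V)]) -> HashMap<(K, u64), u64>
where
    K: Hash + Eq + Clone,
    V: Hash + Eq + Clone,
{
    let mut seen: HashMap<(K, u64), HashSet<V>> = HashMap::new();
    for (key, value) in rows {
        seen.entry((key.clone(), bucket_of(value)))
            .or_default()
            .insert(value.clone());
    }
    seen.into_iter()
        .map(|(key_and_bucket, set)| (key_and_bucket, set.len() as u64))
        .collect()
}

// Phase 2: sum the partial counts per group key.
// A value lands in exactly one bucket, so the sum equals the
// true number of distinct values for that key.
fn merge_counts<K: Hash + Eq>(partials: HashMap<(K, u64), u64>) -> HashMap<K, u64> {
    let mut totals = HashMap::new();
    for ((key, _bucket), count) in partials {
        *totals.entry(key).or_insert(0) += count;
    }
    totals
}
```

Calling `merge_counts(partial_distinct_counts(&rows))` on a slice of `(key, value)` pairs yields the same result as a single-key COUNT DISTINCT, while phase 1 can be partitioned by `(key, bucket)` instead of by `key` alone. A streaming implementation would additionally need to handle state, windows, and retractions, which this sketch ignores.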