Implement Split Distinct Aggregation for COUNT DISTINCT

Question

jacksonrnewhouse opened this issue a year ago

Currently, COUNT DISTINCT is computed along a single key, which can become very expensive as the number of distinct elements within that key grows. Flink addresses this with split distinct aggregation (https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/dev/table/tuning/#split-distinct-aggregation): it adds a bucket derived from the hash of the distinct value to the grouping key, so the distinct elements for one key are spread across several partial aggregations whose counts are then summed. This lets the computation scale out. We should implement something similar.
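
To make the idea concrete, here is a minimal Python sketch of the two-level approach. It is not how this project or Flink implements it; `NUM_BUCKETS`, `bucket_of`, and `split_count_distinct` are hypothetical names chosen for illustration, and CRC32 stands in for whatever hash function a real engine would use. It also assumes the distinct values for a key share one type, so equal values always have the same `repr`.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count; a real system would make this configurable


def bucket_of(value, num_buckets=NUM_BUCKETS):
    """Map a distinct value to a bucket using a deterministic hash (CRC32 here)."""
    return zlib.crc32(repr(value).encode("utf-8")) % num_buckets


def split_count_distinct(rows, num_buckets=NUM_BUCKETS):
    """Count distinct values per key in two levels.

    rows is an iterable of (key, value) pairs.
    """
    # Level 1: group by (key, bucket). In a distributed engine each
    # (key, bucket) pair could be handled by a different task.
    partial = defaultdict(set)
    for key, value in rows:
        partial[(key, bucket_of(value, num_buckets))].add(value)

    # Level 2: group by key and sum the per-bucket distinct counts.
    # Each value lands in exactly one bucket, so nothing is counted twice.
    totals = defaultdict(int)
    for (key, _bucket), values in partial.items():
        totals[key] += len(values)
    return dict(totals)


rows = [("a", 1), ("a", 2), ("a", 1), ("b", 3), ("a", 5), ("b", 3)]
result = split_count_distinct(rows)
```

The sum in the second level is exact because a given value always hashes to the same bucket, so the per-bucket sets for one key are disjoint. The trade-off is an extra shuffle and more state entries, which pays off mainly when a few keys hold most of the distinct values.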