`COUNT(DISTINCT)` aggregator is very slow.

Question

cristianberneanu opened this issue 2 years ago

cristian=# \timing
Timing is on.
cristian=# CREATE TABLE test_large AS (
cristian(#   SELECT
cristian(#     i AS id, left(md5(random()::text), 4) AS t
cristian(#   FROM generate_series(1, 100000) series(i)
cristian(# );
SELECT 100000
Time: 245.086 ms
cristian=# SECURITY LABEL FOR pg_diffix ON TABLE test_large IS 'sensitive';
SECURITY LABEL
Time: 9.695 ms
cristian=# SECURITY LABEL FOR pg_diffix ON COLUMN test_large.id IS 'aid';
SECURITY LABEL
Time: 5.123 ms
cristian=# SELECT count(DISTINCT t) FROM test_large;
 count
-------
 51171
(1 row)

Time: 49.442 ms
cristian=# SET pg_diffix.session_access_level = 'publish_trusted';
SET
Time: 0.284 ms
cristian=# SELECT count(DISTINCT t) FROM test_large;
 count
-------
 51168
(1 row)

Time: 112256.866 ms (01:52.257)

Edon Gashi · Answer 1 · Tue Apr 05 2022 00:24:57 GMT+0800 (China Standard Time)

Also ignores cancel requests.

Cristian Berneanu · Answer 2 · Tue Apr 05 2022 00:31:15 GMT+0800 (China Standard Time)

> Also ignores cancel requests.

Nice find, but better to address it in a separate issue.

Edon Gashi · Answer 3 · Tue Apr 05 2022 00:37:13 GMT+0800 (China Standard Time)

Looks like 99% of the slowdown happens in the final aggregation step. The reference implementation is, afaik, quite fast, so we should mirror its behavior.
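
To see how the slowdown scales with input size, the reproduction from the report can be rerun on a smaller table. This is only a sketch: it assumes a fresh psql session with `pg_diffix` loaded, and the table name `test_small` and its row count of 10000 are made up for illustration. Every other command is taken from the original session.

```sql
\timing on

-- Hypothetical smaller table, built like test_large but with 10x fewer rows.
CREATE TABLE test_small AS (
  SELECT
    i AS id, left(md5(random()::text), 4) AS t
  FROM generate_series(1, 10000) series(i)
);
SECURITY LABEL FOR pg_diffix ON TABLE test_small IS 'sensitive';
SECURITY LABEL FOR pg_diffix ON COLUMN test_small.id IS 'aid';

-- Baseline: same query as in the report, before changing the access level.
SELECT count(DISTINCT t) FROM test_small;

-- Same query after switching to the access level that showed the slowdown.
SET pg_diffix.session_access_level = 'publish_trusted';
SELECT count(DISTINCT t) FROM test_small;
```

Comparing the two timings here against those for `test_large` should give a rough idea of whether the anonymizing path grows faster than linearly with the number of rows. It does not by itself show which phase of the aggregate is responsible.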