diffix / pg_diffix

Implementation of the Open Diffix anonymization mechanism for PostgreSQL.

Home Page: https://www.open-diffix.org

`COUNT(DISTINCT)` aggregator is very slow.

cristianberneanu opened this issue

cristian=# \timing
Timing is on.
cristian=# CREATE TABLE test_large AS (
cristian(#   SELECT
cristian(#     i AS id, left(md5(random()::text), 4) AS t
cristian(#   FROM generate_series(1, 100000) series(i)
cristian(# );
SELECT 100000
Time: 245.086 ms
cristian=# SECURITY LABEL FOR pg_diffix ON TABLE test_large IS 'sensitive';
SECURITY LABEL
Time: 9.695 ms
cristian=# SECURITY LABEL FOR pg_diffix ON COLUMN test_large.id IS 'aid';
SECURITY LABEL
Time: 5.123 ms
cristian=# SELECT count(DISTINCT t) FROM test_large;
 count
-------
 51171
(1 row)

Time: 49.442 ms
cristian=# SET pg_diffix.session_access_level = 'publish_trusted';
SET
Time: 0.284 ms
cristian=# SELECT count(DISTINCT t) FROM test_large;
 count
-------
 51168
(1 row)

Time: 112256.866 ms (01:52.257)

Also ignores cancel requests.

> Also ignores cancel requests.

Nice find, but better to address it in a separate issue.

Looks like 99% of the slowdown happens in the aggregate's final function. The reference implementation is, AFAIK, quite fast. We should mirror its behavior.