There is data skew in hash shuffle

Question

lgbo-ustc opened this issue a year ago

lgbo commented a year ago

Describe what's wrong

We ran an aggregation on high-cardinality keys and found data skew.
We noticed that the hash function handles NULLs differently in ClickHouse and Spark.
In Spark:

select hash(null);
+-------------+
| hash(NULL)  |
+-------------+
| 42          |
+-------------+
1 row selected (0.089 seconds)

select hash(1,null);
+----------------+
| hash(1, NULL)  |
+----------------+
| -559580957     |
+----------------+
1 row selected (0.129 seconds)

In ClickHouse:

SELECT cityHash64(1, NULL)


┌─cityHash64(1, NULL)─┐
│ ᴺᵁᴸᴸ                │
└─────────────────────┘

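When the hash keys contain NULLs, the ClickHouse hash is NULL regardless of the other key values, so every such row is sent to the same shuffle partition. This easily causes data skew.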
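
To make the effect concrete, here is a small Python sketch that compares the two behaviours. It is a toy model only: `zlib.crc32` stands in for the real hash functions, `NUM_PARTITIONS` and the rule that a NULL hash goes to partition 0 are assumptions made for illustration, and the seed of 42 is chosen only to match the `hash(NULL)` output above. It does not reproduce the actual cityHash64 or Spark hash values.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 8  # hypothetical number of shuffle partitions


def null_propagating_hash(*keys):
    """Toy model of the ClickHouse output above: any NULL key makes the hash NULL."""
    if any(k is None for k in keys):
        return None
    return zlib.crc32(repr(keys).encode())


def null_skipping_hash(*keys, seed=42):
    """Toy model of the Spark output above: a NULL key leaves the running hash unchanged."""
    h = seed
    for k in keys:
        if k is None:
            continue  # skip NULL, keep the hash accumulated so far
        h = zlib.crc32(repr(k).encode(), h)
    return h


def partition_of(h):
    # Assumption: a NULL hash is routed to partition 0.
    return 0 if h is None else h % NUM_PARTITIONS


def partition_sizes(hash_fn, rows):
    """Count how many rows land in each partition under the given hash."""
    return Counter(partition_of(hash_fn(*row)) for row in rows)


# High-cardinality first key, second key always NULL.
rows = [(i, None) for i in range(10_000)]

skewed = partition_sizes(null_propagating_hash, rows)
spread = partition_sizes(null_skipping_hash, rows)

print("NULL-propagating hash:", dict(skewed))
print("NULL-skipping hash:   ", dict(spread))
```

With the NULL-propagating hash, all rows end up in a single partition even though the first key is different in every row. With the NULL-skipping hash, the non-NULL key still contributes to the hash, so the rows are spread over several partitions.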

How to reproduce

Run an aggregation query on high-cardinality keys where the keys contain NULLs.
