There is data skew in hash shuffle
lgbo-ustc opened this issue · comments
You have to provide the following information whenever possible.
Describe what's wrong
A clear and concise description of what works not as it is supposed to.
A link to reproducer in https://fiddle.clickhouse.com/.
We run a aggregation on high cardinality keys, and found data skew.
We notice that the hash
function's behavior is different in Clickhouse
and Spark
in dealing with nulls.
In spark
select hash(null);
+-------------+
| hash(NULL) |
+-------------+
| 42 |
+-------------+
1 row selected (0.089 seconds)
select hash(1,null);
+----------------+
| hash(1, NULL) |
+----------------+
| -559580957 |
+----------------+
1 row selected (0.129 seconds)
In clickhouse
SELECT cityHash64(1, NULL)
┌─cityHash64(1, NULL)─┐
│ ᴺᵁᴸᴸ │
└─────────────────────┘
When the hash keys have nulls, it will cause data skew easly.
Does it reproduce on recent release?
Enable crash reporting
If possible, change "enabled" to true in "send_crash_reports" section in
config.xml
:
<send_crash_reports>
<!-- Changing <enabled> to true allows sending crash reports to -->
<!-- the ClickHouse core developers team via Sentry https://sentry.io -->
<enabled>false</enabled>
How to reproduce
- Which ClickHouse server version to use
- Which interface to use, if matters
- Non-default settings, if any
CREATE TABLE
statements for all tables involved- Sample data for all these tables, use clickhouse-obfuscator if necessary
- Queries to run that lead to unexpected result
Run a aggregation query on on high cardinality keys, and the keys has nulls.
Expected behavior
A clear and concise description of what you expected to happen.
Error message and/or stacktrace
If applicable, add screenshots to help explain your problem.
Additional context
Add any other context about the problem here.