Kyligence / ClickHouse

ClickHouse® is a free analytics DBMS for big data

Home Page:https://clickhouse.com

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

There is data skew in hash shuffle

lgbo-ustc opened this issue · comments

commented

You have to provide the following information whenever possible.

Describe what's wrong

A clear and concise description of what works not as it is supposed to.

A link to reproducer in https://fiddle.clickhouse.com/.

We run a aggregation on high cardinality keys, and found data skew.
We notice that the hash function's behavior is different in Clickhouse and Spark in dealing with nulls.
In spark

select hash(null);
+-------------+
| hash(NULL)  |
+-------------+
| 42          |
+-------------+
1 row selected (0.089 seconds)

select hash(1,null);
+----------------+
| hash(1, NULL)  |
+----------------+
| -559580957     |
+----------------+
1 row selected (0.129 seconds)

In clickhouse

SELECT cityHash64(1, NULL)


┌─cityHash64(1, NULL)─┐
│ ᴺᵁᴸᴸ                │
└─────────────────────┘

When the hash keys have nulls, it will cause data skew easly.

Does it reproduce on recent release?

The list of releases

Enable crash reporting

If possible, change "enabled" to true in "send_crash_reports" section in config.xml:

<send_crash_reports>
        <!-- Changing <enabled> to true allows sending crash reports to -->
        <!-- the ClickHouse core developers team via Sentry https://sentry.io -->
        <enabled>false</enabled>

How to reproduce

  • Which ClickHouse server version to use
  • Which interface to use, if matters
  • Non-default settings, if any
  • CREATE TABLE statements for all tables involved
  • Sample data for all these tables, use clickhouse-obfuscator if necessary
  • Queries to run that lead to unexpected result

Run a aggregation query on on high cardinality keys, and the keys has nulls.

Expected behavior

A clear and concise description of what you expected to happen.

Error message and/or stacktrace

If applicable, add screenshots to help explain your problem.

Additional context

Add any other context about the problem here.