dolthub / dolt

Dolt – Git for Data

have default collation be `utf8mb4_0900_ai_ci`

jycor opened this issue

By default, MySQL's collation is utf8mb4_0900_ai_ci:
https://dev.mysql.com/doc/refman/8.0/en/charset.html#:~:text=The%20default%20MySQL%20server%20character%20set%20and%20collation%20are%20utf8mb4%20and%20utf8mb4_0900_ai_ci

Dolt has it set as utf8mb4_0900_bin.

As a result, to get the same behavior as MySQL out of the box, we have to set the collation through @@collation_connection (or @@persist.collation_connection if you want it to stick across dolt sql and dolt sql -q sessions).

tmp/main*> select @@collation_connection;
+------------------------+
| @@collation_connection |
+------------------------+
| utf8mb4_0900_bin       |
+------------------------+
1 row in set (0.00 sec)

tmp/main*> select 'abc' like 'ABC';
+------------------+
| 'abc' like 'ABC' |
+------------------+
| false            |
+------------------+
1 row in set (0.00 sec)

tmp/main*> set @@collation_connection = utf8mb4_0900_ai_ci;
tmp/main*> select 'abc' like 'ABC';
+------------------+
| 'abc' like 'ABC' |
+------------------+
| true             |
+------------------+
1 row in set (0.00 sec)

Additionally, MySQL's LIKE operator is case-insensitive by default; it only becomes case-sensitive when one of its operands has a case-sensitive or binary collation:
https://stackoverflow.com/questions/14007450/how-do-you-force-mysql-like-to-be-case-sensitive

related: #7851

I believe this is a deliberate decision that we made for performance reasons / compatibility with older versions of Dolt.

@Hydrocharged did this for a reason. Please elaborate, Daylon.

Responded to James and forgot to write it here too.

For utf8mb4_0900_bin, it's both legacy and performance. Go's default string handling behaves exactly like utf8mb4_0900_bin, so before we had collation support we put that collation everywhere, since it was technically correct.

Now, though, it's for performance: we don't have to do anything special for those strings, whereas every other collation needs special handling. IIRC the _ai_ci collations are pretty heavy, so we avoid them as defaults when we can. In addition, the change would mean that importing the same table in a newer version would create a different table, which is its own issue.