hw202207 / poc_icu_comparison

Table of Contents

  1. The Problem
  2. Troubleshooting

The Problem

The following query yields different results on different operating systems. After some searching, it looks like this is related to the locale/collation settings.

SELECT ARRAY(SELECT UNNEST(ARRAY['id', 'organization_id', 'locked_by', 'lock_reasons', 'lock_note', 'lock_source', 'sent_notification_emails', 'created_at']) ORDER BY 1);
  • with collation "C", the result is

    {created_at,id,lock_note,lock_reasons,lock_source,locked_by,organization_id,sent_notification_emails}
    
  • with collation "UTF-8", the result is

    {created_at,id,locked_by,lock_note,lock_reasons,lock_source,organization_id,sent_notification_emails}
    

Troubleshooting

The difference comes down to whether _ sorts before e. One would expect it to, since their ASCII codes are 95 and 101 respectively.
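As a quick check of the byte-order reasoning above, here is a minimal C sketch that sorts string pointers with strcmp, which compares raw bytes in the same way the "C" collation does. The names compare_bytewise and sort_bytewise are made up for illustration and are not taken from icu_string_comparison.c.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator: orders two C strings by their raw byte values,
   which is the ordering the "C" collation uses. */
static int compare_bytewise(const void *a, const void *b)
{
    const char *const *left = a;
    const char *const *right = b;
    return strcmp(*left, *right);
}

/* Sorts an array of string pointers in place using byte order.
   Hypothetical helper, for illustration only. */
void sort_bytewise(const char **names, size_t count)
{
    qsort(names, count, sizeof names[0], compare_bytewise);
}
```

Calling sort_bytewise on the eight column names from the query puts lock_note, lock_reasons and lock_source ahead of locked_by, the same order as the "C" result, because the byte for _ (95) is lower than the byte for e (101).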

However, the following queries do not always return True: they return True on some operating systems and False on others.

SELECT '_' COLLATE "en_US" < 'e' COLLATE "en_US";

SELECT 'lock_note' COLLATE "en_US" < 'locked_by' COLLATE "en_US";

From some searching, my general impression is that when a locale is enabled, postgres sorts using the locale rules provided by the operating system, and those rules may vary across operating systems. That would explain why the queries above give inconsistent results across operating systems.
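If the collation really is backed by the operating system's C library, the same comparison can be tried outside postgres with setlocale and strcoll. The sketch below is only a way to probe that assumption. locale_less is a hypothetical helper, and a locale name such as en_US.UTF-8 is an assumption too, since the installed names differ between systems.

```c
#include <locale.h>
#include <string.h>

/* Returns 1 if left sorts before right under the named locale,
   0 if it does not, and -1 if the locale cannot be selected.
   Note: setlocale changes the collation locale for the whole process.
   Hypothetical helper, for illustration only. */
int locale_less(const char *locale_name, const char *left, const char *right)
{
    if (setlocale(LC_COLLATE, locale_name) == NULL) {
        return -1; /* locale not available on this system */
    }
    return strcoll(left, right) < 0 ? 1 : 0;
}
```

locale_less("C", "lock_note", "locked_by") returns 1, since strcoll in the "C" locale behaves like strcmp. locale_less("en_US.UTF-8", "lock_note", "locked_by") may return 0 or 1 depending on the locale data shipped with the operating system, which is the same inconsistency the queries show.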

What I actually want is the same sort order in postgres and in my application code (both run on the same OS). A very quick skim of the postgres source code gave me the impression that postgres uses ucol_strcollUTF8 for UTF-8 sorting. So I assumed that if my application code calls ucol_strcollUTF8, it should give the same result as postgres. I wrote a POC (see details in icu_string_comparison.c) and got the following result, which is not the same as postgres.

"lock_note" is smaller than "locked_by"

Quick recap

Code | Result
SELECT 'lock_note' COLLATE "en_US" < 'locked_by' COLLATE "en_US" | False
icu_string_comparison.c: 'lock_note' < 'locked_by' | True

So I have probably missed something: sorting in postgres is likely not as simple as calling ucol_strcollUTF8, and something else is involved.
