composewell / unicode-transforms

Fast Unicode normalization in Haskell

Further optimize the compose code

harendra-kumar opened this issue · comments

The decompose code is well optimized, but compose still has a lot of scope for optimization. Though its performance is close to that of utf8proc, which we were using earlier, it is still far from ICU's compose performance.

Currently we are not using the quick check properties of the Unicode database; we can explore using them to speed up the case when the string is already in composed, or almost composed, form. Some related links:

Just random thoughts:

  1. Even if we perform full decomposition + full composition, I would expect NFC times to be roughly 2x those of NFD. But they are more than 5x for some languages (Deutsch, English), which hopefully hints at a missed optimization opportunity.

  2. My understanding is that faster composition algorithms require more flexible access to the text, which is incompatible with the streaming approach. It sounds like they involve speculative writes with a possibility of backtracking once we encounter a combining (or composable) character.

I think we can go a long way even with full decomposition and composition. I wrote composeChar just to complete the library, without any thought to performance; I copied the code from decomposition and somehow made it work.

  1. The recursive call to composeChar in decomposeAll is pretty expensive; we can possibly remove it. It is there only to go through the Hangul path in case the next char is Hangul (a sketch of the alternative follows below). You can try this to find the impact:
                _ -> do
                    -- XXX this recursive call here hurts performance
                    -- We can make the hangul composition a separate function
                    -- and call that or reorder here based on the type of char
                    -- (i', st', rbuf', jb') <- composeChar mode arr i st rbuf jb x
                    (i', st', rbuf') <- reorder marr i st rbuf ch
                    decomposeAll arr i' st' rbuf' jb xs
  2. The reorder function is perhaps not ideal; it can possibly be made much faster. You can find the cost of just the decomposition part of composeChar by defining composePair as:
    composePair c _ = Just c

It is pretty expensive even with this null definition of composePair.
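
To make point 1 concrete, here is a small self-contained sketch of deciding the path from the character class up front instead of recursing through composeChar. The names and types here are illustrative only, not the library's actual internals:

    import Data.Char (ord)

    -- Hangul ranges relevant to composition: the conjoining jamo block
    -- and the precomposed syllable block.
    isHangul :: Char -> Bool
    isHangul c =
        let n = ord c
        in (n >= 0x1100 && n <= 0x11FF)   -- conjoining jamo
        || (n >= 0xAC00 && n <= 0xD7A3)   -- precomposed syllables

    data Path = HangulPath | ReorderPath deriving Show

    -- Stand-in for the branch inside decomposeAll: pick the path by
    -- character class instead of re-entering composeChar.
    pickPath :: Char -> Path
    pickPath c
        | isHangul c = HangulPath
        | otherwise  = ReorderPath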

I guess quick check based normalization can be performed as we are writing to an array. It should roughly go like this: we look up the current char in the quick check table; if the character is NO or MAYBE, or is in the composition exclusion table, then we backtrack to the last starter. We have access to the whole array to which we have already written the previous chars, so we should be able to backtrack in that array easily.
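
A minimal sketch of that loop, assuming hypothetical helpers quickCheckNFC, combiningClass and slowNormalize (none of these names are in the library). It copies characters through while the quick check says YES and falls back to the slow path from the last starter otherwise:

    data QC = QCYes | QCNo | QCMaybe deriving (Eq, Show)

    -- Hypothetical lookups; in the real library these would be tables
    -- generated from the UCD. A real check would also consult the
    -- composition exclusion table.
    quickCheckNFC :: Char -> QC
    quickCheckNFC _ = QCYes              -- placeholder

    combiningClass :: Char -> Int
    combiningClass _ = 0                 -- placeholder

    isStarter :: Char -> Bool
    isStarter = (== 0) . combiningClass

    -- Placeholder for the existing full decompose-then-compose path.
    slowNormalize :: String -> String
    slowNormalize = id                   -- placeholder

    -- Copy characters while the quick check says YES; on NO or MAYBE
    -- back up to the last starter in the output and re-normalize that
    -- segment with the slow path.
    normalizeQC :: String -> String
    normalizeQC = go [] 0 0
      where
        -- out is the output accumulated in reverse, n its length,
        -- lastStarter the index of the most recent starter in it.
        go out _lastStarter _n [] = reverse out
        go out lastStarter n (c:cs)
            | quickCheckNFC c == QCYes =
                let lastStarter' = if isStarter c then n else lastStarter
                in go (c : out) lastStarter' (n + 1) cs
            | otherwise =
                -- out is reversed, so the chars written after the last
                -- starter are its first (n - lastStarter) elements.
                let (pending, keep) = splitAt (n - lastStarter) out
                    fixed = slowNormalize (reverse pending ++ [c])
                in go (reverse fixed ++ keep) lastStarter
                      (lastStarter + length fixed) cs

In the library the output would be a mutable array rather than a list, so backtracking is just rewinding the write index to the last starter.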

We will have to generate the quick check properties file from the UCD in the ucd2haskell script. The data is in http://www.unicode.org/Public/UCD/latest/ucd/DerivedNormalizationProps.txt .
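
A rough sketch of extracting the NFC_QC property in a ucd2haskell-style generator (parseNFCQC and Range are hypothetical names). Lines in DerivedNormalizationProps.txt look like `0340..0341 ; NFC_QC; N # ...`:

    import Data.Char (isSpace)
    import Data.List (isInfixOf)
    import Numeric (readHex)

    -- One entry: lo, hi, quick-check value ("N" or "M").
    data Range = Range Int Int String deriving Show

    parseNFCQC :: String -> [Range]
    parseNFCQC = concatMap parseLine . lines
      where
        parseLine ln
            | "NFC_QC" `isInfixOf` ln =
                case map trim (splitOn ';' (takeWhile (/= '#') ln)) of
                    (codes : _prop : val : _) -> [mkRange codes val]
                    _                         -> []
            | otherwise = []

        -- A field is either a single code point or a lo..hi range.
        mkRange codes val = case break (== '.') codes of
            (lo, '.' : '.' : hi) -> Range (hex lo) (hex hi) val
            (lo, _)              -> Range (hex lo) (hex lo) val

        hex s = case readHex s of
            [(n, "")] -> n
            _         -> 0

        trim = f . f where f = reverse . dropWhile isSpace

        splitOn c s = case break (== c) s of
            (a, [])    -> [a]
            (a, _ : b) -> a : splitOn c b

The generator would then emit these ranges as a Haskell lookup table alongside the existing decomposition tables.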

In addition to the links in the issue description above, the normalization forms doc (UAX #15) has a lot of useful information, especially these sections:

Another thing: composePair is very inefficient because the patterns are not sorted, so matching becomes a linear search. We can sort the clauses on the first argument and then on the second, and then I hope GHC would do the right thing.
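
For illustration, a tiny hand-written fragment with clauses sorted on the first and then the second argument (the real table is generated from the UCD, and whether GHC actually emits a search tree for it is the open question here):

    composePair :: Char -> Char -> Maybe Char
    composePair '\x0041' '\x0300' = Just '\x00C0'  -- A + grave -> À
    composePair '\x0041' '\x0301' = Just '\x00C1'  -- A + acute -> Á
    composePair '\x0045' '\x0300' = Just '\x00C8'  -- E + grave -> È
    composePair '\x0045' '\x0301' = Just '\x00C9'  -- E + acute -> É
    composePair '\x004F' '\x0300' = Just '\x00D2'  -- O + grave -> Ò
    composePair _        _        = Nothing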

I do not think that composePair is a linear search; profiling assigns only 2% of the runtime to it. My hypothesis is that it is already compiled to a binary search, which is fast per se but spoils the CPU caches.