Enhancement: Optimize generated classes with character data

Question

Enhancement: Optimize generated classes with character data

OptimumCode opened this issue 3 months ago · comments

Is there an existing issue for this?

I have searched the existing issues

Enhancement description

Currently generated classes with character data check if the codepoint belongs to this class by using some kind of binary search between the ranges this class has. This causes O(log(n)) complexity to find if the class contains the codepoint in the worst case.

I think this can be optimized in the following way:

We can divide ranges into sub-groups by applying a shift-right bit operation (codepoint >> 16). This will group ranges based on their higher bits. As a result, we will have smaller groups of ranges to apply our binary search and we will be able to filter out codepoints that cannot belong to the class by comparing the result of a shift-right operation with known values for this class.

fun contains(codepoint: Int): Boolean {
  if (codepoint < minCodepoint || codepoint > maxCodepoint) {
    return false
  }
  return when (codepoint shr 8) {
    in 0x0..0x2 -> // binary search in this group
    0x10 -> // binary search in this group
    else -> false
  }
}

Oleg Smirnov · Answer 1 · Thu Apr 18 2024 17:46:08 GMT+0800 (China Standard Time)

Any attempt to group ranges and keep the correctness did not give any performance benefits. Benchmarks show that the performance only gets worse (2-3 times worse).

I will close this issue for now. If new ideas on how to optimize the generated code come to my mind (or somebody else's) I will reopen this issue.