tree-sitter / tree-sitter-haskell

Haskell grammar for tree-sitter.

UnicodeSyntax support

maralorn opened this issue

I may be holding it wrong, but at least some unicode symbols are not supported as syntax:

e.g.:

processStateUpdater ∷
  ∀ a m.
  (NOMInput a, UpdateMonad m) ⇒
  Config →
  a →
  StateT (ProcessState a) m ([NOMError], ByteString)

gives me

(haskell [0, 0] - [5, 52]
  (top_splice [0, 0] - [5, 52]
    (exp_infix [0, 0] - [5, 52]
      (exp_apply [0, 0] - [1, 9]
        (exp_name [0, 0] - [0, 19]
          (variable [0, 0] - [0, 19]))
        (ERROR [0, 20] - [1, 5]
          (ERROR [0, 20] - [0, 23]))
        (exp_name [1, 6] - [1, 7]
          (variable [1, 6] - [1, 7]))
        (exp_name [1, 8] - [1, 9]
          (variable [1, 8] - [1, 9])))
      (operator [1, 9] - [1, 10])
      (exp_apply [2, 2] - [5, 52]
        (exp_tuple [2, 2] - [2, 29]
          (exp_apply [2, 3] - [2, 13]
            (exp_name [2, 3] - [2, 11]
              (constructor [2, 3] - [2, 11]))
            (exp_name [2, 12] - [2, 13]
              (variable [2, 12] - [2, 13])))
          (comma [2, 13] - [2, 14])
          (exp_apply [2, 15] - [2, 28]
            (exp_name [2, 15] - [2, 26]
              (constructor [2, 15] - [2, 26]))
            (exp_name [2, 27] - [2, 28]
              (variable [2, 27] - [2, 28]))))
        (ERROR [2, 30] - [2, 33]
          (ERROR [2, 30] - [2, 33]))
        (exp_name [3, 2] - [3, 8]
          (constructor [3, 2] - [3, 8]))
        (ERROR [3, 9] - [3, 12]
          (ERROR [3, 9] - [3, 12]))
        (exp_name [4, 2] - [4, 3]
          (variable [4, 2] - [4, 3]))
        (ERROR [4, 4] - [4, 7]
          (ERROR [4, 4] - [4, 7]))
        (exp_name [5, 2] - [5, 8]
          (constructor [5, 2] - [5, 8]))
        (exp_parens [5, 9] - [5, 25]
          (exp_apply [5, 10] - [5, 24]
            (exp_name [5, 10] - [5, 22]
              (constructor [5, 10] - [5, 22]))
            (exp_name [5, 23] - [5, 24]
              (variable [5, 23] - [5, 24]))))
        (exp_name [5, 26] - [5, 27]
          (variable [5, 26] - [5, 27]))
        (exp_tuple [5, 28] - [5, 52]
          (exp_list [5, 29] - [5, 39]
            (exp_name [5, 30] - [5, 38]
              (constructor [5, 30] - [5, 38])))
          (comma [5, 39] - [5, 40])
          (exp_name [5, 41] - [5, 51]
            (constructor [5, 41] - [5, 51])))))))

right! I added some basics now, but there are some more missing.

Thank you for the quick reaction. Yeah, those are probably the most important, nice.

Here is the list of all symbols, not so many are missing:

https://downloads.haskell.org/ghc/latest/docs/users_guide/exts/unicode_syntax.html

yep, already have that tab open 😉

Thanks in advance. I wish I could help solve and not merely report the issue. But I'm getting errors when I simply use unicode characters in/as identifiers.

Possibly helpful links:

Example below, and please, don't judge me for the quality of this code. It's my first Haskell program, and it's fit for a very specific purpose which is not production. (It mirrors a theoretical construction in my PhD thesis in systems theory.)

{-# LANGUAGE InstanceSigs  #-}
{-# LANGUAGE UnicodeSyntax #-}

module PrAlgebra where

import           Data.Fix (Fix (Fix), foldFix, unFix)

(▽) :: (a → c) → (b → c) → Either a b → c
(▽) = either

(△) :: (b → c) → (b → c') → b → (c, c')
(△) f g x = (f x, g x)

newtype 𝘗ᵣ hd tl = Pᵣ (Maybe (tl, hd))

instance Functor (𝘗ᵣ hd) where
  fmap :: (a → b) → 𝘗ᵣ hd a → 𝘗ᵣ hd b
  fmap f (Pᵣ Nothing)         = Pᵣ Nothing
  fmap f (Pᵣ (Just (tl, hd))) = Pᵣ (Just (f tl, hd))

type 𝘗ᵣAlgebra state value =  𝘗ᵣ value state → state

type Snoc hd = Fix(𝘗ᵣ hd)

snoc :: Snoc a → a → Snoc a
snoc xs x = Fix (Pᵣ (Just (xs, x)))

In terms of syntax highlighting, everything is coloured as a type. Here is a screenshot where a constructor is being called a type.
Screenshot from 2022-12-22 09-50-47

The tree is listed below.

pragma [0, 0] - [0, 30]
pragma [1, 0] - [1, 30]
module: module [3, 7] - [3, 16]
where [3, 17] - [3, 22]
ERROR [5, 0] - [25, 37]
  import [5, 0] - [5, 53]
    qualified_module [5, 17] - [5, 25]
      module [5, 17] - [5, 21]
      module [5, 22] - [5, 25]
    import_list [5, 26] - [5, 53]
      import_item [5, 27] - [5, 36]
        type [5, 27] - [5, 30]
        import_con_names [5, 31] - [5, 36]
          constructor [5, 32] - [5, 35]
      comma [5, 36] - [5, 37]
      import_item [5, 38] - [5, 45]
        variable [5, 38] - [5, 45]
      comma [5, 45] - [5, 46]
      import_item [5, 47] - [5, 52]
        variable [5, 47] - [5, 52]
  pat_literal [7, 0] - [7, 5]
    con_unit [7, 0] - [7, 5]
      ERROR [7, 1] - [7, 4]
        ERROR [7, 1] - [7, 4]
  type_parens [7, 9] - [7, 18]
    fun [7, 10] - [7, 17]
      type_name [7, 10] - [7, 11]
        type_variable [7, 10] - [7, 11]
      type_name [7, 16] - [7, 17]
        type_variable [7, 16] - [7, 17]
  type_parens [7, 23] - [7, 32]
    fun [7, 24] - [7, 31]
      type_name [7, 24] - [7, 25]
        type_variable [7, 24] - [7, 25]
      type_name [7, 30] - [7, 31]
        type_variable [7, 30] - [7, 31]
  type_apply [7, 37] - [7, 47]
    type_name [7, 37] - [7, 43]
      type [7, 37] - [7, 43]
    type_name [7, 44] - [7, 45]
      type_variable [7, 44] - [7, 45]
    type_name [7, 46] - [7, 47]
      type_variable [7, 46] - [7, 47]
  constraint [7, 52] - [25, 37]
    class: class_name [7, 52] - [7, 53]
      type_variable [7, 52] - [7, 53]
    type_literal [8, 0] - [8, 5]
      con_unit [8, 0] - [8, 5]
        ERROR [8, 1] - [8, 4]
          ERROR [8, 1] - [8, 4]
    ERROR [8, 6] - [8, 7]
    type_name [8, 8] - [8, 14]
      type_variable [8, 8] - [8, 14]
    type_literal [10, 0] - [10, 5]
      con_unit [10, 0] - [10, 5]
        ERROR [10, 1] - [10, 4]
          ERROR [10, 1] - [10, 4]
    ERROR [10, 6] - [10, 8]
    type_parens [10, 9] - [10, 18]
      fun [10, 10] - [10, 17]
        type_name [10, 10] - [10, 11]
          type_variable [10, 10] - [10, 11]
        type_name [10, 16] - [10, 17]
          type_variable [10, 16] - [10, 17]
    ERROR [10, 19] - [10, 22]
    type_parens [10, 23] - [10, 33]
      fun [10, 24] - [10, 32]
        type_name [10, 24] - [10, 25]
          type_variable [10, 24] - [10, 25]
        type_name [10, 30] - [10, 32]
          type_variable [10, 30] - [10, 32]
    ERROR [10, 34] - [10, 37]
    type_name [10, 38] - [10, 39]
      type_variable [10, 38] - [10, 39]
    ERROR [10, 40] - [10, 43]
    type_tuple [10, 44] - [10, 51]
      type_name [10, 45] - [10, 46]
        type_variable [10, 45] - [10, 46]
      comma [10, 46] - [10, 47]
      type_name [10, 48] - [10, 50]
        type_variable [10, 48] - [10, 50]
    type_literal [11, 0] - [11, 5]
      con_unit [11, 0] - [11, 5]
        ERROR [11, 1] - [11, 4]
          ERROR [11, 1] - [11, 4]
    type_name [11, 6] - [11, 7]
      type_variable [11, 6] - [11, 7]
    type_name [11, 8] - [11, 9]
      type_variable [11, 8] - [11, 9]
    type_name [11, 10] - [11, 11]
      type_variable [11, 10] - [11, 11]
    ERROR [11, 12] - [11, 13]
    type_tuple [11, 14] - [11, 24]
      type_apply [11, 15] - [11, 18]
        type_name [11, 15] - [11, 16]
          type_variable [11, 15] - [11, 16]
        type_name [11, 17] - [11, 18]
          type_variable [11, 17] - [11, 18]
      comma [11, 18] - [11, 19]
      type_apply [11, 20] - [11, 23]
        type_name [11, 20] - [11, 21]
          type_variable [11, 20] - [11, 21]
        type_name [11, 22] - [11, 23]
          type_variable [11, 22] - [11, 23]
    type_name [13, 0] - [13, 7]
      type_variable [13, 0] - [13, 7]
    ERROR [13, 8] - [13, 15]
      ERROR [13, 8] - [13, 15]
    type_name [13, 16] - [13, 18]
      type_variable [13, 16] - [13, 18]
    type_name [13, 19] - [13, 21]
      type_variable [13, 19] - [13, 21]
    ERROR [13, 22] - [13, 23]
    type_name [13, 24] - [13, 25]
      type [13, 24] - [13, 25]
    ERROR [13, 25] - [13, 28]
      ERROR [13, 25] - [13, 28]
    type_parens [13, 29] - [13, 45]
      type_apply [13, 30] - [13, 44]
        type_name [13, 30] - [13, 35]
          type [13, 30] - [13, 35]
        type_tuple [13, 36] - [13, 44]
          type_name [13, 37] - [13, 39]
            type_variable [13, 37] - [13, 39]
          comma [13, 39] - [13, 40]
          type_name [13, 41] - [13, 43]
            type_variable [13, 41] - [13, 43]
    type_name [15, 0] - [15, 8]
      type_variable [15, 0] - [15, 8]
    type_name [15, 9] - [15, 16]
      type [15, 9] - [15, 16]
    type_parens [15, 17] - [15, 29]
      ERROR [15, 18] - [15, 25]
        ERROR [15, 18] - [15, 25]
      type_name [15, 26] - [15, 28]
        type_variable [15, 26] - [15, 28]
    type_name [15, 30] - [15, 35]
      type_variable [15, 30] - [15, 35]
    type_name [16, 2] - [16, 6]
      type_variable [16, 2] - [16, 6]
    ERROR [16, 7] - [16, 9]
    type_parens [16, 10] - [16, 19]
      fun [16, 11] - [16, 18]
        type_name [16, 11] - [16, 12]
          type_variable [16, 11] - [16, 12]
        type_name [16, 17] - [16, 18]
          type_variable [16, 17] - [16, 18]
    ERROR [16, 20] - [16, 31]
      ERROR [16, 24] - [16, 31]
    type_name [16, 32] - [16, 34]
      type_variable [16, 32] - [16, 34]
    type_name [16, 35] - [16, 36]
      type_variable [16, 35] - [16, 36]
    ERROR [16, 37] - [16, 48]
      ERROR [16, 41] - [16, 48]
    type_name [16, 49] - [16, 51]
      type_variable [16, 49] - [16, 51]
    type_name [16, 52] - [16, 53]
      type_variable [16, 52] - [16, 53]
    type_name [17, 2] - [17, 6]
      type_variable [17, 2] - [17, 6]
    type_name [17, 7] - [17, 8]
      type_variable [17, 7] - [17, 8]
    type_parens [17, 9] - [17, 23]
      type_apply [17, 10] - [17, 22]
        type_name [17, 10] - [17, 11]
          type [17, 10] - [17, 11]
        ERROR [17, 11] - [17, 14]
          ERROR [17, 11] - [17, 14]
        type_name [17, 15] - [17, 22]
          type [17, 15] - [17, 22]
    ERROR [17, 32] - [17, 33]
    type_name [17, 34] - [17, 35]
      type [17, 34] - [17, 35]
    ERROR [17, 35] - [17, 38]
      ERROR [17, 35] - [17, 38]
    type_name [17, 39] - [17, 46]
      type [17, 39] - [17, 46]
    type_name [18, 2] - [18, 6]
      type_variable [18, 2] - [18, 6]
    type_name [18, 7] - [18, 8]
      type_variable [18, 7] - [18, 8]
    type_parens [18, 9] - [18, 31]
      type_apply [18, 10] - [18, 30]
        type_name [18, 10] - [18, 11]
          type [18, 10] - [18, 11]
        ERROR [18, 11] - [18, 14]
          ERROR [18, 11] - [18, 14]
        type_parens [18, 15] - [18, 30]
          type_apply [18, 16] - [18, 29]
            type_name [18, 16] - [18, 20]
              type [18, 16] - [18, 20]
            type_tuple [18, 21] - [18, 29]
              type_name [18, 22] - [18, 24]
                type_variable [18, 22] - [18, 24]
              comma [18, 24] - [18, 25]
              type_name [18, 26] - [18, 28]
                type_variable [18, 26] - [18, 28]
    ERROR [18, 32] - [18, 33]
    type_name [18, 34] - [18, 35]
      type [18, 34] - [18, 35]
    ERROR [18, 35] - [18, 38]
      ERROR [18, 35] - [18, 38]
    type_parens [18, 39] - [18, 56]
      type_apply [18, 40] - [18, 55]
        type_name [18, 40] - [18, 44]
          type [18, 40] - [18, 44]
        type_tuple [18, 45] - [18, 55]
          type_apply [18, 46] - [18, 50]
            type_name [18, 46] - [18, 47]
              type_variable [18, 46] - [18, 47]
            type_name [18, 48] - [18, 50]
              type_variable [18, 48] - [18, 50]
          comma [18, 50] - [18, 51]
          type_name [18, 52] - [18, 54]
            type_variable [18, 52] - [18, 54]
    type_name [20, 0] - [20, 4]
      type_variable [20, 0] - [20, 4]
    ERROR [20, 5] - [20, 12]
      ERROR [20, 5] - [20, 12]
    type_name [20, 12] - [20, 19]
      type [20, 12] - [20, 19]
    type_name [20, 20] - [20, 25]
      type_variable [20, 20] - [20, 25]
    type_name [20, 26] - [20, 31]
      type_variable [20, 26] - [20, 31]
    ERROR [20, 32] - [20, 42]
      ERROR [20, 35] - [20, 42]
    type_name [20, 43] - [20, 48]
      type_variable [20, 43] - [20, 48]
    type_name [20, 49] - [20, 54]
      type_variable [20, 49] - [20, 54]
    ERROR [20, 55] - [20, 58]
    type_name [20, 59] - [20, 64]
      type_variable [20, 59] - [20, 64]
    type_name [22, 0] - [22, 4]
      type_variable [22, 0] - [22, 4]
    type_name [22, 5] - [22, 9]
      type [22, 5] - [22, 9]
    type_name [22, 10] - [22, 12]
      type_variable [22, 10] - [22, 12]
    ERROR [22, 13] - [22, 14]
    type_name [22, 15] - [22, 18]
      type [22, 15] - [22, 18]
    type_parens [22, 18] - [22, 30]
      ERROR [22, 19] - [22, 26]
        ERROR [22, 19] - [22, 26]
      type_name [22, 27] - [22, 29]
        type_variable [22, 27] - [22, 29]
    type_name [24, 0] - [24, 4]
      type_variable [24, 0] - [24, 4]
    ERROR [24, 5] - [24, 7]
    type_name [24, 8] - [24, 12]
      type [24, 8] - [24, 12]
    type_name [24, 13] - [24, 14]
      type_variable [24, 13] - [24, 14]
    ERROR [24, 15] - [24, 18]
    type_name [24, 19] - [24, 20]
      type_variable [24, 19] - [24, 20]
    ERROR [24, 21] - [24, 24]
    type_name [24, 25] - [24, 29]
      type [24, 25] - [24, 29]
    type_name [24, 30] - [24, 31]
      type_variable [24, 30] - [24, 31]
    type_name [25, 0] - [25, 4]
      type_variable [25, 0] - [25, 4]
    type_name [25, 5] - [25, 7]
      type_variable [25, 5] - [25, 7]
    type_name [25, 8] - [25, 9]
      type_variable [25, 8] - [25, 9]
    ERROR [25, 10] - [25, 11]
    type_name [25, 12] - [25, 15]
      type [25, 12] - [25, 15]
    type_parens [25, 16] - [25, 37]
      type_apply [25, 17] - [25, 36]
        type_name [25, 17] - [25, 18]
          type [25, 17] - [25, 18]
        ERROR [25, 18] - [25, 21]
          ERROR [25, 18] - [25, 21]
        type_parens [25, 22] - [25, 36]
          type_apply [25, 23] - [25, 35]
            type_name [25, 23] - [25, 27]
              type [25, 23] - [25, 27]
            type_tuple [25, 28] - [25, 35]
              type_name [25, 29] - [25, 31]
                type_variable [25, 29] - [25, 31]
              comma [25, 31] - [25, 32]
              type_name [25, 33] - [25, 34]
                type_variable [25, 33] - [25, 34]

I added three more symbols for built-in syntax.

I also took a look at the symbolic operator situation, and it's a little bit more difficult.
Legal characters for these varsyms are determined by membership in unicode categories, which contain about 6000 code points in noncontiguous intervals.

We are parsing varsyms in the scanner, which means we don't have access to the unicode category regex classes that are provided by tree-sitter.
I couldn't find a method to do this in standard C, but maybe someone knows better?
For what it's worth, I tried adding a switch with 6k cases and performance only degraded by about 1%.
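
One standard-C option for the category lookup discussed above is a sorted table of inclusive code point ranges searched by bisection. The following is only a minimal sketch: the names `cp_range`, `symbol_ranges`, and `is_unicode_symbol` are hypothetical, and the three ranges are a hand-picked subset (Unicode blocks that consist of symbol characters), not the real category data, which would have to be generated from the Unicode database.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive code point range (hypothetical helper type). */
typedef struct {
  int32_t lo;
  int32_t hi;
} cp_range;

/* Illustrative subset only: must be sorted by `lo` and non-overlapping.
 * A real table would be generated from the Unicode category data. */
static const cp_range symbol_ranges[] = {
  {0x2190, 0x21FF}, /* Arrows */
  {0x2200, 0x22FF}, /* Mathematical Operators */
  {0x25A0, 0x25FF}, /* Geometric Shapes */
};

/* Binary search for the range containing `c`. */
static bool in_ranges(const cp_range *ranges, size_t n, int32_t c) {
  size_t lo = 0, hi = n;
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (c < ranges[mid].lo) {
      hi = mid;
    } else if (c > ranges[mid].hi) {
      lo = mid + 1;
    } else {
      return true;
    }
  }
  return false;
}

/* Hypothetical entry point a scanner could call on its lookahead. */
static bool is_unicode_symbol(int32_t c) {
  return in_ranges(symbol_ranges,
                   sizeof symbol_ranges / sizeof symbol_ranges[0], c);
}
```

With a table like this the cost per lookup grows with the logarithm of the number of ranges, and the data stays in one generated array rather than in thousands of `case` labels. Whether it beats the switch in practice would need measuring.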

I am not sure what the rules here are, but would it be terrible to over-approximate? (I also don't know if it would simplify things.) I would assume that allowing a larger class of Unicode symbols, one that is easier to check, would be unlikely to misparse valid Haskell.

possibly, but I'm absolutely uncertain. 6k code points in a range of 130k seems quite disproportionate, and they are spaced out pretty wide.
We could try > N for some value and test all smaller ones explicitly.
But since performance doesn't take a significant hit, we could also just put the 6k cases in a separate file in a switch and be done with it 🙃
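
As a rough illustration of the "> N" idea combined with an explicit switch, here is a sketch in C. The cutoff `SYMBOL_CUTOFF`, the function name `is_symbol_overapprox`, and the handful of non-ASCII cases are assumptions made for the example; the real list would be generated and much longer. The ASCII cases follow the `ascSymbol` set of the Haskell 2010 report.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cutoff: every code point above it is accepted unchecked. */
#define SYMBOL_CUTOFF 0x2FFF

static bool is_symbol_overapprox(int32_t c) {
  /* Over-approximation: this also admits letters above the cutoff,
   * so identifiers made of such letters would be misclassified. */
  if (c > SYMBOL_CUTOFF) return true;

  switch (c) {
    /* ASCII symbol characters (ascSymbol in the Haskell 2010 report). */
    case '!': case '#': case '$': case '%': case '&':
    case '*': case '+': case '.': case '/': case '<':
    case '=': case '>': case '?': case '@': case '\\':
    case '^': case '|': case '-': case '~': case ':':
    /* A few illustrative non-ASCII symbols; not the full list. */
    case 0x2192: /* → */
    case 0x21D2: /* ⇒ */
    case 0x25B3: /* △ */
    case 0x25BD: /* ▽ */
      return true;
    default:
      return false;
  }
}
```

The trade-off is the one raised in the thread: the cutoff keeps the explicit list short, but anything above it is treated as a symbol whether or not its Unicode category allows that.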

Your call. I would also wonder a bit how much bigger the grammar would become …

the haskell.so grows by 10kB. (total 3.6MB)

the arrow notation operators appear not to be within the categories used for the PR we just merged. also unsure about those banana brackets, they would probably need special treatment.