Chevrotain / chevrotain

Parser Building Toolkit for JavaScript

Home Page: https://chevrotain.io

Flawed logic in RecognizerEngine

anesterenok opened this issue · comments

Running this code throws an exception:

const { EmbeddedActionsParser } = require("chevrotain")

// Token definitions as plain objects (not created via createToken())
const tokens = [
  { name: 'ID', PATTERN: /[_a-zA-Z][\w_]*/ },
  { name: 'INT', PATTERN: /[-]?[0-9]+/ },
  { name: 'STRING', PATTERN: /"[^"]*"|'[^']*'/ },
  { name: 'ML_COMMENT', GROUP: 'hidden', PATTERN: /\/\*[\s\S]*?\*\// },
  { name: 'SL_COMMENT', GROUP: 'hidden', PATTERN: /\/\/[^\n\r]*/ },
]

// Each key defines a Lexer mode's name.
// And each value is an array of Tokens which are valid in this Lexer mode.
const mode1 = [
  tokens[0],
  tokens[1],
  tokens[2],
]

const mode2 = [
  tokens[2],
  tokens[3],
  tokens[4],
]

const tokenVocabulary = {
  modes: {
    default: mode1,
    mline: mode2
  },
  defaultMode: "default"
}

const parser = new EmbeddedActionsParser(tokenVocabulary)

Output:

            currTokType.tokenTypeIdx = exports.tokenShortNameIdx++;
                                     ^

TypeError: Cannot create property 'tokenTypeIdx' on string 'default'
    at C:\Projects\vscode\test\node_modules\chevrotain\lib\src\scan\tokens.js:71:38
    at arrayEach (C:\Projects\vscode\test\node_modules\lodash\_arrayEach.js:15:9)
    at forEach (C:\Projects\vscode\test\node_modules\lodash\forEach.js:38:10)
    at assignTokenDefaultProps (C:\Projects\vscode\test\node_modules\chevrotain\lib\src\scan\tokens.js:68:27)
    at augmentTokenTypes (C:\Projects\vscode\test\node_modules\chevrotain\lib\src\scan\tokens.js:40:5)
    at EmbeddedActionsParser.RecognizerEngine.initRecognizerEngine (C:\Projects\vscode\test\node_modules\chevrotain\lib\src\parse\parser\traits\recognizer_engine.js:103:40)
    at EmbeddedActionsParser.Parser (C:\Projects\vscode\test\node_modules\chevrotain\lib\src\parse\parser\parser.js:91:14)
    at new EmbeddedActionsParser (C:\Projects\vscode\test\node_modules\chevrotain\lib\src\parse\parser\parser.js:224:23)
    at Object.<anonymous> (C:\Projects\vscode\test\index_bug1.js:34:16)
    at Module._compile (node:internal/modules/cjs/loader:1126:14)

This is happening because RecognizerEngine

// We cannot assume that the Token classes were created using the "extendToken" utilities

does not assume that the input TokenTypes are already augmented with certain technical fields.
This is actually a good thing, because other products built on Chevrotain (e.g. Langium) create TokenTypes without calling createToken() or any similar utility.

But for multi-mode grammars, the check

every(flatten(values((<any>tokenVocabulary).modes)), isTokenType)

prevents line 180 from ever being reached, because isTokenType only tests whether the tokenTypeIdx attribute exists, and that attribute cannot exist before augmentation.

So, in effect, the token definitions must already be augmented in order to reach the code that would augment them.
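
To make the failure concrete, here is a rough sketch of that guard. It is only an illustration of the behaviour described above, not the actual Chevrotain source:

// Rough illustration only (not Chevrotain's implementation):
// isTokenType effectively asks "has this token already been augmented?"
const isTokenType = (tokType) => tokType != null && tokType.tokenTypeIdx !== undefined

// A plain descriptor fails the check...
isTokenType({ name: 'ID', PATTERN: /[_a-zA-Z][\w_]*/ }) // false

// ...so the multi-mode branch is skipped and the whole vocabulary object
// falls through to a branch that does not expect the "defaultMode" string,
// which would explain the TypeError on the string 'default' shown above.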

I'm not sure what the best change would be in that regard, so please fix it as you see fit.

Hey @anesterenok,

Talking purely about the Langium use case, this is something that we've fixed in eclipse-langium/langium#579 and the fix is already available in one of the newer snapshot versions for the upcoming 0.5.0 release.

Basically, Chevrotain expects that any tokens that are passed to a parser instance/constructor are always part of a lexer first, so maybe we should check for that in the RecognizerEngine code.
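
For illustration, a hypothetical sketch of that expectation (an assumption on my part, not a verified workaround): the idea is that the Lexer constructor augments the token types it is given, so building a Lexer from the same multi-mode definition before constructing the parser would let the descriptors reach RecognizerEngine already augmented.

const { Lexer, EmbeddedActionsParser } = require("chevrotain")

// Hypothetical order of construction, using the tokenVocabulary from the
// repro above: create the Lexer first so the plain token descriptors get
// augmented before the parser sees them. Untested against this exact repro.
const lexer = new Lexer(tokenVocabulary)
const parser = new EmbeddedActionsParser(tokenVocabulary)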

Hi @anesterenok

Thanks for providing a reproducible example and your analysis; this will speed up debugging.
Perhaps the augmentation should be done earlier in that code, though I am not sure of the side effects.

I will investigate this.

Sorry for the long delay...

In any case, you should create Tokens using the createToken() API, not as plain JavaScript objects.
It performs additional logic behind the scenes and augments the Token descriptor objects with additional properties.
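
For reference, a minimal sketch of the same multi-mode vocabulary rebuilt with createToken(). This is not code from the issue; the parser is constructed bare (without grammar rules), exactly as in the repro, and the "hidden" group name simply mirrors the original descriptors:

const { createToken, EmbeddedActionsParser } = require("chevrotain")

// createToken() augments each TokenType with internal properties
// (e.g. tokenTypeIdx) that the parser relies on.
const ID = createToken({ name: "ID", pattern: /[_a-zA-Z][\w_]*/ })
const INT = createToken({ name: "INT", pattern: /[-]?[0-9]+/ })
const STRING = createToken({ name: "STRING", pattern: /"[^"]*"|'[^']*'/ })
const ML_COMMENT = createToken({ name: "ML_COMMENT", group: "hidden", pattern: /\/\*[\s\S]*?\*\// })
const SL_COMMENT = createToken({ name: "SL_COMMENT", group: "hidden", pattern: /\/\/[^\n\r]*/ })

const tokenVocabulary = {
  modes: {
    default: [ID, INT, STRING],
    mline: [STRING, ML_COMMENT, SL_COMMENT]
  },
  defaultMode: "default"
}

// The multi-mode check in RecognizerEngine now passes, because every
// entry is a properly augmented TokenType.
const parser = new EmbeddedActionsParser(tokenVocabulary)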

I can reproduce the issue with the plain token objects, however this is simply not how the API is meant to be used.
So I will be closing this issue...