[TypeScript]: emitToken(token: Token) is not pushing new tokens to the tokens array in CommonTokenStream
SuraiyaBegumK opened this issue · comments
I created a parser in TypeScript for a YAML-like language (though mine is simpler than YAML). I want to handle indents and dedents when a NEWLINE token occurs.
Issue 1: I observed that every time I get the next token via super.nextToken(), it directly calls emitToken() and pushes a token into the tokens array. Because of this, the token is pushed before my checks can run (for example, I added a condition to skip whitespace using skip(), but the token has already been pushed before that line is reached).
Issue 2: When I create a Lexer instance, I can see all the tokens that were pushed (including the unnecessary ones from Issue 1). But after wrapping the lexer in a CommonTokenStream, I can no longer see the tokens I pushed in its tokens array. I noticed there are tokens inside tokenSource, but I am unable to access them.
![image](https://private-user-images.githubusercontent.com/160007426/315899379-86dced6d-95dc-489d-9ded-ce65f0b57b86.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjEzOTA0NjIsIm5iZiI6MTcyMTM5MDE2MiwicGF0aCI6Ii8xNjAwMDc0MjYvMzE1ODk5Mzc5LTg2ZGNlZDZkLTk1ZGMtNDg5ZC05ZGVkLWNlNjVmMGI1N2I4Ni5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzE5JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcxOVQxMTU2MDJaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1mMmUxY2FjY2Y2NDY5ZWVhZjAzZjk2MjQ3NzYxYjZiMjIwMTFhN2NjZDliNmI0YjdlYzcwMTM1MjZiNjhmNmU1JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.og6WKKeIgFk-3EffBXC2VGcofDCEyuu6ijxpdBQ0Fl4)
After passing Lexer to CommonTokenStream
The Lexer class declaration in ANTLR4:
```typescript
export declare class Lexer extends Recognizer {
    static DEFAULT_MODE: number;
    _input: CharStream;
    _interp: LexerATNSimulator;
    text: string;
    line: number;
    column: number;
    _tokenStartCharIndex: number;
    _tokenStartLine: number;
    _tokenStartColumn: number;
    _type: number;
    constructor(input: CharStream);
    reset(): void;
    nextToken(): Token;
    skip(): void;
    more(): void;
    more(m: number): void;
    pushMode(m: number): void;
    popMode(): number;
    emitToken(token: Token): void;
    emit(): Token;
    emitEOF(): Token;
    getAllTokens(): Token[];
}
```
The logic I implemented to push INDENT and DEDENT tokens into the tokens array:
```typescript
import { CharStream, Token, CommonToken, Lexer } from "antlr4";
import MyParser from "./MyParser";
export default class MyLexerBase extends Lexer {
/**
* Our document does not explicitly provide begin and end nesting tokens; the indentation is used to determine the nesting level.
* To solve this, we need to keep track of the indentation level and emit the INDENT and DEDENT tokens when the indentation level changes.
* Multiple DEDENT tokens may be emitted if the indentation level decreases. The same needs to be sent to the parser without any input symbols.
* e.g.
* Example 1:
* if a == 1:
* print("a")
* print("b")
* if a > 2:
* print("c")
* print("b")
*
* Output: if a == 1: \n INDENT print("a") \n print("b") \n if a > 2: \n INDENT print("c") \n DEDENT DEDENT print("b") \n EOF
*
* Example 2:
* table Sales
* column id
* datatype: Int64
* primaryKey
* summarizeBy: None
*
* Output: table Sales NEWLINE INDENT column id NEWLINE INDENT datatype : Int64 NEWLINE primaryKey NEWLINE summarizeBy : None NEWLINE DEDENT DEDENT EOF
*
*/
tokens: any[];
indents: any[];
opened: number;
constructor(input: CharStream) {
super(input);
this.tokens = [];
this.indents = [];
this.opened = 0;
}
reset() {
// A queue where extra tokens are pushed on (see the NEWLINE lexer rule).
this.tokens = [];
// The stack that keeps track of the indentation level.
this.indents = [];
// The amount of opened braces, brackets and parenthesis.
this.opened = 0;
super.reset();
}
// Override emit method to customize token emission if necessary
emitToken(token: Token) {
try {
super.emitToken(token);
this.tokens.push(token);
} catch (error) {
console.error("Error occurred during token emission");
// Handle the error as needed
}
}
// Override nextToken() to add custom logic for emitting additional tokens
public nextToken(): Token {
const nextToken: Token = super.nextToken();
if (nextToken.channel === 0) {
// If token is on default channel, handle it
// Add custom logic to emit additional INDENT and DEDENT tokens based on existing tokens
if (nextToken.type === MyParser.NEWLINE) {
// Emit additional tokens here
this.emitAdditionalToken();
// console.log("Text: NEWLINE" + ", Type: " + nextToken.type + ", Channel: " + nextToken.channel);
}
else if (nextToken.type === MyParser.WS) {
this.skip();
}
else {
// if not a NEWLINE token, return the token as-is
// console.log("Text: " + nextToken.text + ", Type: " + nextToken.type + ", Channel: " + nextToken.channel);
}
return nextToken;
} else {
// Otherwise, do nothing or handle it as per your requirement.
return nextToken;
}
}
/**
*
*/
emitAdditionalToken() {
let newLine = this.text.replace(/[^\r\n]+/g, '');
// Strip newlines inside open clauses except if we are near EOF. We keep NEWLINEs near EOF to
// satisfy the final newline needed by the single_put rule used by the REPL.
let next = this._input.LA(1);
let nextnext = this._input.LA(2);
if (this.opened > 0 || (nextnext != -1 /* EOF */ && (next === 13 /* '\r' */ || next === 10 /* '\n' */ || next === 35 /* '#' */))) {
// If we're inside a list or on a blank line, ignore all indents,
// dedents and line breaks.
this.skip();
} else {
this.emitToken(this.commonToken(MyParser.NEWLINE, newLine));
// let nextToken = this.nextToken();
// let spaces = nextToken.text.replace(/[\r\n]+/g, '');
// let cpos = this.getIndentationCount(spaces);
// Todo:
let cpos = 0;
this.emitToken(this.commonToken(MyParser.INDENT, " "));
console.log("Adding indent token to the stack.");
// console.log("Indentation count: " + cpos);
let previous = this.indents.length ? this.indents[this.indents.length - 1] : 0;
if (cpos === previous) {
// skip indents of the same size as the present indent-size
this.skip();
} else if (cpos > previous) {
console.log("Adding indent token to the stack.");
this.indents.push(cpos);
// console.log("Indentation level: " + cpos);
//this.emitToken(this.commonToken(MyParser.INDENT, spaces));
} else {
// Possibly emit more than 1 DEDENT token.
while (this.indents.length && this.indents[this.indents.length - 1] > cpos) {
this.emitToken(this.createDedent());
// console.log("Removing indent token from the stack.")
this.indents.pop();
}
}
}
}
createDedent() {
return this.commonToken(MyParser.DEDENT, "");
}
getCharIndex() {
return this._input.index;
}
commonToken(type: number, text: string) {
let stop = this.getCharIndex() - 1;
let start = text.length ? stop - text.length + 1 : stop;
return new CommonToken([this, this._input], type, 0, start, stop);
}
getIndentationCount(whitespace: string) {
let count = 0;
for (let i = 0; i < whitespace.length; i++) {
if (whitespace[i] === '\t') {
count += 8 - count % 8;
} else {
count++;
}
}
return count;
}
}
```
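The INDENT/DEDENT bookkeeping described in the header comment above can be exercised in isolation. This is a self-contained sketch of the stack algorithm only (NEWLINE tokens are omitted, and string event names stand in for real token types): compare each line's indentation with the top of a stack, emit INDENT when it grows, and pop/emit DEDENT while it shrinks.

```typescript
// Self-contained sketch of the indent-stack algorithm from the comment above.
// Event names are illustrative strings, not real lexer token types.
function indentEvents(lines: string[]): string[] {
  const indents: number[] = [0]; // the stack; the 0 sentinel is never popped
  const out: string[] = [];
  for (const line of lines) {
    const depth = line.length - line.trimStart().length; // leading spaces
    const prev = indents[indents.length - 1];
    if (depth > prev) {
      indents.push(depth);
      out.push("INDENT");
    } else {
      // possibly more than one DEDENT if several levels close at once
      while (indents.length > 1 && indents[indents.length - 1] > depth) {
        indents.pop();
        out.push("DEDENT");
      }
    }
    out.push(line.trim());
  }
  // close any indentation still open at end of input
  while (indents.length > 1) {
    indents.pop();
    out.push("DEDENT");
  }
  return out;
}

// Example 2 from the comment: nested table/column structure.
console.log(indentEvents([
  "table Sales",
  "  column id",
  "    datatype: Int64",
  "    primaryKey",
  "    summarizeBy: None",
]).join(" "));
// table Sales INDENT column id INDENT datatype: Int64 primaryKey summarizeBy: None DEDENT DEDENT
```

Note that the real lexer has to do this with tokens rather than whole lines, which is why the look-ahead bookkeeping in the class below is more involved.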
> for YAML like language
This bug applies to which grammar in this repo, grammars-v4? We don't have a grammar for yaml.
I recommend that you try the antlr4ng tool and runtime. There is a TypeScript target for ANTLR 4.13.1, but it is unlikely any fixes will be made to that code. If you see a problem with the antlr4ng runtime, you can raise a GitHub issue over there. You will need to use the antlr4ng-cli tool to generate the updated code for your parser.
I have my own grammar; I referred to the Python sample in this repo for the indent-handling logic.
```typescript
// Override emit method to customize token emission if necessary
emitToken(token: Token) {
    super.emitToken(token);
    this.tokens.push(token);
}
```
This is where I emit the custom INDENT and DEDENT tokens, but they are not getting emitted.
I don't know if this helps. I use something like the following to insert INDENT and DEDENT tokens for Python lexers with the TypeScript target. Note that there is no emitToken() involved.
```typescript
import { CharStream, Token, CommonToken, Lexer } from "antlr4";
import PythonLexer from "./PythonLexer";
import * as Collections from "typescript-collections";
export default abstract class PythonLexerBase extends Lexer {
// A stack that keeps track of the indentation lengths
private indentLengthStack!: Collections.Stack<number>;
// A list where tokens are waiting to be loaded into the token stream
private pendingTokens!: Array<Token>;
// last pending token types
private previousPendingTokenType!: number;
private lastPendingTokenTypeFromDefaultChannel!: number;
private curToken: CommonToken | undefined; // current (under processing) token
private ffgToken: Token | undefined; // following (look ahead) token
protected constructor(input: CharStream) {
super(input);
this.init();
}
private init(): void {
this.indentLengthStack = new Collections.Stack<number>();
this.pendingTokens = [];
this.previousPendingTokenType = 0;
this.lastPendingTokenTypeFromDefaultChannel = 0;
this.curToken = undefined;
this.ffgToken = undefined;
}
public nextToken(): Token { // reading the input stream until a return EOF
this.checkNextToken();
return this.pendingTokens.shift()!; // add the queued token to the token stream
}
private checkNextToken(): void {
if (this.previousPendingTokenType !== PythonLexer.EOF) {
this.setCurrentAndFollowingTokens();
if (this.indentLengthStack.isEmpty()) { // We're at the first token
this.handleStartOfInput();
}
switch (this.curToken!.type) {
case PythonLexer.NEWLINE:
this.handleNEWLINEtoken();
break;
case PythonLexer.EOF:
this.handleEOFtoken();
break;
default:
this.addPendingToken(this.curToken!);
}
}
}
private setCurrentAndFollowingTokens() {
this.curToken = this.ffgToken == undefined
? this.getCommonTokenByToken(super.nextToken())
: this.getCommonTokenByToken(this.ffgToken);
this.ffgToken = this.curToken.type === PythonLexer.EOF
? this.curToken
: this.getCommonTokenByToken(super.nextToken());
}
private handleStartOfInput() {
// initialize the stack with a default 0 indentation length
this.indentLengthStack.push(0); // this will never be popped off
}
private handleNEWLINEtoken() {
const nlToken = this.curToken!; // save the current NEWLINE token
const isLookingAhead = this.ffgToken!.type === PythonLexer.WS;
if (isLookingAhead) {
this.setCurrentAndFollowingTokens(); // set the next two tokens
}
switch (this.ffgToken!.type) {
case PythonLexer.NEWLINE: // We're before a blank line
case PythonLexer.COMMENT: // We're before a comment
this.hideAndAddPendingToken(nlToken);
if (isLookingAhead) {
this.addPendingToken(this.curToken!); // WS token
}
break;
default:
this.addPendingToken(nlToken);
if (isLookingAhead) { // We're on whitespace(s) followed by a statement
const indentationLength = this.ffgToken!.type === PythonLexer.EOF ?
0 :
this.getIndentationLength(this.curToken!.text);
this.addPendingToken(this.curToken!); // WS token
this.insertIndentOrDedentToken(indentationLength); // may insert INDENT token or DEDENT token(s)
} else { // We're at a newline followed by a statement (there is no whitespace before the statement)
this.insertIndentOrDedentToken(0); // may insert DEDENT token(s)
}
}
}
private insertIndentOrDedentToken(curIndentLength: number) {
let prevIndentLength: number = this.indentLengthStack.peek()!;
if (curIndentLength > prevIndentLength) {
this.createAndAddPendingToken(PythonLexer.INDENT, Token.DEFAULT_CHANNEL, "INDENT", this.ffgToken!);
this.indentLengthStack.push(curIndentLength);
} else {
while (curIndentLength < prevIndentLength) { // more than 1 DEDENT token may be inserted to the token stream
this.indentLengthStack.pop();
prevIndentLength = this.indentLengthStack.peek()!;
if (curIndentLength <= prevIndentLength) {
this.createAndAddPendingToken(PythonLexer.DEDENT, Token.DEFAULT_CHANNEL, "DEDENT", this.ffgToken!);
} else {
// this.reportError("inconsistent dedent");
}
}
}
}
private insertTrailingTokens() {
switch (this.lastPendingTokenTypeFromDefaultChannel) {
case PythonLexer.NEWLINE:
case PythonLexer.DEDENT:
break; // no trailing NEWLINE token is needed
default:
// insert an extra trailing NEWLINE token that serves as the end of the last statement
this.createAndAddPendingToken(PythonLexer.NEWLINE, Token.DEFAULT_CHANNEL, "NEWLINE", this.ffgToken!); // ffgToken is EOF
}
this.insertIndentOrDedentToken(0); // Now insert as much trailing DEDENT tokens as needed
}
private handleEOFtoken() {
if (this.lastPendingTokenTypeFromDefaultChannel > 0) {
// there was a statement in the input (leading NEWLINE tokens are hidden)
this.insertTrailingTokens();
}
this.addPendingToken(this.curToken!);
}
private hideAndAddPendingToken(cToken: CommonToken) {
cToken.channel = Token.HIDDEN_CHANNEL;
this.addPendingToken(cToken);
}
private createAndAddPendingToken(type: number, channel: number, text: string, baseToken: Token) {
const cToken: CommonToken = this.getCommonTokenByToken(baseToken);
cToken.type = type;
cToken.channel = channel;
cToken.stop = baseToken.start - 1;
cToken.text = text;
this.addPendingToken(cToken);
}
private addPendingToken(token: Token) {
// save the last pending token type because the pendingTokens list can be empty by the nextToken()
this.previousPendingTokenType = token.type;
if (token.channel === Token.DEFAULT_CHANNEL) {
this.lastPendingTokenTypeFromDefaultChannel = this.previousPendingTokenType;
}
this.pendingTokens.push(token);
}
private getCommonTokenByToken(oldToken: Token): CommonToken {
const cToken = new CommonToken([this, oldToken.getInputStream()], oldToken.type, oldToken.channel, oldToken.start, oldToken.stop);
cToken.tokenIndex = oldToken.tokenIndex;
cToken.line = oldToken.line;
cToken.column = oldToken.column;
cToken.text = oldToken.text;
return cToken;
}
private getIndentationLength(textWS: string): number {
const TAB_LENGTH = 8; // the standard number of spaces to replace a tab to spaces
let length = 0;
for (let ch of textWS) {
switch (ch) {
case " ":
length += 1;
break;
case "\t":
length += TAB_LENGTH - (length % TAB_LENGTH);
break;
}
}
return length;
}
public reset() {
this.init();
super.reset();
}
}
```
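The key idea in the class above is that `nextToken()` never returns `super.nextToken()` directly: everything flows through a pending-token queue, so synthetic INDENT/DEDENT tokens land in the stream in order, which is exactly what the emitToken() override in the original approach could not achieve. The following is a minimal self-contained sketch of that queue pattern only (plain objects stand in for real tokens, and a callback stands in for `super.nextToken()`):

```typescript
// Minimal sketch of the pending-token queue pattern used above.
// `underlying` stands in for super.nextToken(); real code would call the lexer.
type Tok = { type: string };

class QueueingSource {
  private pending: Tok[] = [];

  constructor(private underlying: () => Tok) {}

  nextToken(): Tok {
    if (this.pending.length === 0) {
      const t = this.underlying();
      if (t.type === "NEWLINE") {
        // enqueue the NEWLINE plus a synthetic INDENT behind it
        this.pending.push(t, { type: "INDENT" });
      } else {
        this.pending.push(t);
      }
    }
    return this.pending.shift()!; // the stream only ever sees the queue
  }
}

const input: Tok[] = [{ type: "ID" }, { type: "NEWLINE" }, { type: "EOF" }];
const src = new QueueingSource(() => input.shift()!);
const seen: string[] = [];
for (let t = src.nextToken(); ; t = src.nextToken()) {
  seen.push(t.type);
  if (t.type === "EOF") break;
}
console.log(seen.join(" ")); // ID NEWLINE INDENT EOF
```

Because the consumer only ever sees what comes out of the queue, any number of synthetic tokens can be spliced into the stream at the right position, which is why this pattern works where pushing onto a side array does not.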
Hello @RobEin,
Thank you for sharing the detailed logic.
However, I don't see DEFAULT_CHANNEL/HIDDEN_CHANNEL in Token.d.ts. Could you please help me out here @RobEin?
![image](https://private-user-images.githubusercontent.com/160007426/317174125-a55abf9d-ad05-4fe2-9fbc-da27a85a961f.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjEzOTA0NjIsIm5iZiI6MTcyMTM5MDE2MiwicGF0aCI6Ii8xNjAwMDc0MjYvMzE3MTc0MTI1LWE1NWFiZjlkLWFkMDUtNGZlMi05ZmJjLWRhMjdhODVhOTYxZi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzE5JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcxOVQxMTU2MDJaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT0xNWEzZGE2OTE1NWZkZGMwMTNjNWU2ZmI0ZmZkZTM3MDUxYzJjYWY0ZDM3YzAxYWMxOTczNTc3YTk2MWVjNGY2JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.JMEgMcsOrjeok3Y5yUgOQDUZrExFKE2wiQ_jUkOMDYs)
Token.d.ts:
```typescript
import {CharStream} from "./CharStream";

export declare class Token {
    static EOF: number;
    tokenIndex: number;
    line: number;
    column: number;
    channel: number;
    text: string;
    type: number;
    start: number;
    stop: number;
    clone(): Token;
    cloneWithType(type: number): Token;
    getInputStream(): CharStream;
}
```
Already added to Token.d.ts. In principle, it will be included in the next ANTLR release (4.13.2). Until then, feel free to use this instead of the current one.
You can rebuild it with:

```shell
cd .\node_modules\antlr4
npm run build
```

Although a build may not be necessary if only the two constants (DEFAULT_CHANNEL/HIDDEN_CHANNEL) are inserted.
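Until a release ships the updated typings, another workaround (besides patching Token.d.ts) is a local shim with the standard ANTLR channel values; DEFAULT_CHANNEL is 0 and HIDDEN_CHANNEL is 1 across the ANTLR runtimes. This is a sketch of such a shim, not part of the antlr4 package itself:

```typescript
// Local shim for the two channel constants the ANTLR runtime defines.
// Values follow the ANTLR Token convention: 0 = default, 1 = hidden.
const DEFAULT_CHANNEL = 0; // tokens the parser consumes
const HIDDEN_CHANNEL = 1;  // tokens the parser skips (whitespace, comments)

// Usage sketch: route a synthetic token onto the hidden channel.
const channelForWhitespace = HIDDEN_CHANNEL;
console.log(DEFAULT_CHANNEL, channelForWhitespace); // 0 1
```

In a real project you would `export` these from a small module (or patch Token.d.ts as suggested above) and import them wherever the lexer base class needs them.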
I have added the constants for DEFAULT_CHANNEL/HIDDEN_CHANNEL and it worked. Thank you @RobEin!