[TypeScript]: emitToken(token: Token) is not pushing new tokens to the tokens array in CommonTokenStream
SuraiyaBegumK opened this issue · comments
I created a parser in TypeScript for a YAML-like language (though mine is simpler than YAML). I want to handle indents and dedents when a NEWLINE token occurs.
Issue 1: I observed that every time I get the next token via super.nextToken(), it directly calls emitToken() and pushes a token into the tokens array. Because of this, the token is pushed before my checks can run (for example, I added a condition to skip whitespace using skip(), but the token has already been pushed before that line is reached).
Issue 2: When I create a Lexer instance, I can see all the tokens that were pushed (including the unnecessary ones from Issue 1). But after wrapping the lexer in a CommonTokenStream, I can no longer see the tokens I pushed in its tokens array. I noticed there are tokens inside tokenSource, but I am unable to access them.
![image](https://private-user-images.githubusercontent.com/160007426/315899379-86dced6d-95dc-489d-9ded-ce65f0b57b86.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjEzOTA0NjIsIm5iZiI6MTcyMTM5MDE2MiwicGF0aCI6Ii8xNjAwMDc0MjYvMzE1ODk5Mzc5LTg2ZGNlZDZkLTk1ZGMtNDg5ZC05ZGVkLWNlNjVmMGI1N2I4Ni5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzE5JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcxOVQxMTU2MDJaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1mMmUxY2FjY2Y2NDY5ZWVhZjAzZjk2MjQ3NzYxYjZiMjIwMTFhN2NjZDliNmI0YjdlYzcwMTM1MjZiNjhmNmU1JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.og6WKKeIgFk-3EffBXC2VGcofDCEyuu6ijxpdBQ0Fl4)
After passing Lexer to CommonTokenStream
The Lexer class declaration in ANTLR4:
```typescript
export declare class Lexer extends Recognizer {
    static DEFAULT_MODE: number;
    _input: CharStream;
    _interp: LexerATNSimulator;
    text: string;
    line: number;
    column: number;
    _tokenStartCharIndex: number;
    _tokenStartLine: number;
    _tokenStartColumn: number;
    _type: number;
    constructor(input: CharStream);
    reset(): void;
    nextToken(): Token;
    skip(): void;
    more(): void;
    more(m: number): void;
    pushMode(m: number): void;
    popMode(): number;
    emitToken(token: Token): void;
    emit(): Token;
    emitEOF(): Token;
    getAllTokens(): Token[];
}
```
The logic I implemented to push INDENT and DEDENT tokens into the tokens array:
```typescript
import { CharStream, Token, CommonToken, Lexer } from "antlr4";
import MyParser from "./MyParser";
export default class MyLexerBase extends Lexer {
/**
* Our document does not explicitly provide begin and end nesting tokens; the indentation is used to determine the nesting level.
* To solve this, we need to keep track of the indentation level and emit the INDENT and DEDENT tokens when the indentation level changes.
* Multiple DEDENT tokens may be emitted if the indentation level decreases. The same needs to be sent to the parser without any input symbols.
* e.g.
* Example 1:
* if a == 1:
* print("a")
* print("b")
* if a > 2:
* print("c")
* print("b")
*
* Output: if a == 1: \n INDENT print("a") \n print("b") \n if a > 2: \n INDENT print("c") \n DEDENT DEDENT print("b") \n EOF
*
* Example 2:
* table Sales
* column id
* datatype: Int64
* primaryKey
* summarizeBy: None
*
* Output: table Sales NEWLINE INDENT column id NEWLINE INDENT datatype : Int64 NEWLINE primaryKey NEWLINE summarizeBy : None NEWLINE DEDENT DEDENT EOF
*
*/
tokens: any[];
indents: any[];
opened: number;
constructor(input: CharStream) {
super(input);
this.tokens = [];
this.indents = [];
this.opened = 0;
}
reset() {
// A queue where extra tokens are pushed on (see the NEWLINE lexer rule).
this.tokens = [];
// The stack that keeps track of the indentation level.
this.indents = [];
// The amount of opened braces, brackets and parenthesis.
this.opened = 0;
super.reset();
}
// Override emit method to customize token emission if necessary
emitToken(token: Token) {
try {
super.emitToken(token);
this.tokens.push(token);
} catch (error) {
console.error("Error occurred during token emission");
// Handle the error as needed
}
}
// Override nextToken() to add custom logic for emitting additional tokens
public nextToken(): Token {
const nextToken: Token = super.nextToken();
if (nextToken.channel === 0) {
// If token is on default channel, handle it
// Add custom logic to emit additional INDENT and DEDENT tokens based on existing tokens
if (nextToken.type === MyParser.NEWLINE) {
// Emit additional tokens here
this.emitAdditionalToken();
// console.log("Text: NEWLINE" + ", Type: " + nextToken.type + ", Channel: " + nextToken.channel);
}
else if (nextToken.type === MyParser.WS) {
this.skip();
}
else {
// if not a NEWLINE token, return the token as-is
// console.log("Text: " + nextToken.text + ", Type: " + nextToken.type + ", Channel: " + nextToken.channel);
}
return nextToken;
} else {
// Otherwise, do nothing or handle it as per your requirement.
return nextToken;
}
}
/**
*
*/
emitAdditionalToken() {
let newLine = this.text.replace(/[^\r\n]+/g, '');
// Strip newlines inside open clauses except if we are near EOF. We keep NEWLINEs near EOF to
// satisfy the final newline needed by the single_put rule used by the REPL.
let next = this._input.LA(1);
let nextnext = this._input.LA(2);
if (this.opened > 0 || (nextnext != -1 /* EOF */ && (next === 13 /* '\r' */ || next === 10 /* '\n' */ || next === 35 /* '#' */))) {
// If we're inside a list or on a blank line, ignore all indents,
// dedents and line breaks.
this.skip();
} else {
this.emitToken(this.commonToken(MyParser.NEWLINE, newLine));
// let nextToken = this.nextToken();
// let spaces = nextToken.text.replace(/[\r\n]+/g, '');
// let cpos = this.getIndentationCount(spaces);
// Todo:
let cpos = 0;
this.emitToken(this.commonToken(MyParser.INDENT, " "));
console.log("Adding indent token to the stack.");
// console.log("Indentation count: " + cpos);
let previous = this.indents.length ? this.indents[this.indents.length - 1] : 0;
if (cpos === previous) {
// skip indents of the same size as the present indent-size
this.skip();
} else if (cpos > previous) {
console.log("Adding indent token to the stack.");
this.indents.push(cpos);
// console.log("Indentation level: " + cpos);
//this.emitToken(this.commonToken(MyParser.INDENT, spaces));
} else {
// Possibly emit more than 1 DEDENT token.
while (this.indents.length && this.indents[this.indents.length - 1] > cpos) {
this.emitToken(this.createDedent());
// console.log("Removing indent token from the stack.")
this.indents.pop();
}
}
}
}
createDedent() {
return this.commonToken(MyParser.DEDENT, "");
}
getCharIndex() {
return this._input.index;
}
commonToken(type: number, text: string) {
let stop = this.getCharIndex() - 1;
let start = text.length ? stop - text.length + 1 : stop;
return new CommonToken([this, this._input], type, 0, start, stop);
}
getIndentationCount(whitespace: string) {
let count = 0;
for (let i = 0; i < whitespace.length; i++) {
if (whitespace[i] === '\t') {
count += 8 - count % 8;
} else {
count++;
}
}
return count;
}
}
```
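The INDENT/DEDENT bookkeeping described in the header comment above can be exercised in isolation. This is a self-contained sketch of the stack algorithm only (NEWLINE tokens are omitted, and string event names stand in for real token types): compare each line's indentation with the top of a stack, emit INDENT when it grows, and pop/emit DEDENT while it shrinks.

```typescript
// Self-contained sketch of the indent-stack algorithm from the comment above.
// Event names are illustrative strings, not real lexer token types.
function indentEvents(lines: string[]): string[] {
  const indents: number[] = [0]; // the stack; the 0 sentinel is never popped
  const out: string[] = [];
  for (const line of lines) {
    const depth = line.length - line.trimStart().length; // leading spaces
    const prev = indents[indents.length - 1];
    if (depth > prev) {
      indents.push(depth);
      out.push("INDENT");
    } else {
      // possibly more than one DEDENT if several levels close at once
      while (indents.length > 1 && indents[indents.length - 1] > depth) {
        indents.pop();
        out.push("DEDENT");
      }
    }
    out.push(line.trim());
  }
  // close any indentation still open at end of input
  while (indents.length > 1) {
    indents.pop();
    out.push("DEDENT");
  }
  return out;
}

// Example 2 from the comment: nested table/column structure.
console.log(indentEvents([
  "table Sales",
  "  column id",
  "    datatype: Int64",
  "    primaryKey",
  "    summarizeBy: None",
]).join(" "));
// table Sales INDENT column id INDENT datatype: Int64 primaryKey summarizeBy: None DEDENT DEDENT
```

Note that the real lexer has to do this with tokens rather than whole lines, which is why the look-ahead bookkeeping in the class below is more involved.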
> for YAML like language
This bug applies to which grammar in this repo, grammars-v4? We don't have a grammar for yaml.
I recommend that you try the antlr4ng tool and runtime. There is a TypeScript target for ANTLR 4.13.1, but it is unlikely any fixes will be made to that code. If you see a problem with the antlr4ng runtime, you can raise a GitHub issue over there. You will need to use the antlr4ng-cli tool to generate the updated code for your parser.
I have my own grammar; I referred to the Python sample in this repo for the indent-handling logic.
```typescript
// Override emit method to customize token emission if necessary
emitToken(token: Token) {
    super.emitToken(token);
    this.tokens.push(token);
}
```
This is where I emit the custom INDENT and DEDENT tokens, but they are not getting emitted.
I don't know if this helps. I use something like the following to insert INDENT and DEDENT tokens for Python lexers with the TypeScript target. Note that there is no emitToken() involved.
```typescript
import { CharStream, Token, CommonToken, Lexer } from "antlr4";
import PythonLexer from "./PythonLexer";
import * as Collections from "typescript-collections";
export default abstract class PythonLexerBase extends Lexer {
// A stack that keeps track of the indentation lengths
private indentLengthStack!: Collections.Stack<number>;
// A list where tokens are waiting to be loaded into the token stream
private pendingTokens!: Array<Token>;
// last pending token types
private previousPendingTokenType!: number;
private lastPendingTokenTypeFromDefaultChannel!: number;
private curToken: CommonToken | undefined; // current (under processing) token
private ffgToken: Token | undefined; // following (look ahead) token
protected constructor(input: CharStream) {
super(input);
this.init();
}
private init(): void {
this.indentLengthStack = new Collections.Stack<number>();
this.pendingTokens = [];
this.previousPendingTokenType = 0;
this.lastPendingTokenTypeFromDefaultChannel = 0;
this.curToken = undefined;
this.ffgToken = undefined;
}
public nextToken(): Token { // reading the input stream until a return EOF
this.checkNextToken();
return this.pendingTokens.shift()!; // add the queued token to the token stream
}
private checkNextToken(): void {
if (this.previousPendingTokenType !== PythonLexer.EOF) {
this.setCurrentAndFollowingTokens();
if (this.indentLengthStack.isEmpty()) { // We're at the first token
this.handleStartOfInput();
}
switch (this.curToken!.type) {
case PythonLexer.NEWLINE:
this.handleNEWLINEtoken();
break;
case PythonLexer.EOF:
this.handleEOFtoken();
break;
default:
this.addPendingToken(this.curToken!);
}
}
}
private setCurrentAndFollowingTokens() {
this.curToken = this.ffgToken == undefined
? this.getCommonTokenByToken(super.nextToken())
: this.getCommonTokenByToken(this.ffgToken);
this.ffgToken = this.curToken.type === PythonLexer.EOF
? this.curToken
: this.getCommonTokenByToken(super.nextToken());
}
private handleStartOfInput() {
// initialize the stack with a default 0 indentation length
this.indentLengthStack.push(0); // this will never be popped off
}
private handleNEWLINEtoken() {
const nlToken = this.curToken!; // save the current NEWLINE token
const isLookingAhead = this.ffgToken!.type === PythonLexer.WS;
if (isLookingAhead) {
this.setCurrentAndFollowingTokens(); // set the next two tokens
}
switch (this.ffgToken!.type) {
case PythonLexer.NEWLINE: // We're before a blank line
case PythonLexer.COMMENT: // We're before a comment
this.hideAndAddPendingToken(nlToken);
if (isLookingAhead) {
this.addPendingToken(this.curToken!); // WS token
}
break;
default:
this.addPendingToken(nlToken);
if (isLookingAhead) { // We're on whitespace(s) followed by a statement
const indentationLength = this.ffgToken!.type === PythonLexer.EOF ?
0 :
this.getIndentationLength(this.curToken!.text);
this.addPendingToken(this.curToken!); // WS token
this.insertIndentOrDedentToken(indentationLength); // may insert INDENT token or DEDENT token(s)
} else { // We're at a newline followed by a statement (there is no whitespace before the statement)
this.insertIndentOrDedentToken(0); // may insert DEDENT token(s)
}
}
}
private insertIndentOrDedentToken(curIndentLength: number) {
let prevIndentLength: number = this.indentLengthStack.peek()!;
if (curIndentLength > prevIndentLength) {
this.createAndAddPendingToken(PythonLexer.INDENT, Token.DEFAULT_CHANNEL, "INDENT", this.ffgToken!);
this.indentLengthStack.push(curIndentLength);
} else {
while (curIndentLength < prevIndentLength) { // more than 1 DEDENT token may be inserted to the token stream
this.indentLengthStack.pop();
prevIndentLength = this.indentLengthStack.peek()!;
if (curIndentLength <= prevIndentLength) {
this.createAndAddPendingToken(PythonLexer.DEDENT, Token.DEFAULT_CHANNEL, "DEDENT", this.ffgToken!);
} else {
// this.reportError("inconsistent dedent");
}
}
}
}
private insertTrailingTokens() {
switch (this.lastPendingTokenTypeFromDefaultChannel) {
case PythonLexer.NEWLINE:
case PythonLexer.DEDENT:
break; // no trailing NEWLINE token is needed
default:
// insert an extra trailing NEWLINE token that serves as the end of the last statement
this.createAndAddPendingToken(PythonLexer.NEWLINE, Token.DEFAULT_CHANNEL, "NEWLINE", this.ffgToken!); // ffgToken is EOF
}
this.insertIndentOrDedentToken(0); // Now insert as much trailing DEDENT tokens as needed
}
private handleEOFtoken() {
if (this.lastPendingTokenTypeFromDefaultChannel > 0) {
// there was a statement in the input (leading NEWLINE tokens are hidden)
this.insertTrailingTokens();
}
this.addPendingToken(this.curToken!);
}
private hideAndAddPendingToken(cToken: CommonToken) {
cToken.channel = Token.HIDDEN_CHANNEL;
this.addPendingToken(cToken);
}
private createAndAddPendingToken(type: number, channel: number, text: string, baseToken: Token) {
const cToken: CommonToken = this.getCommonTokenByToken(baseToken);
cToken.type = type;
cToken.channel = channel;
cToken.stop = baseToken.start - 1;
cToken.text = text;
this.addPendingToken(cToken);
}
private addPendingToken(token: Token) {
// save the last pending token type because the pendingTokens list can be empty by the nextToken()
this.previousPendingTokenType = token.type;
if (token.channel === Token.DEFAULT_CHANNEL) {
this.lastPendingTokenTypeFromDefaultChannel = this.previousPendingTokenType;
}
this.pendingTokens.push(token);
}
private getCommonTokenByToken(oldToken: Token): CommonToken {
const cToken = new CommonToken([this, oldToken.getInputStream()], oldToken.type, oldToken.channel, oldToken.start, oldToken.stop);
cToken.tokenIndex = oldToken.tokenIndex;
cToken.line = oldToken.line;
cToken.column = oldToken.column;
cToken.text = oldToken.text;
return cToken;
}
private getIndentationLength(textWS: string): number {
const TAB_LENGTH = 8; // the standard number of spaces to replace a tab to spaces
let length = 0;
for (let ch of textWS) {
switch (ch) {
case " ":
length += 1;
break;
case "\t":
length += TAB_LENGTH - (length % TAB_LENGTH);
break;
}
}
return length;
}
public reset() {
this.init();
super.reset();
}
}
```
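The key idea in the class above is that `nextToken()` never returns `super.nextToken()` directly: everything flows through a pending-token queue, so synthetic INDENT/DEDENT tokens land in the stream in order, which is exactly what the emitToken() override in the original approach could not achieve. The following is a minimal self-contained sketch of that queue pattern only (plain objects stand in for real tokens, and a callback stands in for `super.nextToken()`):

```typescript
// Minimal sketch of the pending-token queue pattern used above.
// `underlying` stands in for super.nextToken(); real code would call the lexer.
type Tok = { type: string };

class QueueingSource {
  private pending: Tok[] = [];

  constructor(private underlying: () => Tok) {}

  nextToken(): Tok {
    if (this.pending.length === 0) {
      const t = this.underlying();
      if (t.type === "NEWLINE") {
        // enqueue the NEWLINE plus a synthetic INDENT behind it
        this.pending.push(t, { type: "INDENT" });
      } else {
        this.pending.push(t);
      }
    }
    return this.pending.shift()!; // the stream only ever sees the queue
  }
}

const input: Tok[] = [{ type: "ID" }, { type: "NEWLINE" }, { type: "EOF" }];
const src = new QueueingSource(() => input.shift()!);
const seen: string[] = [];
for (let t = src.nextToken(); ; t = src.nextToken()) {
  seen.push(t.type);
  if (t.type === "EOF") break;
}
console.log(seen.join(" ")); // ID NEWLINE INDENT EOF
```

Because the consumer only ever sees what comes out of the queue, any number of synthetic tokens can be spliced into the stream at the right position, which is why this pattern works where pushing onto a side array does not.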
Hello @RobEin,
Thank you for sharing the detailed logic.
However, I don't see DEFAULT_CHANNEL/HIDDEN_CHANNEL in Token.d.ts. Could you please help me out here @RobEin?
![image](https://private-user-images.githubusercontent.com/160007426/317174125-a55abf9d-ad05-4fe2-9fbc-da27a85a961f.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjEzOTA0NjIsIm5iZiI6MTcyMTM5MDE2MiwicGF0aCI6Ii8xNjAwMDc0MjYvMzE3MTc0MTI1LWE1NWFiZjlkLWFkMDUtNGZlMi05ZmJjLWRhMjdhODVhOTYxZi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzE5JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcxOVQxMTU2MDJaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT0xNWEzZGE2OTE1NWZkZGMwMTNjNWU2ZmI0ZmZkZTM3MDUxYzJjYWY0ZDM3YzAxYWMxOTczNTc3YTk2MWVjNGY2JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.JMEgMcsOrjeok3Y5yUgOQDUZrExFKE2wiQ_jUkOMDYs)
Token.d.ts:
```typescript
import {CharStream} from "./CharStream";

export declare class Token {
    static EOF: number;
    tokenIndex: number;
    line: number;
    column: number;
    channel: number;
    text: string;
    type: number;
    start: number;
    stop: number;
    clone(): Token;
    cloneWithType(type: number): Token;
    getInputStream(): CharStream;
}
```
Already added to Token.d.ts. In principle, it will be included in the next ANTLR release (4.13.2). Until then, feel free to use this instead of the current one.
You can rebuild it with:

```shell
cd .\node_modules\antlr4
npm run build
```

Although a build may not be necessary if only the two constants (DEFAULT_CHANNEL/HIDDEN_CHANNEL) are inserted.
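Until a release ships the updated typings, another workaround (besides patching Token.d.ts) is a local shim with the standard ANTLR channel values; DEFAULT_CHANNEL is 0 and HIDDEN_CHANNEL is 1 across the ANTLR runtimes. This is a sketch of such a shim, not part of the antlr4 package itself:

```typescript
// Local shim for the two channel constants the ANTLR runtime defines.
// Values follow the ANTLR Token convention: 0 = default, 1 = hidden.
const DEFAULT_CHANNEL = 0; // tokens the parser consumes
const HIDDEN_CHANNEL = 1;  // tokens the parser skips (whitespace, comments)

// Usage sketch: route a synthetic token onto the hidden channel.
const channelForWhitespace = HIDDEN_CHANNEL;
console.log(DEFAULT_CHANNEL, channelForWhitespace); // 0 1
```

In a real project you would `export` these from a small module (or patch Token.d.ts as suggested above) and import them wherever the lexer base class needs them.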
I have added the constants for DEFAULT_CHANNEL/HIDDEN_CHANNEL and it worked. Thank you @RobEin!