bobbylight / RSyntaxTextArea

A syntax highlighting, code folding text editor for Java Swing applications.

How does line break work when implementing a new TokenMaker?

Essay97 opened this issue · comments

Hi, I'm sorry if this question sounds stupid; I'm sure I could find the answer somewhere in the docs, but I tried and really can't wrap my head around it.
I'm implementing a custom TokenMaker "by hand" (it's kind of an exercise for me) and, reading the docs, I would have expected the text parameter of getTokenList() to "reset" on every line break, but it seems to always contain the whole document. I don't think it matters, but I'm extending AbstractTokenMaker using Kotlin.

  • Is this the expected behavior?
  • Does this mean that the whole document gets parsed on each keystroke? Do I have to somehow manage performance for large documents?

Also, a side question, completely unrelated: I don't understand how I should implement auto-indentation.

Thanks in advance!

Hi, I wrote my own TSQLTokenMaker by hand (custom state handling with a bunch of ifs, etc.) and it was kind of a pain in the backside :D But it's possible.

getTokenList accepts (Segment text, int initialTokenType, int startOffset)

Segment is a "view" over the underlying char array containing your document.
So the char array inside the segment might contain the whole text or just part of it, depending on the implementation. When you handle the segment, you are only allowed to "look" at the range delimited by Segment.offset and Segment.count: offset is the start index into the char array, and offset + count is the (exclusive) end index.

char[] array = text.array;  // backing buffer (may hold more than this segment)
int offset = text.offset;   // first index we may read
int count = text.count;     // number of chars belonging to this segment
int end = offset + count;   // exclusive end index
for (int i = offset; i < end; i++) {
    // Do stuff with array[i]
}
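Since Segment is a standard javax.swing.text class, you can experiment with these bounds outside the editor. Here's a minimal, self-contained sketch (the buffer contents and the helper method are made up for illustration):

```java
import javax.swing.text.Segment;

public class SegmentDemo {

    // Returns only the "visible" window [offset, offset + count) of a
    // segment over the given buffer, as a String.
    static String visibleText(char[] array, int offset, int count) {
        Segment text = new Segment(array, offset, count);
        int end = text.offset + text.count;
        StringBuilder sb = new StringBuilder();
        for (int i = text.offset; i < end; i++) {
            sb.append(text.array[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The buffer holds two "lines", but the segment only exposes the first.
        char[] doc = "SELECT *\nFROM t".toCharArray();
        System.out.println(visibleText(doc, 0, 8)); // prints "SELECT *"
    }
}
```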

initialTokenType is for multiline tokens that hang over from previous lines. Look at it as if you were parsing the previous line and its last token wasn't finished, like SELECT 'blablabla
blablab'
When you parse blablab', RSyntaxTextArea will pass you the previous token's "type" so you know you should continue parsing a character string, and not just start off from scratch.
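To make that "resume" logic concrete, here is a tiny self-contained sketch. The constants and the scanner are hypothetical, not RSyntaxTextArea's API; the point is only that the initial type tells you which state to start the line in:

```java
public class ResumeSketch {

    static final int NULL_TYPE = 0;    // hypothetical "normal code" state
    static final int CHAR_LITERAL = 1; // hypothetical "inside '...'" state

    // Scans one line starting in the given state and returns the state we
    // are in at the end of it, so the caller can pass that value as the
    // initialTokenType for the next line.
    static int scanLine(String line, int initialTokenType) {
        int state = initialTokenType;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (state == CHAR_LITERAL) {
                if (c == '\'') state = NULL_TYPE; // string closed
            } else if (c == '\'') {
                state = CHAR_LITERAL;             // string opened
            }
        }
        return state;
    }

    public static void main(String[] args) {
        // SELECT 'blablabla   <- string still open at the end of the line
        int afterLine1 = scanLine("SELECT 'blablabla", NULL_TYPE);
        // blablab'            <- resumes in the string state, then closes it
        int afterLine2 = scanLine("blablab'", afterLine1);
        System.out.println(afterLine1 + " " + afterLine2); // prints "1 0"
    }
}
```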

Finally, startOffset is the "global" document offset of the segment. The tokens you produce have start/end indices local to the segment, so you have to add startOffset when computing each token's offset in the document:

        List<ISQLToken> l = TSQLLexer.parse(array, offset, end, ts /* previous state */);
        ISQLToken t;
        for (int i = 0; i < l.size(); i++) {
            t = l.get(i);
            // Token start/end are segment-local; the last argument is the
            // token's offset in the document, hence start + startOffset.
            addToken(text, t.getStart(), t.getEnd(), t.getTokenType(), t.getStart() + startOffset);
        }

Then there's other stuff, like nested comments spanning multiple lines, etc... :D Basically, it's a mess.

I think you can look at how a couple of the other TokenMakers are implemented to get an idea of what they're doing.

For auto-indentation, you implement an Action and return it from your TokenMaker's getInsertBreakAction(). See for example AbstractJFlexCTokenMaker.
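Independent of the library plumbing, the core of what such an insert-break action typically computes is just the whitespace to put after the newline. A self-contained sketch of one common policy (the method names here are made up, and the "indent after a trailing brace" rule is just the usual C-style choice):

```java
public class IndentSketch {

    // Computes the whitespace to insert after a line break: the previous
    // line's leading whitespace, plus one extra level if that line ends
    // with an opening curly brace.
    static String indentForNextLine(String prevLine, String oneIndent) {
        int i = 0;
        while (i < prevLine.length()
                && (prevLine.charAt(i) == ' ' || prevLine.charAt(i) == '\t')) {
            i++;
        }
        String leading = prevLine.substring(0, i);
        return prevLine.trim().endsWith("{") ? leading + oneIndent : leading;
    }

    public static void main(String[] args) {
        // Previous line is indented 4 spaces and opens a block,
        // so the next line gets 8 spaces.
        System.out.println("[" + indentForNextLine("    if (x) {", "    ") + "]");
        // prints "[        ]"
    }
}
```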

@Essay97

> Also, side question completely unrelated: I don't get how should I implement auto indentation

I also don't know how to do the auto-indentation part, but you should have already sent a notification to @bobbylight about this issue!

@siggemannen is right about Segments: they are a Swing class designed to let you read text from JTextComponents without creating a lot of Strings. For performance reasons, the Swing team was effectively cheating and giving you an offset-and-length pointer into the text component's text, rather than allocating memory for a String copy, since a lot of code reads from the text component's Document. Many of the Swing APIs that actually render text to the screen use Segment.

The character array may appear to always be your entire Document, but that's not guaranteed to be the case; for large documents, for example, you might get only a portion of it. So only ever reach into the [offset, offset + count) range of the array. In practice, TokenMaker.getTokenList(Segment, int, int) will only get called for a single line of code at a time, so this is the largest (and smallest!) amount of code parsed at once.

Two examples of homebrew TokenMakers that read from a Segment, which you could follow, are UnixShellTokenMaker and WindowsBatchTokenMaker.

As for auto-indentation, the simplest way is to override the getShouldIndentNextLineAfter(Token) method in your TokenMaker implementation to return true if the next line should be indented. This is very simplistic logic, with the idea that the "last" Token on a line is sufficient to decide whether to indent the next one, but in practice that's typically all that's necessary. Note there's no support for auto-"outdent" beyond curly braces doing so automatically, if you have overridden TokenMaker.getCurlyBracesDenoteCodeBlocks(). For reference, here is an example of how languages like C and Java have auto-indentation implemented in the library.
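As a rough, self-contained sketch of the decision that method makes (the real override receives a Token object, not a String, and the brace/paren rule below is just the common C-style choice, not the library's mandated behavior):

```java
public class IndentAfterSketch {

    // Mimics what an override of getShouldIndentNextLineAfter(Token) might
    // decide: indent the next line when the last meaningful token on this
    // line is an opening brace or parenthesis. Here "lastToken" stands in
    // for the Token's text.
    static boolean shouldIndentNextLineAfter(String lastToken) {
        return "{".equals(lastToken) || "(".equals(lastToken);
    }

    public static void main(String[] args) {
        System.out.println(shouldIndentNextLineAfter("{")); // prints "true"
        System.out.println(shouldIndentNextLineAfter(";")); // prints "false"
    }
}
```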

If you need more sophisticated auto-indentation logic, you'll need to again take @siggemannen's advice and override getInsertBreakAction() to return your own logic.