jruby / jruby-parser

JRuby's parser customized for IDE usage

StringTerm.parseString doesn't properly handle nested string interpolation with string containing '}'

sgtcoolguy opened this issue

This is a hard one to describe without code, so here goes:

string = "here's some code: #{var = '}'; 1} there"

In this case, if we set up StringTerm to split embedded tokens, it will pick up the first } as the end of the dynamic expression, rather than the last one.
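To make the failure concrete, here is a minimal sketch of a "first } wins" scan. naive_dexpr is a hypothetical helper written for illustration, not the actual StringTerm code; it only assumes the scanner stops at the first } after #{.

```ruby
# Hypothetical helper, not part of jruby-parser: returns the text between
# "#{" and the first "}" that follows it, with no awareness of nested quotes.
def naive_dexpr(source)
  start = source.index('#{')
  return nil if start.nil?

  stop = source.index('}', start + 2)
  return nil if stop.nil?

  source[(start + 2)...stop]
end

# %q keeps Ruby from interpolating the sample line itself.
SAMPLE = %q(string = "here's some code: #{var = '}'; 1} there")

p naive_dexpr(SAMPLE)  # => "var = '"
```

The scan stops at the } inside the quoted '}' and returns a truncated expression, which is the behavior described above.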

My best guess is that we'd read until the end of the string, then unread to the last unescaped '}' and use that as the end - but of course we'd have to worry about multiple dynamic expressions in the same string breaking that.
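A rough sketch of that "last }" heuristic shows both the fix and the caveat. last_brace_dexpr is a hypothetical name, and for brevity the sketch ignores backslash escapes, so it is a simplification of "last unescaped }".

```ruby
# Hypothetical helper: treat the LAST "}" in the literal as the end of the
# expression that starts at the first "#{". Backslash escapes are ignored.
def last_brace_dexpr(source)
  start = source.index('#{')
  return nil if start.nil?

  stop = source.rindex('}')
  return nil if stop.nil? || stop < start + 2

  source[(start + 2)...stop]
end

ONE_EXPR  = %q("here's some code: #{var = '}'; 1} there")
TWO_EXPRS = %q("#{a} and #{b}")

p last_brace_dexpr(ONE_EXPR)   # => "var = '}'; 1"
p last_brace_dexpr(TWO_EXPRS)  # => "a} and \#{b"
```

The single-expression case comes out right, but with two dynamic expressions the heuristic swallows the literal text between them.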

Perhaps a simple stack of quotes would help avoid it: match up pairs of single quotes, double quotes, and /, so that any } we encounter inside them is ignored?
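The stack idea could look roughly like the sketch below. dexpr_end and quote_aware_dexpr are hypothetical names, and the sketch only tracks single quotes, double quotes, and braces; / is left out because telling a regexp from division needs more context than a character scan has.

```ruby
# Hypothetical sketch: find the index of the "}" that closes an expression
# whose body starts at `from`. The stack holds the currently open
# delimiters: "'" or '"' for quotes, "{" for braces in code.
def dexpr_end(source, from)
  stack = []
  i = from
  while i < source.length
    ch = source[i]
    top = stack.last
    if top == "'" || top == '"'
      if ch == '\\'
        i += 1                      # skip the escaped character
      elsif ch == top
        stack.pop                   # closing quote
      elsif top == '"' && source[i, 2] == '#{'
        stack.push('{')             # nested interpolation: back to code
        i += 1
      end
    else
      case ch
      when "'", '"' then stack.push(ch)
      when '{' then stack.push('{')
      when '}'
        return i if stack.empty?    # the true end of the expression
        stack.pop
      end
    end
    i += 1
  end
  nil
end

def quote_aware_dexpr(source)
  start = source.index('#{')
  return nil if start.nil?

  stop = dexpr_end(source, start + 2)
  stop && source[(start + 2)...stop]
end

p quote_aware_dexpr(%q("here's some code: #{var = '}'; 1} there"))  # => "var = '}'; 1"
p quote_aware_dexpr(%q("#{a} and #{b}"))                            # => "a"
```

Because the scan stops at the first balanced }, a second dynamic expression in the same string is left for the next call rather than being swallowed.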

The code differs a bit from JRuby proper (which parses this correctly). Some patches from NetBeans came in related to this and I suspect those patches broke balancing non-escaped chars. parseDExprIntoBuffer() in particular does not even exist in JRuby's version. So I predict a bug in that :)

After looking at this for a few more minutes, I think the issue is that parseDExprIntoBuffer() sucks. It does not consider additional nested strings and blindly picks up the first } it finds. Your simple stack is sort of what StringTerm is supposed to do, but there was additional logic to grab the DStr directly for EMBEDED_DEXPR.

Looking at a comment above that var:
// When StringTerm processes a string with an embedded code fragment (or variable),
// such as #{thiscode()}, it splits the string up at the beginning of the boundary
// and returns Tokens.tSTRING_DBEG or Tokens.tSTRING_DVAR. However, it doesn't
// split the string up where the embedded code ends, it just processes to the end.
// For my lexing purposes that's not good enough; I want to know where the embedded
// fragment ends (so I can lex that String as real Ruby code rather than just
// a String literal).

I am guessing this was meant as a way of getting just the dynamic expression out of the string so it can be parsed by itself? Probably an optimization, so that when editing a dynamic expression you only have to parse its immediate contents instead of the entire file? Sounds like a good feature if I am understanding this correctly.

It seems like this requires either a) recursion or b) simple stacks. So I agree with your idea. I just wish this code was a lot cleaner....

Yeah, I assumed this came from NetBeans to handle syntax coloring, much as I had to hack around it myself. If we don't split the embedded code, we get the start of the dynamic var region as a token, and then everything from there to the end of the string as a single string content token.

I had to write up some hackneyed code to take that string and try to break it apart to get the expression versus the actual string content, as we need to grab the expression and recursively partition/lex it to colorize as normal Ruby code.

(To quickly summarize Eclipse's syntax coloring model, we break the code up into large partitions: code, comment, regexp, string, etc.; these partitions must be non-overlapping. Then for each partition we tokenize to get individual token colors/types. When editing, only the current partition's bounds are typically re-tokenized to colorize again.)
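As a toy illustration of that model (the real Eclipse API is Java and looks different), the sketch below uses a hypothetical Partition struct and hand-written offsets to show how an edit maps to the one partition that needs re-tokenizing.

```ruby
# Hypothetical types for illustration only; not Eclipse's actual API.
Partition = Struct.new(:type, :range)

# Partitions are non-overlapping, so at most one covers a given offset.
def partition_at(partitions, offset)
  partitions.find { |part| part.range.cover?(offset) }
end

SOURCE = %q(x = "hi" # note)
PARTITIONS = [
  Partition.new(:code,    0...4),
  Partition.new(:string,  4...8),
  Partition.new(:code,    8...9),
  Partition.new(:comment, 9...15)
].freeze

# An edit at offset 6 falls inside the string literal, so only that slice
# of the document would be re-tokenized.
dirty = partition_at(PARTITIONS, 6)
p dirty.type            # => :string
p SOURCE[dirty.range]   # => "\"hi\""
```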

I removed my hacks in favor of the hacks inside StringTerm for this, as they were more correct :)

But yeah, in a perfect world, it seems to me like we'd need to recursively lex/parse the dynamic expression to be able to properly determine the true end brace. I don't have any great ideas on how this would be done off-hand though.
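For comparison, here is a sketch of what delegating to a real lexer yields. It uses Ripper, the lexer bundled with MRI Ruby, purely as a stand-in; it is not what jruby-parser or JRuby uses internally, and embedded_expressions is a hypothetical helper name.

```ruby
require 'ripper'

# Collects the source text of each top-level #{...} expression by tracking
# the lexer's own embexpr begin/end tokens instead of scanning for "}".
def embedded_expressions(source)
  results = []
  depth = 0
  buffer = nil
  Ripper.lex(source).each do |_pos, type, text|
    case type
    when :on_embexpr_beg
      if depth.zero?
        buffer = +''          # start of a top-level expression
      else
        buffer << text        # nested "#{" belongs to the outer expression
      end
      depth += 1
    when :on_embexpr_end
      depth -= 1
      if depth.zero?
        results << buffer     # the lexer has found the true end brace
        buffer = nil
      else
        buffer << text
      end
    else
      buffer << text if depth.positive?
    end
  end
  results
end

p embedded_expressions(%q(string = "here's some code: #{var = '}'; 1} there"))
p embedded_expressions(%q("#{a} and #{b}"))
```

Since the lexer already knows which } tokens close an interpolation and which are string content or hash braces, no brace guessing is needed; the cost is running a full lex over the literal.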

(In reply to enebo's comment above, written Thu, Sep 1, 2011 at 2:05 PM.)

This seems to work for me - guess it's been fixed over time. I would close unless it's still broken for someone.

Closing - please reopen if you can demonstrate the problem, but it's been 2 years and seems to work for me.