ipython / ipython

Official repository for IPython itself. Other repos in the IPython organization contain things like the website, documentation builds, etc.

Home Page: https://ipython.readthedocs.org


Bug in Tokenizer/Automatic Parenthesization for Python 3.12

zacharyrs opened this issue · comments

Hey there!

I've discovered a bug in the tokenizer's handling of automatic parenthesization involving the forward-slash (`/`) operator.

Specifically, the following will result in an error when run in IPython 8.25.0 on Python 3.12.3:

1| from pathlib import Path
2| 
3| (
4|     Path(".")
5|     / f")"
6|     / "a a a a a a a a a"
7| )

Interestingly, the issue is mitigated if the f-string on line 5:

  • is removed or replaced by a plain string
  • has any character after the closing parenthesis
  • starts with an opening parenthesis (note that if anything precedes it, including a space, it still fails)

From a little digging, the tokenizer starts a new line (in tokens_by_line) when it encounters the newline at the end of the f-string.
It looks like the closing parenthesis in the f-string becomes an FSTRING_MIDDLE token, which decrements parenlev.
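A minimal sketch of the mechanism (my own repro, assuming Python 3.12+, where PEP 701 changed f-string tokenization): the literal `)` inside the f-string surfaces as an FSTRING_MIDDLE token whose `.string` is `')'`, so any check that only compares `token.string` against closing brackets will mistake it for a real closer.

```python
import io
import tokenize

# Tokenize an f-string whose body is a literal closing parenthesis and flag
# which tokens a naive string comparison would treat as a bracket closer.
src = 'f")"\n'
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    looks_like_closer = tok.string in {")", "]", "}"}
    print(tokenize.tok_name[tok.type], repr(tok.string), looks_like_closer)
```

On Python 3.12 this prints an FSTRING_MIDDLE row flagged True; on 3.11 the whole literal is a single STRING token and nothing matches.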

See here

def make_tokens_by_line(lines: List[str]):
    """Tokenize a series of lines and group tokens by line.

    The tokens for a multiline Python string or expression are grouped as one
    line. All lines except the last line should keep their line ending ('\\n',
    '\\r\\n') for this to work properly. Use `.splitlines(keepends=True)`
    for example when passing a block of text to this function.
    """
    # NL tokens are used inside multiline expressions, but also after blank
    # lines or comments. This is intentional - see https://bugs.python.org/issue17061
    # We want to group the former case together but split the latter, so we
    # track parentheses level, similar to the internals of tokenize.

    # reexported from token on 3.7+
    NEWLINE, NL = tokenize.NEWLINE, tokenize.NL  # type: ignore
    tokens_by_line: List[List[Any]] = [[]]
    if len(lines) > 1 and not lines[0].endswith(("\n", "\r", "\r\n", "\x0b", "\x0c")):
        warnings.warn(
            "`make_tokens_by_line` received a list of lines which do not have lineending markers ('\\n', '\\r', '\\r\\n', '\\x0b', '\\x0c'), behavior will be unspecified",
            stacklevel=2,
        )
    parenlev = 0
    try:
        for token in tokenutil.generate_tokens_catch_errors(
            iter(lines).__next__, extra_errors_to_catch=["expected EOF"]
        ):
            tokens_by_line[-1].append(token)
            if (token.type == NEWLINE) \
                    or ((token.type == NL) and (parenlev <= 0)):
                tokens_by_line.append([])
            elif token.string in {'(', '[', '{'}:
                parenlev += 1
            elif token.string in {')', ']', '}'}:
                if parenlev > 0:
                    parenlev -= 1
    except tokenize.TokenError:
        # Input ended in a multiline string or expression. That's OK for us.
        pass

    if not tokens_by_line[-1]:
        tokens_by_line.pop()

    return tokens_by_line
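One possible direction for a fix (a sketch under my own assumptions, not the maintainers' solution): only let OP tokens adjust parenlev, so bracket characters appearing inside f-string text can never change the level. The helper name below is hypothetical.

```python
import io
import tokenize

def bracket_levels(src: str):
    """Track parenthesis nesting per token, counting only real OP tokens
    so that bracket characters inside f-string bodies are ignored."""
    parenlev = 0
    levels = []
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        # Restrict the bookkeeping to OP tokens: FSTRING_MIDDLE (and plain
        # STRING on older Pythons) carrying ')' no longer decrements the level.
        if tok.type == tokenize.OP and tok.string in {"(", "[", "{"}:
            parenlev += 1
        elif tok.type == tokenize.OP and tok.string in {")", "]", "}"}:
            if parenlev > 0:
                parenlev -= 1
        levels.append((tok.string, parenlev))
    return levels

# The level stays balanced even with f")" inside the parentheses:
print(bracket_levels('(\n    f")"\n)\n'))
```

Because the type check excludes the f-string tokens, this behaves the same on 3.11 and 3.12: the level rises to 1 at the opening paren and returns to 0 only at the real closing paren.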

This issue does not occur on an older version of Python (e.g., 3.11.x), even when running the latest version of IPython.

Thanks for the report, I'll see what I can do.