Generated parser still parsing whitespaces although they are disabled
Nafaryus27 opened this issue
I have a grammar in which I needed to disable whitespace parsing, so I put @@WHITESPACE::None in it.
When using the parser with:
parser = tatsu.compile(grammar)
ast = parser.parse(text)
It behaves as expected, not parsing whitespaces and letting the rules I defined do their job.
Next, I tried generating the parser code, first running python -m tatsu grammar.ebnf --outfile parser.py, then importing parser.py and using the generated MyParser class:
from parser import MyParser
parser = MyParser()
ast = parser.parse(text)
However, this gave me an error where it could not match a rule (something like rule= " some string "; that is, a basic string rule containing whitespace).
I noticed that in the generated parser code, where it creates the ParserConfig (in both the Buffer and Parser classes), it looks like this:
config = ParserConfig.new(
    config,
    owner=self,
    whitespace=None,
    nameguard=None,
    ignorecase=False,
    namechars='',
    parseinfo=False,
    comments_re=None,
    eol_comments_re=None,
    keywords=KEYWORDS,
    start='start',
)
Changing whitespace=None to whitespace='' (in both classes) fixes the issue: the parser then behaves as expected and does not skip whitespace.
However, I don't want to go into the generated code and modify this each time I regenerate the parser, so I searched a bit and found that there is a --whitespace parameter that can be passed to tatsu on the command line. But even if I pass --whitespace '', it still puts None in the generated code.
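As a stopgap, the manual edit to the generated file could be scripted and run after each regeneration. The following is only a sketch: the function name patch_generated_source is hypothetical, and it assumes the generated code contains the argument exactly as whitespace=None, on its own line, as in the excerpt quoted in this issue.

```python
import re

# Matches a line consisting of optional indentation followed by "whitespace=None,".
# Assumption: the generated parser writes the argument on its own line.
_WHITESPACE_NONE = re.compile(r"^([ \t]*)whitespace=None,", flags=re.MULTILINE)


def patch_generated_source(source: str) -> str:
    """Return the source with every `whitespace=None,` argument replaced by `whitespace='',`."""
    return _WHITESPACE_NONE.sub(r"\1whitespace='',", source)


# Demonstration on a fragment shaped like the generated configuration call.
sample = (
    "config = ParserConfig.new(\n"
    "    config,\n"
    "    whitespace=None,\n"
    "    nameguard=None,\n"
    ")\n"
)
patched = patch_generated_source(sample)
print(patched)
```

To apply it to a real file, read parser.py, pass its text through patch_generated_source, and write the result back. Other arguments such as nameguard=None are left untouched.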
I also found that passing the whitespace='' argument when constructing the parser solves the issue:
from parser import MyParser
parser = MyParser(whitespace='')
ast = parser.parse(text)
However, I consider this only a temporary solution, as I would prefer to keep everything regarding the parser config and rules in the grammar. Also, since the grammar already has the @@whitespace directive, one would expect it to work no matter how the parser is used.
Please post a minimal grammar to test this?
Please also provide the version of TatSu you're using?
Also, have you tried using this in the grammar?
@@whitespace :: ''
This problem is probably caused by the configuration protocol (ParserConfig) treating None as an absent value, and not as the desired value.
A possible solution may be to make ParserConfig.whitespace = '', so no whitespace processing is done by default. It may be useful to disallow @@whitespace :: None to avoid confusion.
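To illustrate the behavior described above, here is a minimal sketch of a configuration merge in which None means "not specified". SketchConfig and its merged method are hypothetical and are not TatSu's ParserConfig; the default whitespace pattern is also an assumption made for the example.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for a parser configuration (not TatSu's ParserConfig)."""

    whitespace: Optional[str] = r"\s+"  # assumed default: skip runs of whitespace
    nameguard: Optional[bool] = None

    def merged(self, **overrides):
        # None is treated as "no value given", so it never replaces a setting.
        given = {key: value for key, value in overrides.items() if value is not None}
        return replace(self, **given)


base = SketchConfig()
with_none = base.merged(whitespace=None)  # the default pattern survives
with_empty = base.merged(whitespace="")   # the empty string overrides it
print(repr(with_none.whitespace), repr(with_empty.whitespace))
```

Under such a protocol, whitespace=None cannot express "disable whitespace skipping", while whitespace='' can, which matches the workaround reported in this issue.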
I'm using version 5.12
I tried @@whitespace::'' but it cannot work as it's not a regexp, and even with // (for an empty regexp) it did not work either.
You can use this, for example:
example.txt:
This is a test
grammar.ebnf:
@@whitespace::None
start = "This is a" test;
test = " test";
When using:
import tatsu
with open("example.txt", "r") as f:
text = f.read()
with open("grammar.ebnf", "r") as f:
grammar = f.read()
parser = tatsu.compile(grammar)
ast = parser.parse(text)
print(ast)
It gives the correct result :
('This is a', ' test')
However, when using the generated parser (python3 -m tatsu grammar.ebnf --outfile parser.py), running python parser.py example.txt gives this error (full error on pastebin):
tatsu.exceptions.FailedToken: example.txt(1:11) expecting ' test' :
This is a test
^
test
start
Which shows clearly that the parser skipped over the whitespace before "test", although it was not supposed to.
I also found that this behavior might already be known, since parser_semantics.py contains:
def grammar(self, ast, *args):
    directives = {d.name: d.value for d in flatten(ast.directives)}
    keywords = list(flatten(ast.keywords)) or []
    if directives.get('whitespace') in {'None', 'False'}:
        # NOTE: use '' because None will _not_ override defaults in configuration
        directives['whitespace'] = ''
I guess this is why there is no issue when using the parser with tatsu.compile(...).
So maybe the parser code generator could do something similar, or '' could be allowed as a possible value for @@whitespace::
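A rough sketch of what such a normalization step could look like before the code is emitted. Both function names are hypothetical, the directives are assumed to be a plain dict of strings, and this is not how TatSu's code generator is actually structured.

```python
def normalize_whitespace_directive(directives):
    """Return a copy of the directives with 'None'/'False' whitespace mapped to ''."""
    result = dict(directives)
    if result.get("whitespace") in {"None", "False"}:
        # Same idea as the quoted grammar() code: '' overrides the default, None does not.
        result["whitespace"] = ""
    return result


def render_whitespace_argument(directives):
    """Render the keyword argument line a generator might emit (hypothetical)."""
    value = normalize_whitespace_directive(directives).get("whitespace")
    # repr() makes the empty string appear as '' in the emitted source.
    return f"whitespace={value!r},"


print(render_whitespace_argument({"whitespace": "None"}))
print(render_whitespace_argument({}))
```

With this step in place, a grammar declaring the directive as None would produce an explicit empty string in the generated call, while a grammar without the directive would still produce None and keep the default.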
Another solution would be to set @@whitespace:: to an unused character (like '␟' or some other unusual Unicode character), but that's not very elegant...
I'll solve this on my next pass over TatSu.
If there's a pull request (that includes a unit test) before that, I'll merge it.
Thanks !