PhilippeSigaud / Pegged

A Parsing Expression Grammar (PEG) module, using the D programming language.

Reading input from the input range (or file)

p-mitana opened this issue · comments

I am trying to work with big files (SQL files ~9MB in size). I have a grammar that defines a single SQL instruction (sort of). I would like to parse the instructions from the input file one by one and avoid reading the entire file into memory with readText, but it seems that this is currently impossible.

If you can split your input into instructions outside of the grammar, you can read your file however you like and let your parser parse each instruction individually.

If. However, with SQL I can't reasonably do that, at least not without writing a separate lexer that splits instructions on semicolons that are not part of strings.

As parsing does not require the entire input at once (it looks char by char anyway), I believe that reading from an input range is an important feature for a parsing library.

As parsing does not require the entire input at once

But it does. A rule can only succeed once all its sub-rules succeed. The top rule cannot succeed before the entire input has been read.

I don't remember what SQL looks like, but if it is basically a list of instructions and the parser does not need to do much backtracking, you may be able to define your grammar in a way that input after the first instruction is discarded (Instruction .* eoi). Then you may be able to read a portion of your file that is guaranteed to be large enough for any instruction, parse that, then advance your moving window buffer by the parsed input length. This way you will process your file instruction by instruction.

It depends.

If I had a rule that parses the entire SQL file at once, then yes, it will succeed only if it reads all the instructions and EOI.

However, I can have a rule that does not end with EOI, such as a single SQL instruction. It can succeed multiple times along one input, and it actually does. When I parse a long string, for example:

SELECT * FROM table1;
SELECT * FROM table2;

it will succeed and parse only the first instruction. After reading the first semicolon the SQLInstruction rule will succeed and all its sub-rules will as well. Then I can cut the input off at the ParseTree's end property and parse again.
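The parse, cut, and parse-again loop described here can be sketched as follows. This is a rough illustration in Python, not Pegged code: `parse_instruction` is a hypothetical stand-in for the SQLInstruction rule that only looks for the first semicolon outside a single-quoted string, and the index it returns plays the role of the ParseTree's end property.

```python
def parse_instruction(text, start=0):
    """Hypothetical stand-in for a prefix-matching rule.

    Returns the index just past the first semicolon that is not inside a
    single-quoted string, or None if no complete instruction is found.
    """
    in_string = False
    for i in range(start, len(text)):
        ch = text[i]
        if ch == "'":
            # A doubled quote ('') toggles twice, so it stays inside the string.
            in_string = not in_string
        elif ch == ";" and not in_string:
            return i + 1
    return None


def split_instructions(text):
    """Repeatedly match one instruction, then resume right after its end."""
    pos = 0
    instructions = []
    while True:
        end = parse_instruction(text, pos)
        if end is None:
            break
        instructions.append(text[pos:end].strip())
        pos = end  # continue from where the previous match stopped
    return instructions
```

Each match only consumes as much input as the rule needs, which is the property the rest of the argument relies on.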

As the parser iterates over the string's characters until the root rule either succeeds or fails, without looking further than it needs, it can read the characters from a range as it needs them. The only concern is the lookahead feature, but in this case a ForwardRange requirement and saving the range on lookahead could do the trick.

Then you may be able to read a portion of your file that is guaranteed to be large enough for any instruction

Yes, I can do this of course. But I believe it is an overcomplication, as I need to either make an assumption about how long an instruction will be, or make several parsing attempts if the instruction is longer than expected, or preparse the file and split the instructions from each other. Having the parser library read my data from a range instead of a string would remove this need entirely.

I see. I don't see an easy way to do this, though.

Do you know iopipe? https://www.youtube.com/watch?v=9fzttyj4JCs (I have no personal experience with it, though). If you get a parse error because the instruction is longer than your buffer, you could increase the buffer size and retry.
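A minimal sketch of the moving window with grow-and-retry, assuming a text stream with a `read(n)` method. It is written in Python for illustration and does not use iopipe or Pegged; `find_instruction_end` is a hypothetical stand-in for the grammar, and the window size is an arbitrary choice.

```python
import io


def find_instruction_end(buf):
    """Hypothetical stand-in for the grammar: index just past the first
    semicolon outside single quotes, or None if buf has no complete instruction."""
    in_string = False
    for i, ch in enumerate(buf):
        if ch == "'":
            in_string = not in_string
        elif ch == ";" and not in_string:
            return i + 1
    return None


def iter_instructions(stream, window=4096):
    """Yield instructions one by one without loading the whole stream."""
    buf = ""
    size = window
    eof = False
    while True:
        # Top up the buffer to the current window size.
        while not eof and len(buf) < size:
            chunk = stream.read(size - len(buf))
            if chunk == "":
                eof = True
            else:
                buf += chunk
        end = find_instruction_end(buf)
        if end is not None:
            yield buf[:end].strip()
            buf = buf[end:]   # advance the window by the parsed length
            size = window     # shrink back to the default window
        elif eof:
            if buf.strip():
                raise ValueError("incomplete instruction at end of input")
            return
        else:
            size *= 2         # instruction longer than the window: grow and retry
```

For example, `list(iter_instructions(io.StringIO(sql_text), window=8))` walks the input eight characters at a time and doubles the window whenever an instruction does not fit. The cost is that a long instruction is re-scanned after each growth step, which is the repeated parsing attempt mentioned earlier in the thread.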

I haven't heard of it yet. It may be worth trying someday.

In the case of these SQL files, I will probably have to tackle the problem in a very different way, as it turned out that parsing them (in the future possibly many times bigger than now) may consume too much memory.

Anyway, thank you for the help, and I hope that this issue will make its way into Pegged sometime :)

If you can split your input into instructions outside of the grammar, you can read your file however you like and let your parser parse each instruction individually.

In this case, line numbering in error messages will be broken
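One possible workaround, sketched here in Python under the assumption that the caller knows where each chunk started in the original file: keep the chunk's starting line and column, and translate positions reported inside the chunk back to file positions. The helper names `line_col` and `to_global` are hypothetical and not part of Pegged.

```python
def line_col(text, offset):
    """1-based (line, column) of a character offset within text."""
    line = text.count("\n", 0, offset) + 1
    last_newline = text.rfind("\n", 0, offset)  # -1 when on the first line
    return line, offset - last_newline


def to_global(base_line, base_col, chunk, local_offset):
    """Map an offset inside a chunk to a (line, column) in the whole file.

    base_line and base_col give the file position of the chunk's first character.
    """
    line, col = line_col(chunk, local_offset)
    if line == 1:
        # Still on the chunk's first line: shift the column by the chunk's start.
        return base_line, base_col + col - 1
    return base_line + line - 1, col
```

The error message itself would still have to be rewritten by the caller, since the parser only sees the chunk.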

Hi from 2024 :-)