Reading input from the input range (or file)

Question

p-mitana opened this issue 6 years ago · comments

I am trying to work with big files (SQL files of about 9 MB). I have a grammar that defines a single SQL instruction (sort of). I would like to parse the instructions from the input file one by one and avoid reading the entire file into memory with readText, but it seems this is currently impossible.

Bastiaan Veelo · Answer 1 · Wed Nov 14 2018 19:54:33 GMT+0800 (China Standard Time)

If you can split your input into instructions outside of the grammar, you can read your file however you like and let your parser parse each instruction individually.

P. Mitana · Answer 2 · Wed Nov 14 2018 20:42:41 GMT+0800 (China Standard Time)

If. However, with SQL I can't reasonably do that, at least not unless I write a separate lexer that splits instructions on semicolons that are not part of strings.
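A minimal sketch of what such a separate splitter could look like, written in Python for illustration only. It is not part of Pegged; the name `iter_statements` and the `chunk_size` parameter are made up for this example. It assumes standard SQL single-quoted strings (with `''` as an escaped quote) and ignores comments and dialect-specific quoting:

```python
import io

def iter_statements(stream, chunk_size=4096):
    """Yield SQL statements one at a time from a text stream.

    Splits on ';' outside single-quoted strings. A doubled quote ('')
    inside a string toggles the flag twice, so it is handled implicitly.
    Comments and other quoting styles are NOT handled (assumption).
    """
    buf = []           # characters of the statement being collected
    in_string = False  # True while inside a '...' literal
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        for ch in chunk:
            buf.append(ch)
            if ch == "'":
                in_string = not in_string
            elif ch == ";" and not in_string:
                yield "".join(buf).strip()
                buf = []
    # Whatever is left after the last semicolon (if anything)
    tail = "".join(buf).strip()
    if tail:
        yield tail
```

Only one statement plus one chunk is held in memory at a time, and each yielded statement can be handed to the grammar separately.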

As parsing does not require the entire input at once (it looks char by char anyway), I believe that reading from an input range is an important feature for a parsing library.

Bastiaan Veelo · Answer 3 · Wed Nov 14 2018 20:52:05 GMT+0800 (China Standard Time)

> As parsing does not require the entire input at once

But it does. A rule can only succeed once all its sub-rules succeed. The top rule cannot succeed before the entire input has been read.

Bastiaan Veelo · Answer 4 · Wed Nov 14 2018 21:07:15 GMT+0800 (China Standard Time)

I don't remember what SQL looks like, but if it is basically a list of instructions and the parser does not need to do much backtracking, you may be able to define your grammar in a way that input after the first instruction is discarded (Instruction .* eoi). Then you may be able to read a portion of your file that is guaranteed to be large enough for any instruction, parse that, then advance your moving window buffer by the parsed input length. This way you will process your file instruction by instruction.
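The moving-window idea can be sketched roughly as follows, in Python for illustration. The regular expression is only a stand-in for a generated parser; `parse_prefix`, `iter_instructions` and the `window` parameter are hypothetical names, not Pegged API. Instead of guessing a window that is "large enough", this variant reads more input whenever the prefix parse fails:

```python
import io
import re

# Stand-in for a real grammar: one instruction is anything up to the first
# ';' that is not inside a single-quoted string.
_INSTRUCTION = re.compile(r"\s*(?:[^;']|'(?:[^']|'')*')*;")

def parse_prefix(text):
    """Return the length of the first instruction in text, or None."""
    m = _INSTRUCTION.match(text)
    return m.end() if m else None

def iter_instructions(stream, window=64):
    """Yield instructions from stream using a moving window buffer."""
    buf = ""
    eof = False
    while True:
        consumed = parse_prefix(buf)
        if consumed is not None:
            yield buf[:consumed].strip()
            buf = buf[consumed:]    # advance the window past the parsed input
            continue
        if eof:
            break
        more = stream.read(window)  # buffer too short (or empty): read more
        if more:
            buf += more
        else:
            eof = True
    if buf.strip():
        raise ValueError("trailing input is not a complete instruction")
```

Note the limitation: a failed parse is treated as "need more input" until the end of the file, so a genuine syntax error is only reported once everything after it has been buffered.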

P. Mitana · Answer 5 · Wed Nov 14 2018 21:07:52 GMT+0800 (China Standard Time)

It depends.

If I had a rule that parses the entire SQL file at once, then yes: it will succeed only if it reads all the instructions and EOI.

However, I can have a rule that does not end with EOI, such as an SQL instruction. It can succeed multiple times along one input, and it actually does. When I parse a long string, for example:

SELECT * FROM table1;
SELECT * FROM table2;

it will succeed and parse only the first instruction. After reading the first semicolon, the SQLInstruction rule will succeed, and so will all its sub-rules. Then I can cut the input off at the ParseTree's end property and parse again.

As the parser iterates over the string's characters until the root rule either succeeds or fails, without looking further than it needs, it could read the characters from a range only as it needs them. The only concern is the lookahead feature, but in this case a ForwardRange requirement and saving the range on lookahead could do the trick.
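To illustrate the "save the range on lookahead" idea, here is a small Python sketch of a character source that reads from a stream on demand and supports save/restore. It is only an analogy to a ForwardRange's `save`; the class `BufferedInput` and the helper `match_literal` are invented for this example and say nothing about how Pegged is implemented:

```python
import io

class BufferedInput:
    """Character source over a stream with save/restore for lookahead.

    Only characters read since the last commit() are kept in memory.
    Marks returned by save() become invalid after commit().
    """
    def __init__(self, stream):
        self._stream = stream
        self._buf = ""  # characters read but not yet discarded
        self._pos = 0   # current index into _buf

    def peek(self):
        if self._pos == len(self._buf):
            self._buf += self._stream.read(1)  # appends "" at end of input
        return self._buf[self._pos] if self._pos < len(self._buf) else ""

    def next(self):
        ch = self.peek()
        if ch:
            self._pos += 1
        return ch

    def save(self):
        return self._pos

    def restore(self, mark):
        self._pos = mark

    def commit(self):
        """Discard everything before the current position."""
        self._buf = self._buf[self._pos:]
        self._pos = 0

def match_literal(inp, literal):
    """Try to consume literal; on failure rewind to where we started."""
    mark = inp.save()
    for expected in literal:
        if inp.next() != expected:
            inp.restore(mark)
            return False
    return True
```

A rule that fails rewinds to its mark, and once a top-level instruction has succeeded, `commit()` lets the buffer forget the consumed input, so memory use is bounded by the longest instruction rather than by the file size.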

P. Mitana · Answer 6 · Wed Nov 14 2018 21:12:03 GMT+0800 (China Standard Time)

> Then you may be able to read a portion of your file that is guaranteed to be large enough for any instruction

Yes, I can do this of course. But I believe it is an overcomplication, as I need to either make an assumption about how long an instruction will be, or make several parsing attempts if the instruction is longer than expected, or preparse the file and split the instructions from each other. Having the parser library read my data from a range instead of a string would remove this need entirely.

Bastiaan Veelo · Answer 7 · Wed Nov 14 2018 21:18:17 GMT+0800 (China Standard Time)

I see. I don't see an easy way to do this, though.

Bastiaan Veelo · Answer 8 · Wed Nov 14 2018 21:35:00 GMT+0800 (China Standard Time)

Do you know iopipe? https://www.youtube.com/watch?v=9fzttyj4JCs (I have no personal experience with it, though). If you get a parse error because the instruction is longer than your buffer, you could increase the buffer size and retry.

P. Mitana · Answer 9 · Thu Nov 15 2018 03:25:37 GMT+0800 (China Standard Time)

I haven't heard of it yet. May be worth trying someday.

In the case of these SQL files, I will probably have to tackle the problem in a very different way, as it turned out that parsing them (in the future possibly many times bigger than now) may consume too much memory.

Anyway, thank you for the help, and I hope that this issue will make its way into pegged sometime :)

Denis Feklushkin · Answer 10 · Fri Jun 14 2024 03:46:38 GMT+0800 (China Standard Time)

> If you can split your input into instructions outside of the grammar, you can read your file however you like and let your parser parse each instruction individually.

In this case, line numbering in error messages will be broken.

Hi from 2024 :-)
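One possible workaround for the line-numbering problem, sketched in Python under the assumption that the chunks are contiguous, unmodified pieces of the file: remember where each chunk starts and translate chunk-relative offsets back to file positions. The names `with_start_positions` and `locate` are hypothetical helpers, not part of any library:

```python
def with_start_positions(chunks):
    """Pair each chunk with the 1-based (line, column) where it starts."""
    line, col = 1, 1
    for chunk in chunks:
        yield (line, col), chunk
        newlines = chunk.count("\n")
        if newlines:
            line += newlines
            # column just past the last character after the final newline
            col = len(chunk) - chunk.rfind("\n")
        else:
            col += len(chunk)

def locate(start, chunk, offset):
    """Translate an offset inside chunk into a (line, column) in the file."""
    line, col = start
    before = chunk[:offset]
    newlines = before.count("\n")
    if newlines:
        return line + newlines, offset - before.rfind("\n")
    return line, col + offset
```

An error reported at some offset within a chunk can then be rewritten with `locate` before it is shown to the user. If the splitter strips or alters whitespace, the positions would drift, so the chunks must be kept exactly as read.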