halaxa / json-machine

Efficient, easy-to-use, and fast PHP JSON stream parser

Performance improvements (lexer/parser)

fcaps opened this issue · comments

commented

Hi guys,

the memory usage is awesome, but the CPU time is ~100x that of json_decode (a 100 MB JSON file with 10,000 entries).
Did you consider using a C extension for the tokenizing/parsing?
I have never written an extension, but it looks like we could extend ext-json,
or even just use ext-parle for the heavy lifting.

If you think this is a good idea, I could try to implement a lexer with ext-parle, see how the performance changes, and then implement a parser.

Greetings
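To make the overhead concrete, here is a rough, self-contained illustration (the 100 MB / 10,000-entry figure above is the reporter's; this uses a smaller in-memory document, and the exact timings will vary by machine). Even an empty per-byte loop in PHP userland tends to cost more than ext-json's complete parse, which is why a C-level lexer/parser looks attractive:

```php
<?php
// Sketch: compare ext-json's full parse against a bare PHP byte loop that
// does no work at all. Any userland tokenizer must at least touch every
// byte, so this loop is a lower bound on its cost.
$entries = [];
for ($i = 0; $i < 10000; $i++) {
    $entries[] = ['id' => $i, 'name' => "item-$i"];
}
$json = json_encode($entries);

$t0 = microtime(true);
$decoded = json_decode($json, true);
$tDecode = microtime(true) - $t0;

$t0 = microtime(true);
$len = strlen($json);
for ($p = 0, $chars = 0; $p < $len; $p++) {
    $chars++; // do nothing but visit every byte
}
$tLoop = microtime(true) - $t0;

printf("json_decode: %.4fs, bare byte loop: %.4fs\n", $tDecode, $tLoop);
```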

Hi,

I was considering that and looked into it a little, but I do not have the knowledge. I was fiddling with it in the zephir branch. Learning to write extensions in pure C would be fun and appeals to me, but I do not have the time right now. I am open to it if anyone else does.

I was looking into Parle earlier. I am worried about the lack of documentation. Is it even able to consume the source language (JSON) in chunks, iteratively? If it is, feel free to write a prototype. A lexer producing JSON tokens that is interchangeable with JsonMachine\Lexer might be a good start.
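As a hedged sketch of what such a drop-in lexer could look like (a generator consuming JSON chunks and yielding token strings; the regex and the buffering rule are illustrative, not JsonMachine's actual internals or Parle code):

```php
<?php
// Hypothetical chunk-fed JSON lexer. A bare scalar that ends exactly at the
// buffer edge may continue in the next chunk (e.g. "tr" + "ue"), so it is
// held back; strings and structural characters are safe to emit immediately.
function tokens(iterable $chunks): \Generator
{
    $buffer = '';
    $re = '/\A\s*([{}\[\],:]|"(?:[^"\\\\]|\\\\.)*"|[^\s{}\[\],:"]+)/';
    $structural = ['{', '}', '[', ']', ',', ':'];
    foreach ($chunks as $chunk) {
        $buffer .= $chunk;
        while (preg_match($re, $buffer, $m)) {
            $token = $m[1];
            $isScalar = $token[0] !== '"' && !in_array($token, $structural, true);
            if ($isScalar && strlen($m[0]) === strlen($buffer)) {
                break; // possibly incomplete; wait for the next chunk
            }
            yield $token;
            $buffer = substr($buffer, strlen($m[0]));
        }
    }
    // End of stream: whatever is still buffered is complete by definition.
    while (preg_match($re, $buffer, $m)) {
        yield $m[1];
        $buffer = substr($buffer, strlen($m[0]));
    }
}

foreach (tokens(['{"a": [1, tr', 'ue]}']) as $token) {
    echo $token, "\n"; // one token per line: { "a" : [ 1 , true ] }
}
```

An unterminated string at the buffer edge simply fails to match and stays buffered until the closing quote arrives, so strings spanning chunk boundaries survive as well.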

commented

Sorry, I was busy^^
I had a deep look into Parle. There is no documentation for the PHP extension, but there is some for the underlying C++ implementation, lexertl.
The first working lexer built with Parle was terrible: about 2x slower than the pure PHP implementation.
The second lexer was "state-aware" and much faster, but at that stage it is almost a parser.

Current State:

  • The prototype works (lexer only) but is slow and not easy to debug
  • The implementation works with chunks/streams: it keeps a local buffer if something looks incomplete (and just waits for the next chunk)
  • The lexer is "state-aware"

Open Questions:

  • How do we "select" a lexer/parser in JsonMachine in the future, given that the parser/lexer is currently hardcoded?
  • Is there a simpler option to improve performance? Maybe an "ArrayItemMatcher" that just handles chunks and returns raw JSON values: objects/arrays go to json_decode, while strings/numbers/null/true/false are handled directly (or with hacks).
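One possible reading of that "ArrayItemMatcher" idea, sketched here under assumptions (the name and the logic are illustrative, not library code): instead of tokenizing everything in PHP, track bracket depth across chunks, cut out each top-level array element as a raw JSON string, and let ext-json do the heavy decoding.

```php
<?php
// Hypothetical "ArrayItemMatcher": splits a streamed top-level JSON array
// into its elements by tracking {}/[] depth, while a string-literal flag
// keeps brackets and commas inside strings from being miscounted.
function arrayItems(iterable $chunks): \Generator
{
    $depth = 0;        // nesting depth of {} and []
    $inString = false; // inside a JSON string literal?
    $escaped = false;  // previous character was a backslash?
    $item = '';
    foreach ($chunks as $chunk) {
        $len = strlen($chunk);
        for ($i = 0; $i < $len; $i++) {
            $c = $chunk[$i];
            if ($inString) {
                $item .= $c;
                if ($escaped)        { $escaped = false; }
                elseif ($c === '\\') { $escaped = true; }
                elseif ($c === '"')  { $inString = false; }
                continue;
            }
            switch ($c) {
                case '"':
                    $inString = true;
                    $item .= $c;
                    break;
                case '{': case '[':
                    if (++$depth > 1) { $item .= $c; } // skip the outermost '['
                    break;
                case '}': case ']':
                    if (--$depth >= 1) { $item .= $c; }
                    elseif ($item !== '') { yield json_decode($item, true); $item = ''; }
                    break;
                case ',':
                    if ($depth === 1) { yield json_decode($item, true); $item = ''; }
                    else { $item .= $c; }
                    break;
                default:
                    $item .= $c;
            }
        }
    }
}

$chunks = ['[{"id":1,"name":"a"},{"id"', ':2,"name":"b,]"}]'];
foreach (arrayItems($chunks) as $row) {
    echo $row['id'], "\n"; // prints 1 then 2
}
```

The per-byte loop is still PHP, so this is not free; the bet is that skipping full tokenization and delegating each element to json_decode wins back enough time.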

Next Steps:

  • Create a test for performance comparison
  • Investigate the "ArrayItemMatcher" idea
  • Remove/ignore unused whitespace/junk (every iteration hurts in PHP); a stream pre-processor to the rescue?
  • Work out a "clean" way to define tokens, like PHP's own scanner, so maintenance/debugging stays easy
  • Have a look at the RParser to see whether it fits our needs
  • Publish an alpha version for you/others to try
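The whitespace item above could be prototyped roughly like this (a sketch under assumptions, not the planned implementation): drop insignificant whitespace before the lexer sees the chunk, with the string state surviving chunk boundaries, since a chunk can end mid-string. Note the irony that in pure PHP this just trades one per-byte loop for another, so it would only really pay off inside a C extension or a filter written once for many downstream iterations.

```php
<?php
// Hypothetical stream pre-processor: strips whitespace outside string
// literals from each chunk. $inString/$escaped persist across chunks.
function stripJsonWhitespace(iterable $chunks): \Generator
{
    $inString = false;
    $escaped = false;
    foreach ($chunks as $chunk) {
        $out = '';
        $len = strlen($chunk);
        for ($i = 0; $i < $len; $i++) {
            $c = $chunk[$i];
            if ($inString) {
                $out .= $c;
                if ($escaped)        { $escaped = false; }
                elseif ($c === '\\') { $escaped = true; }
                elseif ($c === '"')  { $inString = false; }
            } elseif ($c === '"') {
                $inString = true;
                $out .= $c;
            } elseif (!ctype_space($c)) {
                $out .= $c; // keep everything that is not whitespace
            }
        }
        if ($out !== '') {
            yield $out;
        }
    }
}

$chunks = ["{ \"a b\": [ 1,\n", "  2 ] }"];
echo implode('', iterator_to_array(stripJsonWhitespace($chunks), false)), "\n";
// prints {"a b":[1,2]}
```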

You seem dedicated :) Can you show some code?

Are you using tests to verify correctness?

I will elaborate on the other topics once we see a significant impact on performance.

While looking for a faster alternative, I found https://github.com/shevron/ext-jsonreader
It's written in C and offers streaming as well. I haven't done any testing, but before you start building something new you might want to check it out. Unfortunately it's quite old, but it might still give you a head start.

Let's continue in #97