Deflate Parser + Speed Improvements

Question

marcioAlmada opened this issue 11 years ago

Parser method is too long and it's getting complicated. CRAP index is currently around 8. This needs a fix before version 2.

Márcio Almada · Answer 1 · Tue Dec 17 2013 22:22:54 GMT+0800 (China Standard Time)

15% faster now af0b221

LAHAXE Arnaud · Answer 2 · Tue Dec 17 2013 22:31:49 GMT+0800 (China Standard Time)

Nice work !

Márcio Almada · Answer 3 · Wed Dec 18 2013 20:10:15 GMT+0800 (China Standard Time)

😆 still we have so much room for improvements, though

Márcio Almada · Answer 4 · Fri Dec 20 2013 23:57:08 GMT+0800 (China Standard Time)

This latest improvement ebabf85 benefits dynamic annotations only, usually the most frequent ones.

ignace nyamagana butera · Answer 5 · Sat Dec 28 2013 16:59:19 GMT+0800 (China Standard Time)

Hi,
Is it possible to add your speed tests to the source code? That way, if someone makes a change or tries to improve the code, they can directly test it against your performance tests.

Márcio Almada · Answer 6 · Sun Dec 29 2013 04:10:26 GMT+0800 (China Standard Time)

Hi @nyamsprod,

I'm currently using the unit tests as benchmark. Steps:

1. disable the xdebug extension (very important)
2. run the phpunit tests at least 300 times against the old code: phpunit --repeat 300
3. wait for the CPU to settle down (just in case it's necessary)
4. run the phpunit tests the same number of times against the optimized code: phpunit --repeat 300
5. compare time and memory usage
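
For a quick check outside phpunit, the same idea (repeat, then compare time and memory) can be sketched in plain PHP. This is only a rough illustration: `benchmark()` is a hypothetical helper that is not part of the project, and the `json_decode` workload is a stand-in for whatever parser call you want to measure.

```php
<?php
// Hypothetical helper, not part of minime/annotations:
// runs a callable repeatedly and reports elapsed time and peak memory.
function benchmark(callable $fn, $iterations = 300)
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }

    return [
        'seconds' => microtime(true) - $start,
        // Peak memory of the whole PHP process so far, not of $fn alone.
        'peak_memory_bytes' => memory_get_peak_usage(),
    ];
}

// Stand-in workload; swap in the code under test.
$result = benchmark(function () {
    json_decode('{"a": [1, 2, 3]}', true);
});

printf(
    "%.4f s, %.2f MB peak\n",
    $result['seconds'],
    $result['peak_memory_bytes'] / 1048576
);
```

As with the phpunit runs, keep xdebug disabled and run both versions of the code the same number of times before comparing.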

I also just added some test groups [parser, bag, facade] so it gets easier to aim at some specific code. For this issue, I'm using just the parser group for measurements: phpunit --group parser.

Hope it was helpful. Cheers!

PS: For licensing reasons, some Linux distros ship a different json extension. The parser relies a lot on json_decode, so results may vary (just a little). Usually ext-json is much faster than pecl-json-c. To check which extension you're using, just run defined('JSON_C_VERSION'). True means you're using pecl-json-c, false means you're using ext-json.
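
A minimal sketch of that check as a script. `jsonImplementation()` is a hypothetical helper name; the only thing it relies on is the `JSON_C_VERSION` constant mentioned above.

```php
<?php
// Hypothetical helper: reports which JSON implementation is loaded,
// based on the presence of the JSON_C_VERSION constant.
function jsonImplementation()
{
    // Note the quotes: defined() expects the constant name as a string.
    return defined('JSON_C_VERSION') ? 'pecl-json-c' : 'ext-json';
}

echo jsonImplementation(), PHP_EOL;
```

Recording this value next to each benchmark run makes it easier to tell whether two sets of numbers are comparable.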

Márcio Almada · Answer 7 · Sun Dec 29 2013 04:28:09 GMT+0800 (China Standard Time)

The current bottleneck is the implicit boolean type. It depends too much on incremental string scanning 😉

if ($line->scanImplicitBoolean($identifier_pattern)) { // if implicit boolean
    $parameters[$key][] = true;
    while (! $line->hasTerminated()) {
        $line->skip("/\\{$identifier_pattern}/");
        $key = $line->scanKey($key_pattern);
        $parameters[$key][] = true;
    }
    continue;
}

ignace nyamagana butera · Answer 8 · Sun Dec 29 2013 07:29:32 GMT+0800 (China Standard Time)

Thanks for the response, I'll use your method for test standardisation then. As for the json extension, I'm well aware of the differences and I'm already taking them into account in my code.

ignace nyamagana butera · Answer 9 · Sun Dec 29 2013 20:52:19 GMT+0800 (China Standard Time)

I have a problem with your optimizations on the implicit loop.
First, I've added the following property to AnnotationsFixture class :

    /**
     * @get @post @ajax float 2.1
     */
    private $multiple_values_fixture2;

Then I've added the following test in the ParserTest class:

    public function testParseMultipleValuesFixture2()
    {
        $res = $this->getParser('multiple_values_fixture2')->parse();
        $this->assertSame(['get' => true, 'post' => true, 'ajax' => 2.1], $res);
    }

This test does not just fail: your code introduces an infinite loop, so the unit tests never finish. I'm still working on the code, but I think this bug is important enough to be listed here.
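
One plausible way a scanning loop like this can hang (an assumption, not a diagnosis of the actual Parser code): if the scanner fails to match at the current position, the position never advances and the termination check never becomes true. The toy function below is a stand-in, not the StrScan API, and its key pattern is assumed for illustration. It shows a progress guard that turns such a hang into an exception.

```php
<?php
// Toy stand-in for an incremental scanner; NOT the StrScan API.
function scanKeysIncrementally($line)
{
    $keys = [];
    $pos = 0;
    $length = strlen($line);

    while ($pos < $length) {
        // Try to consume one "@key" token (plus surrounding spaces) at the current offset.
        if (!preg_match('/\G\s*@([\w.\-]+)\s*/', $line, $m, 0, $pos)) {
            // Without this guard $pos would never advance and the loop would spin forever.
            throw new UnexpectedValueException("Unexpected input at offset {$pos}");
        }
        $keys[] = $m[1];
        $pos += strlen($m[0]);
    }

    return $keys;
}

print_r(scanKeysIncrementally('@get @post @ajax'));

try {
    scanKeysIncrementally('@get @post @ajax float 2.1');
} catch (UnexpectedValueException $e) {
    echo $e->getMessage(), PHP_EOL;
}
```

With a guard like this, a fixture such as `@get @post @ajax float 2.1` produces a failing test instead of a test run that never ends.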

Márcio Almada · Answer 10 · Sun Dec 29 2013 21:48:41 GMT+0800 (China Standard Time)

Interesting! In fact, this has never been tested before. I always assumed that no one would ever do this, so I never even tried to see what happens.

Well, you found one more reason to stop using while + incremental scanning. A simple one-pass preg_match_all is the best solution IMHO, both for performance and to avoid creating parser black holes like this one ;)

Would you mind creating another issue for this? Also, to keep things consistent, we should name the fixture like this:

/**
 * @get @post @ajax float 2.1
 */
private $bad_implicit_boolean_fixture;

The original intention with implicit boolean annotations was that they shouldn't have explicit values (because their presence already means the value is true) and that only they can be declared on the same line. Annotations with declared values should never share a line with others.

ignace nyamagana butera · Answer 11 · Sun Dec 29 2013 22:46:22 GMT+0800 (China Standard Time)

I've resolved the problem while keeping the "while" :) It will be on GitHub soon. I also rewrote the Scanner class. I'll upload my changes so that you can see what I did, but it's still a work in progress.

Márcio Almada · Answer 12 · Sun Dec 29 2013 22:49:50 GMT+0800 (China Standard Time)

When you say "rewrote", you mean "from scratch"?

ignace nyamagana butera · Answer 13 · Sun Dec 29 2013 22:51:34 GMT+0800 (China Standard Time)

No .. I have simplified the class. I've just pushed the code to my account: https://github.com/nyamsprod/annotations/tree/parser-improvement

Márcio Almada · Answer 14 · Sun Dec 29 2013 22:57:18 GMT+0800 (China Standard Time)

Well, these are some very substantial changes. Did you achieve any relevant optimization?

ignace nyamagana butera · Answer 15 · Sun Dec 29 2013 22:59:27 GMT+0800 (China Standard Time)

yes .. but the last stable does not work on my local machine and when I'm trying phpunit --group parser --repeat 30, phpunit does not run the code and I don't know why. Used separately (--group or --repeat), each option works.

ignace nyamagana butera · Answer 16 · Sun Dec 29 2013 23:00:07 GMT+0800 (China Standard Time)

So I want to confirm the performance gain before stating that the code is OK. The main idea behind the performance optimization is fewer function calls while still keeping the code readable and well decoupled. For instance, the json detection is done only once and I avoid wrapping native PHP functions in methods.

Márcio Almada · Answer 17 · Sun Dec 29 2013 23:04:58 GMT+0800 (China Standard Time)

I filed an issue reporting this phpunit bug yesterday (sebastianbergmann/phpunit#1085); no response yet.

But testing with phpunit --repeat 300 might be enough to reveal any relevant optimization. A bit noisy, but the relevant numbers are the ones that can emerge from noise anyway.

Márcio Almada · Answer 18 · Sun Dec 29 2013 23:07:19 GMT+0800 (China Standard Time)

yes .. but the last stable does not work on my local machine and when I'm trying...

The latest stable phpunit or minime/annotations?

ignace nyamagana butera · Answer 19 · Sun Dec 29 2013 23:14:30 GMT+0800 (China Standard Time)

The latest stable minime/annotations. I get an error on the float fixture test.

Márcio Almada · Answer 20 · Sun Dec 29 2013 23:18:45 GMT+0800 (China Standard Time)

We really need to track this down. It might be related to ext-json being superseded by the pecl-json-c extension. Could you please create a new issue reporting the problem? You can also use issue #20.

ignace nyamagana butera · Answer 21 · Sun Dec 29 2013 23:37:34 GMT+0800 (China Standard Time)

Yes of course, let me do this now, otherwise I might forget 👍

Márcio Almada · Answer 22 · Sun Dec 29 2013 23:38:45 GMT+0800 (China Standard Time)

ok, thanks

ignace nyamagana butera · Answer 23 · Mon Dec 30 2013 06:03:21 GMT+0800 (China Standard Time)

I have:

- upgraded my json-c lib
- added the ReaderTest class to the unit test facade group
- added a ScannerTest class to get 100% code coverage for my Scanner class, and added it to the unit test parser group
- compared the master branch to the parser-improvement branch with the following settings: phpunit --exclude-group bag,facade --repeat 300

Here are my results:

Time: 3.4 minutes, Memory: 35.25Mb (master branch)
Time: 1.55 minutes, Memory: 20.25Mb (parser-improvement branch)

So there's a 45% improvement... but you should run the tests to see for yourself.

Márcio Almada · Answer 24 · Mon Dec 30 2013 06:12:26 GMT+0800 (China Standard Time)

I'm sorry, you really meant minutes? I usually run phpunit --repeat 300 in 2.8 seconds or less...

ignace nyamagana butera · Answer 25 · Mon Dec 30 2013 06:13:17 GMT+0800 (China Standard Time)

must be xdebug presence .. I should disable it :)

ignace nyamagana butera · Answer 26 · Mon Dec 30 2013 06:27:39 GMT+0800 (China Standard Time)

Okay now with xdebug disabled I get:

Time: 15.64 seconds, Memory: 7.25Mb (master branch)
Time: 7.72 seconds, Memory: 5.00Mb (parser-improvement branch)

So the conclusion stays the same. I should point out that my dev computer is not very fast :) (Intel® Pentium(R) D CPU 2.80GHz × 2), so with a more modern computer I'm sure it would be faster.

Márcio Almada · Answer 27 · Mon Dec 30 2013 06:39:12 GMT+0800 (China Standard Time)

Tried to merge the code on a test branch. Sooo many conflicts.

ignace nyamagana butera · Answer 28 · Mon Dec 30 2013 06:43:16 GMT+0800 (China Standard Time)

Yes, probably because of the unit tests I had to rewrite :( I think you can completely replace the ParserTest with the one from the master branch. I have the bad habit of renaming functions with the standard test prefix ... my fault

Márcio Almada · Answer 29 · Mon Dec 30 2013 06:50:41 GMT+0800 (China Standard Time)

Yes, that has prevented me from merging a lot of your contributions lately 😄. Could you please create another branch based on the current develop branch and commit your changes there? Without the unnecessary code, of course. Develop is always ahead of master, so if you base your work on master there will always be many conflicts to solve.

Also, if you're optimizing something, you shouldn't touch existing unit tests, just add new ones. Leave the modified tests for another pull request.

Code standard fixes should go in separate pull requests too, but please skip all the PSR-1 and PSR-2 related stuff; that is done automatically with code fixing tools.

Márcio Almada · Answer 30 · Mon Dec 30 2013 06:52:01 GMT+0800 (China Standard Time)

Please do that so I can test the code here too without losing track of the really important changes.

ignace nyamagana butera · Answer 31 · Mon Dec 30 2013 06:55:27 GMT+0800 (China Standard Time)

Ok I'll do that tomorrow morning

Márcio Almada · Answer 32 · Mon Dec 30 2013 07:00:17 GMT+0800 (China Standard Time)

Nice, I'll comment on the code in your branch.

Márcio Almada · Answer 33 · Mon Dec 30 2013 08:53:03 GMT+0800 (China Standard Time)

@nyamsprod I just cloned your repository and ran the tests against current develop. I don't understand how you got this 45% improvement... your code is running slower than the current develop branch, even without the latest applied JSON_PARSER_NOTSTRICT patch.

I found it quite strange that you said it was running so much faster; none of the commits seemed to really improve speed. Here are the results:

pecl-json-c 1.3.2: (benchmark output not preserved in this copy)

ext-json: (benchmark output not preserved in this copy)

With ext-json, the difference against the improvements gets even more noticeable.

I guess you're probably running your benchmarks against a very, very outdated master, right? Something has to be wrong, because one doesn't simply gain 45% in speed on a specific machine with source code changes that are not that critical.

ignace nyamagana butera · Answer 34 · Mon Dec 30 2013 16:48:03 GMT+0800 (China Standard Time)

Yes, I've added an upstream branch on my local machine to keep up with the changes in master, and now the performance gain is indeed lower.

Márcio Almada · Answer 35 · Mon Dec 30 2013 18:26:58 GMT+0800 (China Standard Time)

I guess we should focus on improving this https://github.com/marcioAlmada/annotations/blob/master/src/Minime/Annotations/Parser.php#L67 before any big step:

if ($line->scanImplicitBoolean($identifier_pattern)) { // if implicit boolean
    $parameters[$key][] = true;
    while (! $line->hasTerminated()) {
        $line->skip("/\\{$identifier_pattern}/");
        $key = $line->scanKey($key_pattern);
        $parameters[$key][] = true;
    }
    continue;
}

To something like this:

if ($line->checkImplicitBoolean($identifier_pattern)) { // if implicit boolean
    //  a single regex call here: `preg_match_all`
    // merge results into parameters all at once
    continue;
}
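
A minimal sketch of what that single-pass branch could look like. This is an illustration under assumptions, not the code that was eventually merged: `parseImplicitBooleans()` is a hypothetical helper, the key pattern is assumed, and it returns a flat `key => true` map rather than the `$parameters[$key][] = true` lists used above.

```php
<?php
/**
 * Hypothetical helper: parse a line made only of implicit booleans,
 * e.g. "@get @post @ajax", in a single pass.
 * Returns null when the line contains anything else.
 */
function parseImplicitBooleans($line, $identifier = '@')
{
    $id = preg_quote($identifier, '/');
    $key = '[A-Za-z_][\w.\-]*'; // assumed key syntax, for illustration only

    // Anchored check: the whole line must be a run of identifier+key tokens.
    if (!preg_match("/^\s*(?:{$id}{$key}\s*)+$/", $line)) {
        return null;
    }

    // One regex call collects every key at once; there is no scanning loop to stall.
    preg_match_all("/{$id}({$key})/", $line, $matches);

    return array_fill_keys($matches[1], true);
}

var_dump(parseImplicitBooleans('@get @post @ajax'));
var_dump(parseImplicitBooleans('@get @post @ajax float 2.1'));
```

Because the check is anchored to the whole line, a line that mixes implicit booleans with an explicit value is rejected up front (the function returns null) and can fall through to the regular parsing path.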

Márcio Almada · Answer 36 · Mon Dec 30 2013 18:28:49 GMT+0800 (China Standard Time)

Or maybe refactor this https://github.com/marcioAlmada/annotations/blob/master/src/Minime/Annotations/Parser.php#L61, so it handles the implicit boolean types before we get into the parsing loop.

ignace nyamagana butera · Answer 37 · Mon Dec 30 2013 19:52:32 GMT+0800 (China Standard Time)

The thing I don't get is why you want to remove the while. On its own, the while construct is not a bottleneck in the code. Micro-optimization can be dangerous if done everywhere. The code as it stands has a good CRAP index (around 3 or 4, I think) and is easily readable; adding more optimization can lead to bugs like the implicit boolean one. And don't forget that you want to add multi-line value parsing, so an if will have to be added somewhere in the parse method at some point, which will lead to a greater CRAP index.

Márcio Almada · Answer 38 · Mon Dec 30 2013 20:01:34 GMT+0800 (China Standard Time)

The problem is not the while itself, it's the string scanning operations.

The StrScan class uses regex to advance towards the end of the string. This is like walking step by step, like someone who is blind and doesn't know the path.

A single regex call (with a more complex regex) is like one big jump, like someone who can see what's ahead and can just jump to the end of the path. This was done before here: af0b221

ignace nyamagana butera · Answer 39 · Mon Dec 30 2013 21:59:24 GMT+0800 (China Standard Time)

Oki, no problem .. I've updated my parser-improvement branch with your suggestion, so I no longer require StrScan at all. I did not change the composer file; I'll leave that change to you. The boost is significant .. I've gained almost 2 seconds 👍

Márcio Almada · Answer 40 · Tue Dec 31 2013 00:33:43 GMT+0800 (China Standard Time)

I cloned your fork again with the latest updates and ran the same tests again. 😄 WE'RE FASTER NOW (10~12%) + your code looks nice and clean. So we have:

- no more incremental string scanning
- less code to maintain

Márcio Almada · Answer 41 · Tue Dec 31 2013 00:49:43 GMT+0800 (China Standard Time)

Awesome result! I would love to merge the new optimized code.

ignace nyamagana butera · Answer 42 · Tue Dec 31 2013 04:50:59 GMT+0800 (China Standard Time)

Not yet, I've found a bug in the data_pattern regex. I'll work on a fix tomorrow.

Márcio Almada · Answer 43 · Wed Apr 09 2014 01:41:47 GMT+0800 (China Standard Time)

Thanks for the improvements, especially @nyamsprod. I think we can close this one for now and reopen if necessary.

Cheers!