moltar / typescript-runtime-type-benchmarks

📊 Benchmark Comparison of Packages with Runtime Validation and TypeScript Support

Home Page: https://moltar.github.io/typescript-runtime-type-benchmarks/


Order of benchmarks seems to bias results

sinclairzx81 opened this issue

Hi, thanks for this project.

I'm currently looking at submitting a new runtime type checker for comparative performance benchmarks, but have noticed that the order in which each validator is run seems to degrade subsequent benchmark results. By moving tests around, I'm able to greatly improve a test's results simply by running it first.

I'm thinking this is likely due to some of the validators incurring more memory overhead (implicating the GC), or to V8's optimization heuristics breaking down and causing V8 to take the slow path (quite likely, given how validation routines may require dynamic property access on objects to complete). To rule out the GC, I started the project with npm run compile:spectypes && node --expose-gc -r ts-node/register ./index.ts and called global.gc() to force collection prior to the tests running, but no luck. So the problem is most likely related to V8 optimization heuristics.
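For reference, this is roughly the shape of the forced-collection experiment described above; the case list and timing code are illustrative assumptions, not the repo's actual harness.

// gc-experiment.ts — a minimal sketch, run with:
//   node --expose-gc -r ts-node/register gc-experiment.ts

interface Case {
  name: string;
  fn: () => void;
}

function runWithForcedGc(cases: Case[]): void {
  // global.gc is only defined when Node is started with --expose-gc.
  const gc: (() => void) | undefined = (globalThis as any).gc;
  for (const c of cases) {
    // Force a full collection so garbage left over from the previous
    // case cannot influence this one.
    if (gc) gc();
    const start = process.hrtime.bigint();
    c.fn();
    const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
    console.log(`${c.name}: ${elapsedMs.toFixed(3)} ms`);
  }
}

// Illustrative only: real cases would be the suite's benchmark functions.
runWithForcedGc([{ name: 'noop', fn: () => {} }]);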

To give all tests an even footing, I'm wondering if it would be possible to reshape them such that each is run within its own Node process (rather than running all tests in the same process). This would give each a fresh instance in which to do its thing, and should yield more accurate results.

Open to thoughts if this is something worth looking into.

Hey @sinclairzx81, thank you for your feedback.

This is certainly a serious issue.

I am not an expert on Node internals or how GC works, beyond the CS 201 type of information :)

Do you have any proposals that are specific to this repo?

I had an idea to run every test in a separate GitHub Action job, which would provide the isolation, but then I was worried about the jobs being placed on different machines with different specs.

Running them in a separate process on the same machine sounds like good middle ground.

Wondering what it would take to do so, given our current setup.

Is there anything, perhaps, in benny's settings that we could tweak to equalize the testing environment?

Maybe @caderek can recommend something?

@moltar Hi, thanks for the quick reply.

I agree that each test should be run on the same machine (not distributed across different machines on CI). I guess my initial thought was to just run each test in its own V8 isolate. This could be achieved in a couple of ways, but perhaps the easiest is a CLI command that is able to target an individual test. That would enable something like the following in package.json:

// runs spectypes only
{
   "scripts": {
      "start": "npm run compile:spectypes && ts-node index.ts --test spectypes"
   }
}
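For illustration, index.ts could honour such a --test flag along these lines; the benchmark registry and names below are hypothetical, not the repo's actual code.

// index.ts (sketch) — hypothetical --test flag handling.

interface Benchmark {
  name: string;
  run: () => void;
}

// Stand-in for however the suite registers its cases.
const allBenchmarks: Benchmark[] = [
  { name: 'spectypes', run: () => console.log('running spectypes cases') },
  { name: 'zod', run: () => console.log('running zod cases') },
];

const flagIndex = process.argv.indexOf('--test');
const only = flagIndex >= 0 ? process.argv[flagIndex + 1] : undefined;

for (const benchmark of allBenchmarks) {
  // When --test is given, skip everything except the named module.
  if (only !== undefined && benchmark.name !== only) continue;
  benchmark.run();
}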

So, to run the complete test suite, you could just keep appending to the start script...

// runs everything
{
   "scripts": {
      "start": "npm run compile:spectypes && ts-node index.ts --test spectypes && ts-node index.ts --test zod && ... (omitted)"
   }
}

As the start command would likely get prohibitively long, Google's zx package could be used to write a JavaScript-based shell script that runs each test and awaits each in turn.

// runner.mjs

// build spectypes
await $`npm run compile:spectypes`

// load cases somehow, and execute in series (each test runs as a separate OS process)
for (const testCase of loadCases()) {
  await $`ts-node index.ts --test ${testCase}`
}

$ npx zx runner.mjs

I had a quick look through the code, and I guess the only complication that might arise is handling the aggregate results (writing benchmark output to node1x.json, which gets sourced in the UI later on). Not sure if there is going to be an issue with each test running in isolation.

Open to thoughts.
Cheers
S

run on the same machine

👍

a cli command that is able to target an individual test

👍

I had a quick look through the code, and I guess the only complication that might arise is handling the aggregate results (writing benchmark output to node1x.json, which gets sourced in the UI later on). Not sure if there is going to be an issue with each test running in isolation.

The results JSON format is pretty simple: an array containing an object for each benchmark/module combination. That can easily be rewritten so that every isolated benchmark run appends its individual result to an already existing results JSON file. Benchmarks obviously won't run in parallel, so there won't be any file transaction issues.
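As a rough sketch of that append step (the file layout, result shape, and field names are assumptions, not the repo's actual schema):

// append-result.ts (sketch) — appends one isolated run's result to the
// aggregate results file.
import * as fs from 'fs';

interface BenchmarkResult {
  benchmark: string; // e.g. 'parseSafe'
  name: string;      // module name, e.g. 'zod'
  ops: number;       // operations per second
  margin: number;    // error margin reported by the benchmark runner
}

export function appendResult(file: string, result: BenchmarkResult): void {
  // Load the existing array (or start fresh), push, and write back.
  // Runs are serial, so a plain read-modify-write is safe here.
  const existing: BenchmarkResult[] = fs.existsSync(file)
    ? JSON.parse(fs.readFileSync(file, 'utf8'))
    : [];
  existing.push(result);
  fs.writeFileSync(file, JSON.stringify(existing, null, 2));
}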

Two more thoughts/questions that come to my mind are:

  1. Should we run all benchmarks of a module (assertString, parseSafe etc) in the same process or isolate that too?
  2. Should only the chosen module be loaded or is it okay to load all modules (zod, io-ts, ...) even though only a single one is benchmarked?

@hoeck Hi!

Should we run all benchmarks of a module (assertString, parseSafe etc) in the same process or isolate that too?

This is a good question. I'm thinking the best way to orchestrate these tests might be to organize their execution by package (rather than by test). So the process would start for a given validation package (e.g. myzod), execute all the tests for that package, then exit. The next package would then be run immediately afterwards (in its own process).

The thinking here is that if performance degrades across the tests run for a single package, that's generally going to be a consequence of that package throttling V8 in some way. Package authors may be able to identify throughput bottlenecks if they know the sequence in which the tests were run, and if the results show clear degradation across that sequence.

Should only the chosen module be loaded or is it okay to load all modules (zod, io-ts, ...) even though only a single one is benchmarked?

It might be a good idea to omit packages from the imports if they are not used in the test. It's probably going to be fine either way, but there is a potential for packages to cache acceleration data structures on import, and such caching could bias results.
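For example, only the module under test could be pulled in via a dynamic import, keeping the other validators out of the process entirely; this is a sketch, not the repo's actual loading code.

// load-validator.ts (sketch) — dynamic import keeps unused validators
// from ever being loaded into the benchmark process.
async function loadValidator(name: string): Promise<unknown> {
  switch (name) {
    case 'zod':
      return import('zod');
    case 'myzod':
      return import('myzod');
    default:
      throw new Error(`unknown validator: ${name}`);
  }
}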

I should note that in the testing I've done so far, I have been importing all the tests in the suite and haven't experienced a problem with them all being there. Degradation seems tied more to the order of execution; importing all of these packages seems to be okay.

Hope this helps!
S

@sinclairzx81 @moltar or anyone else volunteering for this? Otherwise I'm going to try to implement @sinclairzx81's suggestions.

@hoeck Hi, that's great news :) I wasn't planning on undertaking these updates myself (as that work is best left to those doing the benchmarking), but I can help with the PR process by running some local tests and providing feedback prior to merge, if that's helpful.

Keep me posted!

Otherwise I'm going to try to implement

That would be great! 👍🏼 😁

@sinclairzx81 could you please try the changes from my PR?

When you execute npm start or npm run start or ./start.sh, benchmarks are now run for each module in its own npm process and the results are aggregated.

You can also do npm run start run zod myzod to benchmark only the listed modules.

And most importantly: does the ordering bias go away, or does it persist?

@hoeck Hi, sure. I just ran your branch and it looks like that did the trick. Many of the libraries are now giving much better results across the board, with some crossing the 1 million operations per second mark, which is nice to see! Hopefully the updates weren't too bad to implement. Also, +1 for adding the ability to run individual cases, very useful!

Good work mate!
S

Thx for the feedback @sinclairzx81 😊 and I am glad that this removed the result bias. I will fix up the linter issues with the PR tomorrow so that we can merge it.

@hoeck amazing work, as usual! 👍🏼

I'm wondering if we should temporarily add a notice banner at the top explaining this change, maybe with a link to this issue for details?

I'm wondering if we should temporarily add a notice banner at the top explaining this change, maybe with a link to this issue for details?

That makes sense, I'll add a simple info message at the top of the readme.