goccy / bigquery-emulator

BigQuery emulator server implemented in Go

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Improve emulator performance for large projects

ohaibbq opened this issue · comments

What would you like to be added?

Hi @goccy @totem3, I'm filing this ticket as an epic to outline a few shortcomings we're running into.
I've undertaken work to address these issues and am nearing completion.

Our primary use-case facing these issues involves replacing a build that loads our view graph to BigQuery for testing. The graph currently encompasses ~2,500 tables and ~3,600 views, and ~3,000 tables materialized from views.

Outline of Issues

1. Project metadata middleware

Entire BigQuery project is loaded per request. When the project has thousands of tables, views, jobs, each request becomes very slow. The API data-access pattern should be changed to access only the data that pertains to a request.


2. SQLite always uses file-backed storage

In my tests, utilizing the memory-backed SQLite storage was nearly 3x faster. We should add an option to allow the emulator to run using in-memory storage.


3. go-zetasqlite usage leads to inefficient SQLite query plans

For complex BigQuery queries this is understandable, but for some heavily repeated queries that are known ahead of time (i.e. metadata.Repository.FindProjects()), we should try to avoid this.

This problem compounds when doing simple primary-key based lookups. If we were to utilize SQLite's CREATE TABLE ... WITHOUT ROWID functionality, simple lookups essentially become hash lookups. Without it, SQLite cannot predict the equality of values, so it must scan the entire table.

In order for SQLite to use its hash-based primary key lookup, we'd need to be using the native SQLite = operator.
go-zetasqlite rewrites these calls to use zetasqlite_equal(), which is unnecessary for the metadata repository.


4. Re-used repository queries do not use prepared statements


5. --data-from-yaml YAML parser is exceptionally slow

We have a script to populate our ~2,500 source table definitions into a data file to bootstrap the emulator.
Parsing this file takes many minutes when using YAML.

Parsing a JSON file with the same contents only takes ~75ms.

An alternative --data-from-json parameter to the binary.

Hi @ohaibbq,
Thank you for this summary and your work for improvement.
Since we also face performance problems, and the issues you summarized match what I found, I am excited to hear you are working on these performance issues.
I will try the Pull Requests you submitted.

Great. I will tag you in the PR that achieves the performance improvements.
I'm still piecing out the prerequisite PRs in go-zetasqlite.

From my observed benchmarks, we were able to load 2,500 tables into the emulator at startup via --data-from-json data.json --database :memory: in 8 seconds.
We are also able to load and materialize 1,400 views in 35 seconds.

@totem3 It will be a little while until the PR dependencies are merged into their respective @goccy repositories, but you should be able to build an emulator binary using this branch of our fork-

Recidiviz#12

@totem3 @goccy This is ready for your review if you'd like to take a look for more context on how the other PRs I opened up fit into the puzzle.
Recidiviz#12

I finally got the chance to compare runtimes now that all my PRs had been sorted out. A single request to the emulator takes ~18 seconds when the project has all of our 2,500 source tables added to it.

Now most all requests are sub 15ms.

@ohaibbq
Thank you for sharing the branch! It made verification much easier for me.
I haven't looked at the changes in detail yet, but I've tried it.
I am excited at how much faster the tests have become.

We mainly use the BigQuery emulator for testing. Previously, tests took over 15 minutes, but with this version, they finish in less than a minute. The tests are somewhat unstable, so they need to be checked. It’s possible that tests are failing due to concurrency issues because of the increased speed, so I am looking into that.

That's great to hear! The impact you are seeing is likely mostly due to the API data access refactor.

Our fork has only slightly diverged from the upstream repositories, some other notable performance improvements that may be improving your test times are here:
Recidiviz/go-zetasqlite#32
Recidiviz/go-zetasqlite#20