Degraded performance with multiple threads
dbackeus opened this issue · comments
On the Naming Things Discord, @oldmoe mentioned that while extralite
has superior performance in single threaded benchmark, it may not perform well in multi threaded scenarios. Seeing this I decided to try some benchmarking. My results appear to confirm this suggestion. But it also wouldn't surprise me if my benchmarking method could be improved.
Script
require "sqlite3"
require "extralite"
require "benchmark"
# Setup
File.delete("benchmark.sqlite3") if File.exist?("benchmark.sqlite3")
POOL_SIZE = 10
EXTRALITE_CONNECTIONS = POOL_SIZE.times.map do
db = Extralite::Database.new("benchmark.sqlite3")
db.execute("PRAGMA journal_mode = WAL")
db.execute("PRAGMA synchronous = NORMAL")
db.execute("PRAGMA journal_size_limit = 64000000")
db.execute("PRAGMA mmap_size = 128000000")
db.execute("PRAGMA cache_size = 2000")
db.execute("PRAGMA busy_timeout = 5000")
db
end
SQLITE3_CONNECTIONS = POOL_SIZE.times.map do
db = SQLite3::Database.new("benchmark.sqlite3")
db.execute("PRAGMA journal_mode = WAL")
db.execute("PRAGMA synchronous = NORMAL")
db.execute("PRAGMA journal_size_limit = 64000000")
db.execute("PRAGMA mmap_size = 128000000")
db.execute("PRAGMA cache_size = 2000")
db.execute("PRAGMA busy_timeout = 5000")
db
end
EXTRALITE_CONNECTIONS[0].execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, created_at TEXT, updated_at TEXT) STRICT")
insert_statement = EXTRALITE_CONNECTIONS[0].prepare("INSERT INTO users (name, created_at, updated_at) VALUES (?, ?, ?)")
100.times do
insert_statement.execute("John Doe", Time.now.iso8601, Time.now.iso8601)
end
# Benchmark variations
THREAD_COUNTS = [1, 2, 4, 8]
LIMITS = [1, 10, 100]
CLIENTS = %w[extralite sqlite3]
# Benchmark
GC.disable
Benchmark.bm do |x|
LIMITS.each do |limit|
THREAD_COUNTS.each do |thread_count|
CLIENTS.each do |client|
GC.start
x.report("#{client.rjust('extralite'.length)} - limit: #{limit}, threads: #{thread_count}") do
threads = thread_count.times.map do |thread_number|
Thread.new do
start = Time.now
if client == "extralite"
10_000.times do
records = EXTRALITE_CONNECTIONS[thread_number].query_ary("SELECT * FROM users LIMIT #{limit}")
raise "Expected #{limit} but got #{length}" unless records.length == limit
end
else
10_000.times do
records = SQLITE3_CONNECTIONS[thread_number].query("SELECT * FROM users LIMIT #{limit}").entries
raise "Expected #{limit} but got #{length}" unless records.length == limit
end
end
end
end
threads.each(&:join)
end
end
puts
end
end
end
GC.enable
Results
These results are from a Hetzner server equipped with a Ryzen 5950x CPU running Ubuntu 22.04 / Ruby 3.3.0-rc1. I'm using extralite 2.3 and sqlite3 1.6.9.
user system total real
extralite - limit: 1, threads: 1 0.140148 0.000000 0.140148 ( 0.143628)
sqlite3 - limit: 1, threads: 1 0.128365 0.017156 0.145521 ( 0.146815)
extralite - limit: 1, threads: 2 0.389131 0.143870 0.533001 ( 0.353893)
sqlite3 - limit: 1, threads: 2 0.290792 0.005425 0.296217 ( 0.295629)
extralite - limit: 1, threads: 4 1.295837 0.741622 2.037459 ( 1.010532)
sqlite3 - limit: 1, threads: 4 0.568105 0.041114 0.609219 ( 0.608620)
extralite - limit: 1, threads: 8 2.167029 1.389204 3.556233 ( 1.739894)
sqlite3 - limit: 1, threads: 8 1.103070 0.026563 1.129633 ( 1.128541)
extralite - limit: 10, threads: 1 0.115242 0.012818 0.128060 ( 0.129256)
sqlite3 - limit: 10, threads: 1 0.294360 0.009548 0.303908 ( 0.303867)
extralite - limit: 10, threads: 2 0.706740 0.278774 0.985514 ( 0.730320)
sqlite3 - limit: 10, threads: 2 0.575911 0.019264 0.595175 ( 0.595926)
extralite - limit: 10, threads: 4 1.931356 1.165848 3.097204 ( 2.578283)
sqlite3 - limit: 10, threads: 4 1.124513 0.035563 1.160076 ( 1.159298)
extralite - limit: 10, threads: 8 4.063315 2.494903 6.558218 ( 5.356409)
sqlite3 - limit: 10, threads: 8 2.298358 0.022259 2.320617 ( 2.319177)
extralite - limit: 100, threads: 1 0.515307 0.000000 0.515307 ( 0.518008)
sqlite3 - limit: 100, threads: 1 1.762169 0.038139 1.800308 ( 1.803624)
extralite - limit: 100, threads: 2 5.331649 4.482695 9.814344 ( 9.385591)
sqlite3 - limit: 100, threads: 2 3.631310 0.000000 3.631310 ( 3.627896)
extralite - limit: 100, threads: 4 12.327495 9.248137 21.575632 ( 20.498224)
sqlite3 - limit: 100, threads: 4 7.212354 0.000000 7.212354 ( 7.209987)
extralite - limit: 100, threads: 8 26.625104 19.910282 46.535386 ( 44.283113)
sqlite3 - limit: 100, threads: 8 14.774608 0.000000 14.774608 ( 14.769324)
I've noted that run to run variance can be quite high. But the pattern seems clear in that latency of sqlite3
increases linearly when adding threads (as one would expect from a GVL locking implemnentation), while extralite
latency increase is a bit all over the place, but overall higher than linear, and probably higher than one would like. The jump from 1 to 2 threads is especially gruesome, eg. the 100 limit test being 10x slower when using 2 threads instead of 1.
Oldmoe's hypothesis is that excessive GVL releasing may be hurting performance. But it may require a deeper dive to reason about all the factors. Or maybe I'm just doing the benchmarking wrong 😅
Oldmoe's hypothesis is that excessive GVL releasing may be hurting performance. But it may require a deeper dive to reason about all the factors. Or maybe I'm just doing the benchmarking wrong 😅
I can confirm the results. I guess the cause for the slowness and high variance is the GVL releasing. This is certainly disappointing. I'll guess the only solution is to add a mode where the GVL is not released 😄 .
Maybe there's a sweet spot middle ground to figure out? I feel like, on the whole, some GVL release is necessary to prevent applications from completely freezing up on long running queries. Even a small performance penalty, if necessary, would be worth that flexibility.
Here's what moe suggested:
I think the idea of releasing the GVL for every statement is unnecessary overhead, moreover, releasing it for every row fetched is just not sensible. For multi threaded environments, I think the best approach would be to have a version of execute that runs without the GVL, and only for the first call of step, this way, application writers can use these for specific purposes, like for example the AR load_async interface, for SQLite this could run in a background thread with the GVL turned off during the first and only the first call to step,
Even in a fiber based environment, a background thread to run these long queries should be OK in most cases.
I don't have any insight into the implementation details so I'm not sure what it means, but it seems to suggest some middle ground might be possible.
I think the idea of releasing the GVL for every statement is unnecessary overhead, moreover, releasing it for every row fetched is just not sensible. For multi threaded environments, I think the best approach would be to have a version of execute that runs without the GVL, and only for the first call of step, this way, application writers can use these for specific purposes, like for example the AR load_async interface, for SQLite this could run in a background thread with the GVL turned off during the first and only the first call to step,
This sounds like a good idea. Let me see what I can do.
After implementing a gvl_mode
setting and adding db.gvl_mode = :hold
to the benchmark script, here are the results:
user system total real
extralite - limit: 1, threads: 1 0.011072 0.003621 0.014693 ( 0.014709)
sqlite3 - limit: 1, threads: 1 0.026727 0.004549 0.031276 ( 0.031354)
extralite - limit: 1, threads: 2 0.029291 0.003167 0.032458 ( 0.032408)
sqlite3 - limit: 1, threads: 2 0.052903 0.012266 0.065169 ( 0.065166)
extralite - limit: 1, threads: 4 0.057473 0.011788 0.069261 ( 0.069064)
sqlite3 - limit: 1, threads: 4 0.116377 0.028080 0.144457 ( 0.144495)
extralite - limit: 1, threads: 8 0.105049 0.016368 0.121417 ( 0.121112)
sqlite3 - limit: 1, threads: 8 0.216640 0.023925 0.240565 ( 0.240585)
extralite - limit: 10, threads: 1 0.024705 0.000084 0.024789 ( 0.025352)
sqlite3 - limit: 10, threads: 1 0.048553 0.008088 0.056641 ( 0.056730)
extralite - limit: 10, threads: 2 0.048144 0.000000 0.048144 ( 0.048180)
sqlite3 - limit: 10, threads: 2 0.095901 0.015965 0.111866 ( 0.111927)
extralite - limit: 10, threads: 4 0.083829 0.007898 0.091727 ( 0.091426)
sqlite3 - limit: 10, threads: 4 0.189711 0.024126 0.213837 ( 0.214113)
extralite - limit: 10, threads: 8 0.182138 0.011527 0.193665 ( 0.193720)
sqlite3 - limit: 10, threads: 8 0.388128 0.028402 0.416530 ( 0.416619)
extralite - limit: 100, threads: 1 0.104511 0.000000 0.104511 ( 0.104554)
sqlite3 - limit: 100, threads: 1 0.260976 0.016016 0.276992 ( 0.277149)
extralite - limit: 100, threads: 2 0.172389 0.011990 0.184379 ( 0.184530)
sqlite3 - limit: 100, threads: 2 0.512232 0.031894 0.544126 ( 0.544166)
extralite - limit: 100, threads: 4 0.313612 0.036185 0.349797 ( 0.349906)
sqlite3 - limit: 100, threads: 4 1.028803 0.039815 1.068618 ( 1.068644)
extralite - limit: 100, threads: 8 0.615005 0.063960 0.678965 ( 0.677928)
sqlite3 - limit: 100, threads: 8 2.079794 0.091753 2.171547 ( 2.171997)
I'll continue work on adding a hybrid GVL mode as per @oldmoe's suggestion, where the GVL is released on fetching the first row, then held on the rest and we'll see how that performs.
A note on David's supplied benchmark, it lacks a single threaded baseline, since the threads = 1 state has actually another thread spawned next to the main thread, I believe this is where Extralite works best, and will probably be slightly better without the GVL release.
Fixed in #46.