Extracting a single field from large files is a bit slow
JustinAzoff opened this issue
Justin commented
conn_big.log.zst is a JSON log containing 7,620,480 rows like:
{"ts":1258531221.486539,"uid":"Cxf2Q62s0bGI7HZujj","id.orig_h":"192.168.1.102","id.orig_p":68,"id.resp_h":"192.168.1.1","id.resp_p":67,"proto":"udp","service":"dhcp","duration":0.16382,"orig_bytes":301,"resp_bytes":300,"conn_state":"SF","missed_bytes":0,"history":"Dd","orig_pkts":1,"orig_ip_bytes":329,"resp_pkts":1,"resp_ip_bytes":328}
❯ hyperfine -M 3 -L prog "jq,./target/release/jaq" -- "zstdcat ~/tmp/json_logs/conn_big.log.zst | {prog} -cr .proto > /dev/null"
Benchmark 1: zstdcat ~/tmp/json_logs/conn_big.log.zst | jq -cr .proto > /dev/null
Time (mean ± σ): 25.292 s ± 0.059 s [User: 26.299 s, System: 0.617 s]
Range (min … max): 25.233 s … 25.351 s 3 runs
Benchmark 2: zstdcat ~/tmp/json_logs/conn_big.log.zst | ./target/release/jaq -cr .proto > /dev/null
Time (mean ± σ): 38.434 s ± 0.123 s [User: 37.036 s, System: 3.545 s]
Range (min … max): 38.304 s … 38.547 s 3 runs
Summary
'zstdcat ~/tmp/json_logs/conn_big.log.zst | jq -cr .proto > /dev/null' ran
1.52 ± 0.01 times faster than 'zstdcat ~/tmp/json_logs/conn_big.log.zst | ./target/release/jaq -cr .proto > /dev/null'
Since this is often the only thing I do, I have a stupid-simple tool called json-cut that does this using github.com/buger/jsonparser, which shows how fast this sort of thing can be:
❯ hyperfine -M 3 -L prog "~/projects/json-cut/json-cut proto" -- "zstdcat ~/tmp/json_logs/conn_big.log.zst | {prog} > /dev/null"
Benchmark 1: zstdcat ~/tmp/json_logs/conn_big.log.zst | ~/projects/json-cut/json-cut proto > /dev/null
Time (mean ± σ): 2.077 s ± 0.005 s [User: 3.458 s, System: 0.358 s]
Range (min … max): 2.071 s … 2.080 s 3 runs
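The speedup comes from the jsonparser approach of scanning bytes for the requested key rather than building a full parse tree for every record. A minimal sketch of that idea in Rust (the `extract_string_field` helper is hypothetical, not json-cut's actual code, and it deliberately ignores escaped quotes, nested objects, and non-string values):

```rust
// Naive single-field extraction by byte scanning instead of full JSON
// parsing. Only correct for unescaped string values in flat objects.
fn extract_string_field(line: &str, field: &str) -> Option<String> {
    // Look for `"field":"` and take everything up to the next quote.
    let needle = format!("\"{field}\":\"");
    let start = line.find(&needle)? + needle.len();
    let rest = &line[start..];
    let end = rest.find('"')?;
    Some(rest[..end].to_string())
}

fn main() {
    let line = r#"{"ts":1258531221.486539,"proto":"udp","service":"dhcp"}"#;
    // Prints: udp
    println!("{}", extract_string_field(line, "proto").unwrap());
}
```

A real jq-compatible tool cannot take this shortcut in general, since arbitrary filters need the whole value parsed, but it illustrates the ceiling for the single-field case.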
I noticed that the high sys time in this case appears to come from not using buffered I/O on stdout. Applying this small change makes it 8% faster overall:
diff --git a/jaq/src/main.rs b/jaq/src/main.rs
index 02b2390..2ec3e0b 100644
--- a/jaq/src/main.rs
+++ b/jaq/src/main.rs
@@ -3,6 +3,7 @@ use jaq_interpret::{Ctx, Filter, FilterT, ParseCtx, RcIter, Val};
use std::io::{self, BufRead, Write};
use std::path::PathBuf;
use std::process::{ExitCode, Termination};
+use std::io::BufWriter;
#[cfg(feature = "mimalloc")]
#[global_allocator]
@@ -455,10 +456,11 @@ fn print(cli: &Cli, val: Val, writer: &mut impl Write) -> io::Result<()> {
Ok(())
}
-fn with_stdout<T>(f: impl FnOnce(&mut io::StdoutLock) -> Result<T, Error>) -> Result<T, Error> {
- let mut stdout = io::stdout().lock();
- let y = f(&mut stdout)?;
- stdout.flush()?;
+fn with_stdout<T>(f: impl FnOnce(&mut BufWriter<io::StdoutLock>) -> Result<T, Error>) -> Result<T, Error> {
+ let stdout = io::stdout().lock();
+ let mut buffered = BufWriter::new(stdout);
+ let y = f(&mut buffered)?;
+ buffered.flush()?;
Ok(y)
}
That should probably only buffer stdout when stdout is not a TTY, though.
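A sketch of that conditional buffering, assuming Rust 1.70+ for the std::io::IsTerminal trait (the `wrap_writer` helper name is made up for illustration, not part of the patch above):

```rust
use std::io::{self, BufWriter, IsTerminal, Write};

// Wrap a writer in a BufWriter only when it is not attached to a
// terminal, so interactive output stays immediate while piped output
// is batched to cut per-line syscall overhead.
fn wrap_writer<W: Write + 'static>(w: W, is_tty: bool) -> Box<dyn Write> {
    if is_tty {
        Box::new(w)
    } else {
        Box::new(BufWriter::new(w))
    }
}

fn main() -> io::Result<()> {
    let is_tty = io::stdout().is_terminal();
    let mut out = wrap_writer(io::stdout().lock(), is_tty);
    out.write_all(b"udp\n")?;
    out.flush()
}
```

The trade-off is the usual one: an interactive user sees each line as it is produced, while a pipeline consumer gets large writes and far fewer context switches.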
❯ hyperfine -M 3 -L prog "./target/release/jaq.orig,./target/release/jaq" -- "zstdcat ~/tmp/json_logs/conn_big.log.zst | {prog} -cr .proto > /dev/null"
Benchmark 1: zstdcat ~/tmp/json_logs/conn_big.log.zst | ./target/release/jaq.orig -cr .proto > /dev/null
Time (mean ± σ): 38.410 s ± 0.136 s [User: 36.983 s, System: 3.557 s]
Range (min … max): 38.259 s … 38.523 s 3 runs
Benchmark 2: zstdcat ~/tmp/json_logs/conn_big.log.zst | ./target/release/jaq -cr .proto > /dev/null
Time (mean ± σ): 35.687 s ± 0.202 s [User: 36.368 s, System: 1.301 s]
Range (min … max): 35.497 s … 35.899 s 3 runs
Summary
'zstdcat ~/tmp/json_logs/conn_big.log.zst | ./target/release/jaq -cr .proto > /dev/null' ran
1.08 ± 0.01 times faster than 'zstdcat ~/tmp/json_logs/conn_big.log.zst | ./target/release/jaq.orig -cr .proto > /dev/null'