01mf02 / jaq

A jq clone focussed on correctness, speed, and simplicity

Extracting a single field from large files is a bit slow

JustinAzoff opened this issue

conn_big.log.zst is a JSON log containing 7,620,480 rows like:

{"ts":1258531221.486539,"uid":"Cxf2Q62s0bGI7HZujj","id.orig_h":"192.168.1.102","id.orig_p":68,"id.resp_h":"192.168.1.1","id.resp_p":67,"proto":"udp","service":"dhcp","duration":0.16382,"orig_bytes":301,"resp_bytes":300,"conn_state":"SF","missed_bytes":0,"history":"Dd","orig_pkts":1,"orig_ip_bytes":329,"resp_pkts":1,"resp_ip_bytes":328}
❯ hyperfine -M 3 -L prog "jq,./target/release/jaq" -- "zstdcat ~/tmp/json_logs/conn_big.log.zst | {prog} -cr .proto > /dev/null"
Benchmark 1: zstdcat ~/tmp/json_logs/conn_big.log.zst | jq -cr .proto > /dev/null
  Time (mean ± σ):     25.292 s ±  0.059 s    [User: 26.299 s, System: 0.617 s]
  Range (min … max):   25.233 s … 25.351 s    3 runs
 
Benchmark 2: zstdcat ~/tmp/json_logs/conn_big.log.zst | ./target/release/jaq -cr .proto > /dev/null
  Time (mean ± σ):     38.434 s ±  0.123 s    [User: 37.036 s, System: 3.545 s]
  Range (min … max):   38.304 s … 38.547 s    3 runs
 
Summary
  'zstdcat ~/tmp/json_logs/conn_big.log.zst | jq -cr .proto > /dev/null' ran
    1.52 ± 0.01 times faster than 'zstdcat ~/tmp/json_logs/conn_big.log.zst | ./target/release/jaq -cr .proto > /dev/null'

Since this is often the only thing I do, I have a stupidly simple tool called json-cut that does this using github.com/buger/jsonparser, which shows how fast this sort of thing can be:

❯ hyperfine -M 3 -L prog "~/projects/json-cut/json-cut proto" -- "zstdcat ~/tmp/json_logs/conn_big.log.zst | {prog} > /dev/null"        
Benchmark 1: zstdcat ~/tmp/json_logs/conn_big.log.zst | ~/projects/json-cut/json-cut proto > /dev/null
  Time (mean ± σ):      2.077 s ±  0.005 s    [User: 3.458 s, System: 0.358 s]
  Range (min … max):    2.071 s …  2.080 s    3 runs
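
For a sense of why that is so much faster: json-cut never builds a full JSON value per line, it just scans each line for the key and slices out the value. A crude Rust analogue of that idea (untested sketch; the real tool is Go code built on github.com/buger/jsonparser, and this only handles a top-level string field with no escape sequences, which happens to hold for these conn logs):

use std::io::{self, BufRead, BufWriter, Write};

fn main() -> io::Result<()> {
    // Hypothetical "field-cut" sketch, not the real json-cut.
    // Looks for `"KEY":"` in each line and prints what follows up to the next quote.
    let key = std::env::args().nth(1).expect("usage: field-cut KEY");
    let needle = format!("\"{key}\":\"");
    let stdin = io::stdin().lock();
    let mut out = BufWriter::new(io::stdout().lock());
    for line in stdin.lines() {
        let line = line?;
        if let Some(start) = line.find(&needle) {
            let rest = &line[start + needle.len()..];
            if let Some(end) = rest.find('"') {
                // No JSON value tree, no per-field allocation beyond the line itself.
                writeln!(out, "{}", &rest[..end])?;
            }
        }
    }
    out.flush()
}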
 

I noticed that the high sys time in this case appears to come from not using buffered IO on stdout. Applying this small change makes it 8% faster overall:

diff --git a/jaq/src/main.rs b/jaq/src/main.rs
index 02b2390..2ec3e0b 100644
--- a/jaq/src/main.rs
+++ b/jaq/src/main.rs
@@ -3,6 +3,7 @@ use jaq_interpret::{Ctx, Filter, FilterT, ParseCtx, RcIter, Val};
 use std::io::{self, BufRead, Write};
 use std::path::PathBuf;
 use std::process::{ExitCode, Termination};
+use std::io::BufWriter;
 
 #[cfg(feature = "mimalloc")]
 #[global_allocator]
@@ -455,10 +456,11 @@ fn print(cli: &Cli, val: Val, writer: &mut impl Write) -> io::Result<()> {
     Ok(())
 }
 
-fn with_stdout<T>(f: impl FnOnce(&mut io::StdoutLock) -> Result<T, Error>) -> Result<T, Error> {
-    let mut stdout = io::stdout().lock();
-    let y = f(&mut stdout)?;
-    stdout.flush()?;
+fn with_stdout<T>(f: impl FnOnce(&mut BufWriter<io::StdoutLock>) -> Result<T, Error>) -> Result<T, Error> {
+    let stdout = io::stdout().lock();
+    let mut buffered = BufWriter::new(stdout);
+    let y = f(&mut buffered)?;
+    buffered.flush()?;
     Ok(y)
 }

That said, it should probably only buffer stdout when stdout is not a TTY.
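
Something along those lines could look like this (untested sketch using std::io::IsTerminal; the real with_stdout returns jaq's own Error type rather than io::Result, and the benchmark below still uses the unconditional BufWriter change above):

use std::io::{self, BufWriter, IsTerminal, Write};

fn with_stdout<T>(f: impl FnOnce(&mut dyn Write) -> io::Result<T>) -> io::Result<T> {
    let stdout = io::stdout();
    if stdout.is_terminal() {
        // Interactive: keep the plain lock so output shows up right away.
        let mut out = stdout.lock();
        let y = f(&mut out)?;
        out.flush()?;
        Ok(y)
    } else {
        // Piped or redirected: wrap in a BufWriter to cut down on write syscalls.
        let mut out = BufWriter::new(stdout.lock());
        let y = f(&mut out)?;
        out.flush()?;
        Ok(y)
    }
}

An alternative would be to keep the BufWriter unconditionally and just flush after every output value when stdout is a terminal.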

❯ hyperfine -M 3 -L prog "./target/release/jaq.orig,./target/release/jaq" -- "zstdcat ~/tmp/json_logs/conn_big.log.zst | {prog} -cr .proto > /dev/null"
Benchmark 1: zstdcat ~/tmp/json_logs/conn_big.log.zst | ./target/release/jaq.orig -cr .proto > /dev/null
  Time (mean ± σ):     38.410 s ±  0.136 s    [User: 36.983 s, System: 3.557 s]
  Range (min … max):   38.259 s … 38.523 s    3 runs
 
Benchmark 2: zstdcat ~/tmp/json_logs/conn_big.log.zst | ./target/release/jaq -cr .proto > /dev/null
  Time (mean ± σ):     35.687 s ±  0.202 s    [User: 36.368 s, System: 1.301 s]
  Range (min … max):   35.497 s … 35.899 s    3 runs
 
Summary
  'zstdcat ~/tmp/json_logs/conn_big.log.zst | ./target/release/jaq -cr .proto > /dev/null' ran
    1.08 ± 0.01 times faster than 'zstdcat ~/tmp/json_logs/conn_big.log.zst | ./target/release/jaq.orig -cr .proto > /dev/null'