n8ta / rawk

A very fast awk interpreter

What is it?

A (WIP) bytecode multi-stack awk interpreter. The goal of rawk is to be the fastest awk for all programs.

What makes it unique?

Typing (yes even in awk!)

rawk uses type inference to determine the type of each variable (string, string numeric, number, or array) at compile time. This lets rawk emit bytecode that needs no runtime type checks in many scenarios. For code like

{ a = 1; b = 2; print a + b; }

rawk emits this code for print a + b (Gscl means Global scalar)

   6 GsclNum(0)                              args: [[]]                 push: [[Num]]
   7 GsclNum(1)                              args: [[]]                 push: [[Num]]
   8 Add                                     args: [[Num, Num]]         push: [[Num]]

The add instruction knows its operands are numbers and will not need to check types at runtime.

pub fn add(vm: &mut VirtualMachine, ip: usize, _imm: Immed) -> usize {
    let rhs: f64 = vm.pop_num(); // pop from the numeric stack
    let lhs: f64 = vm.pop_num(); // again
    vm.push_num(lhs+rhs);        // add them and push
    ip + 1                       // advance
}
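
To see how handlers like add fit together, here is a minimal, self-contained sketch of a VM with a dedicated numeric stack. Only the names VirtualMachine, Immed, pop_num, push_num and add come from the snippet above. The struct fields, the Op type, gscl_num, run and eval_a_plus_b are assumptions made for illustration and are not rawk's actual definitions.

```rust
// Hypothetical stand-ins: rawk's real types are not shown in this README.
type Immed = usize;
type Op = fn(&mut VirtualMachine, usize, Immed) -> usize;

struct VirtualMachine {
    num_stack: Vec<f64>,   // numeric stack (other stacks omitted in this sketch)
    global_nums: Vec<f64>, // global scalars inferred to be numbers
}

impl VirtualMachine {
    fn pop_num(&mut self) -> f64 {
        self.num_stack.pop().expect("numeric stack underflow")
    }
    fn push_num(&mut self, n: f64) {
        self.num_stack.push(n);
    }
}

// Push global scalar number `imm` onto the numeric stack.
fn gscl_num(vm: &mut VirtualMachine, ip: usize, imm: Immed) -> usize {
    let value = vm.global_nums[imm];
    vm.push_num(value);
    ip + 1
}

// Same shape as the handler above: no runtime type check is needed.
fn add(vm: &mut VirtualMachine, ip: usize, _imm: Immed) -> usize {
    let rhs = vm.pop_num();
    let lhs = vm.pop_num();
    vm.push_num(lhs + rhs);
    ip + 1
}

// Dispatch loop: each handler returns the next instruction pointer.
fn run(vm: &mut VirtualMachine, code: &[(Op, Immed)]) {
    let mut ip = 0;
    while ip < code.len() {
        let (op, imm) = code[ip];
        ip = op(vm, ip, imm);
    }
}

// Runs the equivalent of GsclNum(0), GsclNum(1), Add for given a and b.
fn eval_a_plus_b(a: f64, b: f64) -> f64 {
    let mut vm = VirtualMachine { num_stack: Vec::new(), global_nums: vec![a, b] };
    let code: [(Op, Immed); 3] = [(gscl_num as Op, 0), (gscl_num as Op, 1), (add as Op, 0)];
    run(&mut vm, &code);
    vm.pop_num()
}
```

In this sketch, eval_a_plus_b(1.0, 2.0) executes the three ops listed earlier and leaves 3.0 on the numeric stack. Because every handler only touches the numeric stack, none of them needs to inspect a type tag.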

If the types of a and b are variable (like below)

   { if ($1) { a = 1; b = 2; } else { a = "1"; b = "2" } print (a + b); }

rawk has no significant advantage here: it adds two more bytecode ops (VarToNum) to convert each operand from string -> number at runtime. For print (a + b) rawk emits

   16 GsclVar(0)                              args: [[]]                 push: [[Var]]
   17 VarToNum                                args: [[Var]]              push: [[Num]]
   18 GsclVar(1)                              args: [[]]                 push: [[Var]]
   19 VarToNum                                args: [[Var]]              push: [[Num]]
   20 Add                                     args: [[Num, Num]]         push: [[Num]]

Var (short for variable) is the stack for values whose type could be string, strnum, or number, so the type must be checked at runtime.
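
The two halves of this example can be sketched in a few lines: a compile-time step that falls back to Var when branches disagree on a type, and a runtime step that turns a Var into a number. This is an illustration under assumptions, not rawk's implementation. ScalarType, join, Value, var_to_num and str_to_num are hypothetical names, and the string conversion is a simplified take on awk's "leading numeric prefix" rule.

```rust
// Hypothetical compile-time types for a scalar variable.
#[derive(Clone, Copy, Debug, PartialEq)]
enum ScalarType {
    Str,
    StrNum,
    Num,
    Var, // type only known at runtime
}

// Merge the types a variable has on two control-flow paths.
// If both paths agree the type stays precise, otherwise it degrades to Var.
fn join(a: ScalarType, b: ScalarType) -> ScalarType {
    if a == b { a } else { ScalarType::Var }
}

// Hypothetical runtime representation of a value on the Var stack.
enum Value {
    Str(String),
    StrNum(String),
    Num(f64),
}

// Simplified string -> number conversion: parse the longest leading prefix
// that looks numeric, and fall back to 0 when there is none.
fn str_to_num(s: &str) -> f64 {
    let t = s.trim_start();
    let end = t
        .find(|c: char| !(c.is_ascii_digit() || matches!(c, '+' | '-' | '.' | 'e' | 'E')))
        .unwrap_or(t.len());
    // Every byte in the candidate is ASCII, so shrinking by one byte is safe.
    let mut candidate = &t[..end];
    while !candidate.is_empty() {
        if let Ok(n) = candidate.parse::<f64>() {
            return n;
        }
        candidate = &candidate[..candidate.len() - 1];
    }
    0.0
}

// What an op like VarToNum has to do: check the tag at runtime, then convert.
fn var_to_num(v: &Value) -> f64 {
    match v {
        Value::Num(n) => *n,
        Value::Str(s) | Value::StrNum(s) => str_to_num(s),
    }
}
```

For the program above, a is a number in one branch and a string in the other, so join(ScalarType::Num, ScalarType::Str) yields ScalarType::Var and the conversion op is required. In the first example both assignments are numbers, the join stays Num, and no conversion is emitted.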

Very fast IO

rawk uses a ring buffer to read from files without copying unless the data is needed. rawk's file reading is faster than all other awks I am aware of. I have not yet optimized output, so I have no idea how it compares. Here's a comparison of various awks reading every line in a file, storing it, and then printing the final value.

./assets/io.png

(onetrueawk is far to the right of this chart so I've omitted it)
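
The idea of handing out borrowed slices instead of copies can be sketched as follows. This is not rawk's reader: it uses a plain linear buffer that shifts the unfinished line to the front rather than a true ring buffer, and for_each_line, its cap parameter and last_line are hypothetical names.

```rust
use std::io::{self, Read};

// Calls `f` once per line (without the trailing '\n'). Each line is a slice
// borrowed from the internal buffer, so nothing is copied unless `f` copies it.
fn for_each_line<R: Read>(
    mut reader: R,
    cap: usize,
    mut f: impl FnMut(&[u8]),
) -> io::Result<()> {
    let mut buf = vec![0u8; cap.max(1)];
    let mut len = 0; // number of valid bytes at the front of `buf`
    loop {
        if len == buf.len() {
            // A single line is longer than the buffer: grow it.
            buf.resize(buf.len() * 2, 0);
        }
        let n = reader.read(&mut buf[len..])?;
        if n == 0 {
            if len > 0 {
                f(&buf[..len]); // final line without a trailing newline
            }
            return Ok(());
        }
        len += n;
        // Hand out every complete line currently in the buffer.
        let mut start = 0;
        while let Some(pos) = buf[start..len].iter().position(|&b| b == b'\n') {
            f(&buf[start..start + pos]);
            start += pos + 1;
        }
        // Keep the unfinished tail by moving it to the front.
        buf.copy_within(start..len, 0);
        len -= start;
    }
}

// Mirrors the benchmark described above: store every line, return the last.
// This caller chooses to copy, which is the "data is needed" case.
fn last_line<R: Read>(reader: R) -> io::Result<Vec<u8>> {
    let mut last = Vec::new();
    for_each_line(reader, 64 * 1024, |line| {
        last.clear();
        last.extend_from_slice(line);
    })?;
    Ok(last)
}
```

A caller that only counts lines or inspects a field never allocates per line in this sketch. A ring buffer goes one step further by avoiding the shift of the unfinished tail.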

Todo:

  1. Reading from stdin
  2. Native string functions
    1. index
    2. match
    3. split
    4. sprintf
  3. Redirect output to file
    • close() function
  4. Pattern Ranges
  5. The columns runtime should not duplicate work when the same field is looked up multiple times
  6. The columns runtime should support assignment
  7. Divide by 0 needs to print an error
  8. All the builtin variables that are read only:
    1. ARGC (float)
    2. FILENAME (str)
    3. FNR (float)
    4. NF (float)
    5. NR (float)
    6. RLENGTH (float)
    7. RSTART (float)
  9. Builtins that are read/write
    1. CONVFMT (str)
    2. FS (str)
    3. OFMT (str)
    4. OFS (str)
    5. ORS (str)
    6. RS (str)
    7. SUBSEP (str)
  10. Builtins that are arrays (in this impl read only)
    1. ARGV
    2. ENVIRON

License

Mawk is GPLv2 (./mawk-regex-sys/LICENSE). Quick Drop Deque is MIT (./quick-drop-deque/LICENSE). The combined project is GPLv2.

Running the tests

Install other awks to test against (they should be on your path with these exact names)

  1. gawk (linux/mac you already have it)
  2. mawk - build from src
  3. goawk - need the go toolchain, then go get
  4. onetrueawk - super easy and fast build from src

By default, the tests only check correctness against the other awks and an oracle result.

cargo test

Perf tests

To run the perf tests, set the env var "jperf" to "true", run cargo build --release first, and then run cargo test -- --test-threads=1. This tests the speed of the release binary against the other awks.
