Measuring lean overhead?
chadbrewbaker opened this issue
The Lean 4 version is about 10x slower than BSD wc on Apple Silicon. I will have to dig in tomorrow, but I am guessing it is a combination of the buffer size (5 seems small) and the overhead of converting the buffer to a list before passing it along.
```
% /usr/bin/time -l cat mobydick.txt | ./build/bin/WordCount-lean
        0.08 real         0.00 user         0.00 sys
             1605632  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                 180  page reclaims
                   2  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                  34  voluntary context switches
                  13  involuntary context switches
             7072637  instructions retired
             3438399  cycles elapsed
             1131264  peak memory footprint
Characters: 1276235 / Words: 218951 / Lines: 22317
```
```
% /usr/bin/time -l cat mobydick.txt | wc
        0.00 real         0.00 user         0.00 sys
             1605632  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                 176  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                 130  voluntary context switches
                   2  involuntary context switches
             6196961  instructions retired
             3683665  cycles elapsed
             1114816  peak memory footprint
   22316  215864 1276235
```
First of all, thank you for submitting this issue!
5 is definitely a sub-optimal buffer size! And I had spent zero time de-pessimizing/optimizing it at all, as you can tell from the list conversion in IOfoldl.
I also don't recommend the benchmarking methods used at all. Not enough science.
Despite all that, I was kind of surprised that this program was within the same ballpark as my system wc.
I'm sure this section could be much better, and IOfoldl could have a better implementation that doesn't waste memory and cycles on converting to a list. I will see about remedying that, if you wouldn't mind running this again on your hardware?
What do you think of this change? I updated the buffer size to something more reasonable, and we use ByteArray.foldl instead of the conversion and List.foldl. This does seem to improve the program a decent amount.
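For reference, a single ByteArray.foldl pass can carry all three counters in its accumulator without ever materializing a list. This is a minimal sketch (the names and state layout are illustrative, not the actual patch):

```lean
-- Hypothetical sketch: fold a chunk of input directly as bytes.
-- Accumulator is (chars, words, lines, inWord); no intermediate List is built.
def countChunk (buf : ByteArray) (st : Nat × Nat × Nat × Bool) :
    Nat × Nat × Nat × Bool :=
  buf.foldl (init := st) fun (chars, words, lines, inWord) b =>
    -- space, tab, newline, carriage return delimit words
    let isSpace := b == 32 || b == 9 || b == 10 || b == 13
    let lines := if b == 10 then lines + 1 else lines
    -- a word starts on a non-space byte when we were not already in a word
    let words := if !isSpace && !inWord then words + 1 else words
    (chars + 1, words, lines, !isSpace)
```

Threading the tuple through successive reads and printing it at EOF would reproduce the Characters/Words/Lines output above, with the whole inner loop staying on unboxed bytes.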
ByteArray tends to be the fastest approach in Python/Rust; I haven't delved into Lean's. With a 4-5 state automaton you can parse UTF-8 up to prefix-coding correctness without a lookup table of legal/illegal glyphs.
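The small automaton mentioned above can be sketched in Lean, with the state being just the number of continuation bytes still expected. This checks prefix-coding structure only (it does not reject overlong encodings or surrogates), and the function name is illustrative:

```lean
-- Hypothetical sketch of a UTF-8 prefix-validity automaton.
-- State: how many continuation bytes (10xxxxxx) are still expected.
-- 0xxxxxxx → ASCII; 110xxxxx → expect 1; 1110xxxx → expect 2; 11110xxx → expect 3.
def utf8Step (expect : Nat) (b : UInt8) : Option Nat :=
  if expect > 0 then
    -- mid-sequence: byte must be a continuation byte 10xxxxxx
    if b &&& 0xC0 == 0x80 then some (expect - 1) else none
  else if b &&& 0x80 == 0x00 then some 0  -- ASCII lead
  else if b &&& 0xE0 == 0xC0 then some 1  -- 2-byte lead
  else if b &&& 0xF0 == 0xE0 then some 2  -- 3-byte lead
  else if b &&& 0xF8 == 0xF0 then some 3  -- 4-byte lead
  else none                               -- stray continuation / invalid lead
```

Folding utf8Step over the bytes and requiring the final state to be some 0 would validate prefix structure in the same single pass as the counting fold.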