Measuring lean overhead?
chadbrewbaker opened this issue
The Lean 4 version is about 10x slower than BSD wc on Apple Silicon. I will have to dig in tomorrow, but I am guessing it is a combination of the buffer size (5 seems small) and the overhead of converting the buffer to a list before passing it along.
```
% /usr/bin/time -l cat mobydick.txt | ./build/bin/WordCount-lean
        0.08 real         0.00 user         0.00 sys
             1605632  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                 180  page reclaims
                   2  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                  34  voluntary context switches
                  13  involuntary context switches
             7072637  instructions retired
             3438399  cycles elapsed
             1131264  peak memory footprint
Characters: 1276235 / Words: 218951 / Lines: 22317
```
```
% /usr/bin/time -l cat mobydick.txt | wc
        0.00 real         0.00 user         0.00 sys
             1605632  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                 176  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                 130  voluntary context switches
                   2  involuntary context switches
             6196961  instructions retired
             3683665  cycles elapsed
             1114816  peak memory footprint
   22316  215864 1276235
```
First of all, thank you for submitting this issue!
5 is definitely a sub-optimal buffer size! And I had spent zero time de-pessimizing/optimizing it at all, as you can tell from the list conversion in IOfoldl.
I also don't recommend the benchmarking methods used at all. Not enough science.
Despite all that, I was kind of surprised that this program was within the same ballpark as my system wc.
I'm sure this section could be much better, and IOfoldl could have a better implementation that doesn't waste memory and cycles on converting to a list. I will see about remedying that, if you wouldn't mind running this again on your hardware?
What do you think of this change? I updated the buffer size to something more reasonable, and we use ByteArray.foldl instead of the conversion and List.foldl. This does seem to improve the program a decent amount.
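For reference, a single ByteArray.foldl pass can carry all three counters in its accumulator without ever materializing a list. This is a minimal sketch (the names and state layout are illustrative, not the actual patch):

```lean
-- Hypothetical sketch: fold a chunk of input directly as bytes.
-- Accumulator is (chars, words, lines, inWord); no intermediate List is built.
def countChunk (buf : ByteArray) (st : Nat × Nat × Nat × Bool) :
    Nat × Nat × Nat × Bool :=
  buf.foldl (init := st) fun (chars, words, lines, inWord) b =>
    -- space, tab, newline, carriage return delimit words
    let isSpace := b == 32 || b == 9 || b == 10 || b == 13
    let lines := if b == 10 then lines + 1 else lines
    -- a word starts on a non-space byte when we were not already in a word
    let words := if !isSpace && !inWord then words + 1 else words
    (chars + 1, words, lines, !isSpace)
```

Threading the tuple through successive reads and printing it at EOF would reproduce the Characters/Words/Lines output above, with the whole inner loop staying on unboxed bytes.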
ByteArray tends to be the fastest approach in Python/Rust; I haven't delved into Lean's. With a 4-5 state automaton you can parse UTF-8 up to prefix-coding correctness without a lookup table of legal/illegal glyphs.
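The small automaton mentioned above can be sketched in Lean, with the state being just the number of continuation bytes still expected. This checks prefix-coding structure only (it does not reject overlong encodings or surrogates), and the function name is illustrative:

```lean
-- Hypothetical sketch of a UTF-8 prefix-validity automaton.
-- State: how many continuation bytes (10xxxxxx) are still expected.
-- 0xxxxxxx → ASCII; 110xxxxx → expect 1; 1110xxxx → expect 2; 11110xxx → expect 3.
def utf8Step (expect : Nat) (b : UInt8) : Option Nat :=
  if expect > 0 then
    -- mid-sequence: byte must be a continuation byte 10xxxxxx
    if b &&& 0xC0 == 0x80 then some (expect - 1) else none
  else if b &&& 0x80 == 0x00 then some 0  -- ASCII lead
  else if b &&& 0xE0 == 0xC0 then some 1  -- 2-byte lead
  else if b &&& 0xF0 == 0xE0 then some 2  -- 3-byte lead
  else if b &&& 0xF8 == 0xF0 then some 3  -- 4-byte lead
  else none                               -- stray continuation / invalid lead
```

Folding utf8Step over the bytes and requiring the final state to be some 0 would validate prefix structure in the same single pass as the counting fold.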