Standard-compliant `wc` implementation
ashvardanian opened this issue
The 4c738ea commit introduces a prototype for a StringZilla-based command-line toolkit, including a replacement for the wc
utility. The original prototype suggests a 3x performance improvement opportunity, but it can't currently handle multiple inputs or flags. Those should be easy to add in cli/wc.py,
purely in the Python layer. We are aiming to match the following specification:
```
$ wc --help
Usage: wc [OPTION]... [FILE]...
  or:  wc [OPTION]... --files0-from=F
Print newline, word, and byte counts for each FILE, and a total line if
more than one FILE is specified. A word is a non-zero-length sequence of
characters delimited by white space.

With no FILE, or when FILE is -, read standard input.

The options below may be used to select which counts are printed, always in
the following order: newline, word, character, byte, maximum line length.
  -c, --bytes            print the byte counts
  -m, --chars            print the character counts
  -l, --lines            print the newline counts
      --files0-from=F    read input from the files specified by
                           NUL-terminated names in file F;
                           If F is - then read names from standard input
  -L, --max-line-length  print the maximum display width
  -w, --words            print the word counts
      --help     display this help and exit
      --version  output version information and exit

GNU coreutils online help: <https://www.gnu.org/software/coreutils/>
Report any translation bugs to <https://translationproject.org/team/>
Full documentation <https://www.gnu.org/software/coreutils/wc>
or available locally via: info '(coreutils) wc invocation'
```
I've merged some intermediate patches by @ghazariann, but some parts have to be reimplemented, like this one:
```python
if args.max_line_length:
    max_line_length = max(len(line) for line in str(mapped_bytes).split("\n"))
    counts["max_line_length"] = max_line_length
```
It is expensive to convert to str, and even more expensive to split it.
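One cheaper, byte-oriented sketch (it still ignores the tab-stop and wide-character rules discussed further down, and assumes the mapped buffer exposes a bytes-like find(), as mmap and bytes do):

```python
def max_line_length_bytes(buf) -> int:
    """Length in bytes of the longest line, found by scanning newline
    offsets instead of materializing a str and splitting it."""
    longest = start = 0
    while True:
        pos = buf.find(b"\n", start)
        if pos < 0:
            return max(longest, len(buf) - start)  # trailing partial line
        longest = max(longest, pos - start)
        start = pos + 1
```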
We're missing tests and don't handle locale.
Some thoughts on tests:
Stdin
Redirection - note that --files0-from needs to pull a NUL-delimited list of filenames; if F is -, the names come from standard input (see the sketch after these examples).

```sh
find . -name '*.[ch]' -print0 | wc -L --files0-from=-
cat xxx | wc -l
```
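A minimal sketch of that name-list parsing, assuming the Python layer simply reads the whole list into memory (read_files0 is an illustrative helper, not an existing function in cli/wc.py):

```python
import os
import sys
from pathlib import Path

def read_files0(arg: str):
    """Yield filenames from a NUL-delimited list, as --files0-from expects.
    If arg is '-', the list is read from standard input."""
    data = sys.stdin.buffer.read() if arg == "-" else Path(arg).read_bytes()
    for name in data.split(b"\0"):
        if name:  # skip the empty chunk after a trailing NUL
            yield os.fsdecode(name)
```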
Word Count
We currently only count spaces as separators, so add tests for adjacent whitespace and for other whitespace characters (tabs, newlines, etc.).
Line Count
If a file ends in a non-newline character, its trailing partial line is not counted.
Max Line Length
Tab stops are set at every 8th column. Display widths of wide characters are counted, and non-printable characters are given width 0 (a sketch of this column accounting follows).
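A rough sketch of that accounting using only the standard library; the category and east-asian-width rules below only approximate the wcwidth() behavior GNU wc relies on:

```python
import unicodedata

def display_width(line: str, tab_stop: int = 8) -> int:
    """Approximate wc -L column accounting: tabs jump to the next tab stop,
    control characters and combining marks take no columns, and
    wide/fullwidth East Asian characters take two."""
    col = 0
    for ch in line:
        if ch == "\t":
            col += tab_stop - (col % tab_stop)
        elif unicodedata.category(ch) in ("Cc", "Cf", "Mn", "Me"):
            continue
        elif unicodedata.east_asian_width(ch) in ("W", "F"):
            col += 2
        else:
            col += 1
    return col
```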
Locale
-m, --chars: print the character counts, as per the current locale (UTF-8 and UTF-16 support needed). Encoding errors are not counted. See locale.getencoding() / locale.setlocale().
- We'd need to scan for non-ASCII code points in the input.
-w, --words: uses locale-specific whitespace.
- wc likely doesn't really do this per locale; we'd need to test a few locales. To do so, we'd have to scan for non-ASCII bytes and compare them against a list of Unicode whitespace characters.
References
https://www.gnu.org/software/coreutils/manual/html_node/wc-invocation.html#wc-invocation
https://www.mkssoftware.com/docs/man1/wc.1.asp
Can we detect those locale-based settings in the Python implementation of wc, without changing the core C implementation and the Python binding?
For counting characters, we can call locale.getencoding() in Python; a naive approach would then be len(bytes.decode('utf-8')), which would not be performant. Ultimately we'd want to be able to scan for non-ASCII bytes (& 0x80) and consume them, since such a character could be 2-4 bytes long. If the library does not have a way to find bytes with the high bit set (& 0x80), we'd have to add it.
For counting words, I believe we want a find_charset function that we can use with the whitespace character set (a pure-Python placeholder is sketched below).
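Until such a charset search is exposed, a placeholder for the word rule (count transitions from whitespace into non-whitespace) could look like this, using the ASCII whitespace set of the C locale:

```python
ASCII_WHITESPACE = frozenset(b" \t\n\r\v\f")

def word_count(buf) -> int:
    """Count words as maximal runs of non-whitespace bytes: increment on
    every transition from whitespace (or start of input) into non-whitespace."""
    words, in_word = 0, False
    for byte in buf:
        if byte in ASCII_WHITESPACE:
            in_word = False
        elif not in_word:
            words += 1
            in_word = True
    return words
```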
For the first part, we can temporarily compensate by performing several passes over the data, one for each multi-byte rune length.
UTF-8 looks like this: once you see the leftmost bit set, you can count the leading bits to determine the character size. Languages like Chinese will be almost entirely multi-byte characters. I speak Chinese, so I optimized this in mrjson. I'll set up tests next.
```
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
```
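Following those bit patterns, characters can be counted without decoding: only bytes of the form 10xxxxxx are continuation bytes, so counting everything else gives the character count for well-formed UTF-8 (this sketch ignores the rule that encoding errors are not counted):

```python
def utf8_char_count(buf) -> int:
    """Count UTF-8 characters without decoding: every byte except the
    10xxxxxx continuation bytes begins a new character (valid UTF-8 assumed)."""
    return sum((b & 0xC0) != 0x80 for b in buf)
```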