reborg / pwc

Like the venerable wc unix command, but in parallel on multi-core CPUs.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

pwc

pwc is a parallel vesion of the venerable Unix command line utility wc modified to run on multiple cores using Clojure reducers.

Is pwc really that faster than wc?

PWC is not yet faster than wc even removing the JVM startup time. But I'm still Working on it and there are more optimisations to apply. I'm confident it will be faster at some point. At the moment:

time pwc ~/tmp/3.1G-huge-file.txt (41s)
time wc  ~/tmp/3.1G-huge-file.txt (19s)

Additional features

pwc also calulcates word frequencies ordered by most frequent words:

echo "Very nice counting. Really very nice" > temp.txt
pwc -f temp.txt
1 6 36 temp.txt
(["nice" 2] ["very" 1] ["Really" 1] ["counting" 1] ["Very" 1])

What's missing

  • pwc doesn't support being part of a pipe, so a filename is required (instead of optional like wc).
  • pwc does not accept multiple file inputs at once
  • pwc only implemented flags are "f" and "t" and "h" for help
  • sligtly different way to handle word tokenization resulting in slightly different results

How to install

You need Java installed. Please try:

java -version

from the command line to see on what Java version you're running. Install Java following the instructions from the Oracle web site if you don't have it. Then follow the instructions below based on your version/OS.

jdk 1.7 on a Mac

brew tap reborg/taps
brew install pwc

jdk 1.6 on a Mac

Download the correct jsr166.jar to your endorsed Java Home folder (on Mac Mountain Lion, /System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home/lib/endorsed). A jsr166.jar is bundled with the github project if you happen to have jsk 1.6.0_51.

Not on a Mac

Other platforms are supported executing pwc as a plain Java executable:

curl -O https://github.com/reborg/pwc/archive/0.2.2.tar.gz
tar xvfz 0.2.2.tar.gz
cd 0.2.2
java -jar target/pwc.jar [-ft] file

I suggest to save pwc.jar in a known location and create an alias in your .bash_profile as:

alias pwc='java -jar /your/absolute/path/pwc.jar'
pwc(){ java -jar /your/absolute/path/pwc.jar $1; }

for everyday execution in your terminal.

Thanks

TODO:

  • more compatibility with wc flags
  • Add ordering to the partial reduce-f split if that improves overall perfs
  • Calculate split size based on size of the input
  • dependency on the jsr166.jar could be added as part of homebrew installation

About

Like the venerable wc unix command, but in parallel on multi-core CPUs.


Languages

Language:Clojure 93.5%Language:Perl 6.0%Language:Shell 0.6%