Counting words in different programming languages.
See the article on this project: http://juditacs.github.io/2015/11/26/wordcount.html
or the follow-up article: http://juditacs.github.io/2016/03/19/wordcount2.html
Updated: April 2, 2016
Only the programs that finish are listed; the rest run out of memory.
Rank | Experiment | CPU seconds | User time | Maximum memory | Contributor |
---|---|---|---|---|---|
1 | rust/wordcount/wordcount | 154.86 | 148.16 | 6867252 | Joshua Holmer |
2 | java -Xmx6G -classpath java:java/trove-3.0.3.jar WordCountOptimized | 217.75 | 262.39 | 3929384 | Sam Van Oort |
3 | cpp/wordcount_clang | 225.96 | 214.96 | 4373408 | Dmitry Andreev, Matias Fontanini, Judit Acs |
4 | cpp/wordcount | 230.23 | 217.0 | 4373432 | Dmitry Andreev, Matias Fontanini, Judit Acs |
5 | d/wordcount | 231.02 | 219.36 | 6294800 | Pavel Chebotarev |
6 | c/wordcount | 296.55 | 284.99 | 2424084 | gaebor |
7 | python/wordcount_py2.py | 302.02 | 289.65 | 3893812 | Gabor Szabo |
8 | go/bin/wordcount | 315.19 | 305.57 | 6084804 | David Siklosi |
9 | php7.0 php/wordcount.php | 429.57 | 345.3 | 4111168 | Braun Patrik |
10 | python/wordcount_py2_baseline.py | 511.12 | 490.27 | 8802472 | Judit Acs |
11 | mono csharp/WordCount.exe | 733.7 | 560.38 | 4783840 | Tim Posey, Peter Szel |
12 | perl/wordcount.pl | 778.48 | 758.74 | 7100124 | Larion Garaczi, Judit Acs |
13 | mono csharp/WordCountList.exe | 787.18 | 618.81 | 4787352 | |
14 | java -Xmx6G -classpath java WordCountBaseline | 791.52 | 1077.33 | 6150272 | Sam Van Oort, Rick Hendricksen, Dávid Márk Nemeskey |
15 | python/wordcount_py3.py | 877.93 | 854.83 | 7672096 | Judit Acs |
16 | php5.6 php/wordcount.php | 1051.85 | 940.92 | 12682668 | Braun Patrik |
17 | lua lua/wordcount.lua | 1134.88 | 1029.69 | 7023604 | daurnimator |
18 | julia julia/wordcount.jl | 1554.77 | 1519.4 | 7393432 | Attila Zseder, getzdan |
19 | bash/wordcount.sh | 2432.64 | 2538.23 | 13772 | Judit Acs |
20 | elixir/wordcount | 2596.05 | 2552.12 | 13938560 | Norbert Melzer |
21 | cpp/wordcount_baseline | 3005.61 | 2909.42 | 5965548 | Judit Acs |
Updated: March 28, 2016, 00:53 CET
Rank | Experiment | CPU seconds | User time | Maximum memory | Contributor |
---|---|---|---|---|---|
1 | rust/wordcount/wordcount | 21.23 | 20.28 | 990008 | Joshua Holmer |
2 | d/wordcount | 31.09 | 29.72 | 752776 | Pavel Chebotarev |
3 | cpp/wordcount | 41.98 | 39.92 | 775132 | Matias Fontanini, Judit Acs |
4 | go/bin/wordcount | 42.15 | 40.28 | 859260 | David Siklosi |
5 | python/wordcount_py2.py | 43.13 | 41.55 | 596396 | Gabor Szabo |
6 | php7.0 php/wordcount.php | 53.85 | 39.34 | 709908 | Braun Patrik |
7 | python/wordcount_py2_baseline.py | 71.46 | 69.34 | 1437328 | Judit Acs |
8 | mono csharp/WordCountList.exe | 104.0 | 72.42 | 899168 | Peter Szel |
9 | python/wordcount_py3.py | 110.37 | 107.15 | 1245036 | Judit Acs |
10 | perl/wordcount.pl | 120.85 | 117.86 | 1242060 | Larion Garaczi, Judit Acs |
11 | lua lua/wordcount.lua | 135.51 | 116.5 | 1210312 | daurnimator |
12 | php5.6 php/wordcount.php | 136.31 | 118.12 | 2126484 | Braun Patrik |
13 | java -classpath java WordCount | 143.01 | 128.67 | 1828536 | Dávid Márk Nemeskey |
14 | julia julia/wordcount.jl | 149.2 | 147.09 | 2541028 | Attila Zseder, getzdan |
15 | scala -J-Xmx2g -classpath scala Wordcount | 179.96 | 234.61 | 1423256 | Hans van den Bogert |
16 | bash/wordcount.sh | 285.97 | 301.47 | 13612 | Judit Acs |
17 | haskell/WordCount | 332.53 | 320.72 | 4217432 | Larion Garaczi |
18 | cpp/wordcount_baseline | 362.35 | 344.3 | 983244 | Judit Acs |
19 | elixir/wordcount | 397.52 | 387.1 | 2862204 | Norbert Melzer |
20 | nodejs javascript/wordcount.js | 582.64 | 580.87 | 974596 | Laci Kundra |
21 | nodejs typescript/wordcount.js | 636.28 | 609.03 | 921444 | Braun Patrik |
The task is to split a text and count each word's frequency, then print the list sorted by frequency in decreasing order. Ties are printed in alphabetical order.
- the input is read from STDIN
- the input is always encoded in UTF-8
- output is printed to STDOUT
- break only on space, tab and newline (do not break on non-breaking space)
- do not write anything to STDERR
- the output is tab-separated
- sort by frequency AND secondary sort in alphabetical order
- try to write simple code with few dependencies
  - standard library
- single-thread is preferred but you can add multi-threaded or multicore versions too
The output should contain lines like this:
word <tab> freq
$ echo "apple pear apple art" | python2 python/wordcount.py
apple 2
art 1
pear 1
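
The rules above can be sketched in a few lines of Python 3. This is only an illustration, not one of the benchmarked programs: the names `SEPARATORS`, `count_words` and `write_counts` are made up for this example, and it assumes that "alphabetical order" means plain byte order of the UTF-8 encoded words. It works on bytes so that only space, tab and newline split words, while a non-breaking space (bytes C2 A0 in UTF-8) stays inside the word.

```python
import re
from collections import Counter

# Break only on space, tab and newline (rule from the list above).
SEPARATORS = re.compile(rb"[ \t\n]+")


def count_words(data):
    """Return (word, freq) pairs sorted by decreasing freq, then by word."""
    counts = Counter(word for word in SEPARATORS.split(data) if word)
    # Assumption: ties are ordered by raw UTF-8 byte order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def write_counts(data, out):
    """Write one tab-separated 'word<TAB>freq' line per distinct word."""
    for word, freq in count_words(data):
        out.write(word + b"\t" + str(freq).encode("ascii") + b"\n")
```

To use it as a filter, add `import sys` and call `write_counts(sys.stdin.buffer.read(), sys.stdout.buffer)`. Reading the whole input at once keeps the sketch short but is not memory-friendly for the large input.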
scripts/create_input.sh downloads and unpacks the latest Hungarian Wikisource XML dump.
Why Wikisource? It's neither too small nor too large and, more importantly, it's valid UTF-8.
Why Hungarian? There are many non-ASCII characters and the number of different word types is high.
scripts/create_large_input.sh downloads and unpacks the latest Hungarian Wikipedia dump.
This is the largest input used for comparison; see the first leaderboard.
To test on a small sample:
time cat data/huwikisource-latest-pages-meta-current.xml | head -10000 | python3 python/wordcount_py3.py > python_out
I strongly recommend building the Docker image rather than installing every package by hand. A manual install is possible, though: see the installation commands in the Dockerfile.
You can run the experiment in a Docker container. A Dockerfile is provided; to build the image, run:
docker build -t wordcount --rm .
This might take a while.
Start a container from the image:
docker run -it wordcount bash
You should see the cloned directory in /root
cd wordcount
bash scripts/create_input.sh
or the full dataset:
bash scripts/create_large_input.sh
bash scripts/build.sh
scripts/test.sh runs all tests for one language or, more precisely, for a single command:
bash scripts/test.sh "python2 python/wordcount_py2.py"
Or
bash scripts/test.sh python/wordcount_py2.py
if the file is executable and has a valid shebang line.
The script either prints OK or the list of failed tests and a final FAIL.
All commands are listed in the file run_commands.txt, and the script scripts/test_all.sh runs test.sh with each of them:
bash scripts/test_all.sh
If all tests pass, the scripts work reasonably well. This does not guarantee that every program produces identical output (see the full test later), but for now we consider them good enough for testing.
The following command runs each test twice and appends the results to results.txt. The optional last argument is a comment that is added to the end of each result line.
bash scripts/compare.sh data/huwikisource-latest-pages-meta-current.xml 2 "full huwikisource"
Or test it on a part of huwikisource:
bash scripts/compare.sh <( head -10000 data/huwikisource-latest-pages-meta-current.xml) 1
results.txt is a tab-separated file that can be formatted as a Markdown table with this command:
cat results.txt | python2 scripts/evaluate_results.py
This script prints the fastest run for each command in a Markdown table like this:
Rank | Experiment | CPU seconds | User time | Maximum memory | Contributor |
---|---|---|---|---|---|
1 | rust/wordcount/wordcount | 20.57 | 19.79 | 990008 | Joshua Holmer |
2 | cpp/wc_vector | 33.3 | 31.93 | 775952 | Matias Fontanini, Judit Acs |
3 | python/wordcount_py2gabor.py | 40.13 | 38.71 | 594800 | Gabor Szabo |
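
The "fastest run per command" step can be sketched as follows. This is not the code of scripts/evaluate_results.py: the function names are hypothetical, and the column layout of results.txt (command, CPU seconds, user time, maximum memory, optional comment) is an assumption made only for this example. The contributor column is left out because it comes from information not modelled here.

```python
def fastest_runs(lines):
    """Keep the run with the lowest CPU time for each command."""
    best = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        # Assumed layout: command, CPU seconds, user time, max memory[, comment]
        command = fields[0]
        run = (float(fields[1]), float(fields[2]), int(fields[3]))
        if command not in best or run[0] < best[command][0]:
            best[command] = run
    # Rank commands by CPU seconds, fastest first.
    return sorted(best.items(), key=lambda item: item[1][0])


def to_markdown(ranked):
    """Format the ranked runs in the table style used in this README."""
    rows = [
        "Rank | Experiment | CPU seconds | User time | Maximum memory |",
        "---|---|---|---|---|",
    ]
    for rank, (command, (cpu, user, memory)) in enumerate(ranked, start=1):
        rows.append(f"{rank} | {command} | {cpu} | {user} | {memory} |")
    return "\n".join(rows)
```

With such a layout, `print(to_markdown(fastest_runs(open("results.txt"))))` would print one row per command.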
Adding a new programming language or a new version for an existing programming language consists of the following steps:
- Add dependencies to the Dockerfile, typically by adding the package to the existing apt-get package list.
- If it needs compiling or any other setup method, add it to scripts/build.sh
- Add the command that invokes your program to run_commands.txt
- If your executable differs from the source file, add the executable-to-source mapping to binary_mapping.txt. This tab-separated file is used by scripts/evaluate_results.py to find the contributors of each program.
- Make sure all dependencies are installed via standard packages and your code compiles.
- Make sure your code passes all the tests.
- Make sure it runs for less than two minutes for 100,000 lines of text. If it is slower, it doesn't make much sense to add it.
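
One way to check the last point before submitting is a small timing helper like the one below. It is a sketch, not part of the repository: the function name and its parameters are hypothetical, and it measures wall-clock time, which is only an approximation of the CPU time reported in the leaderboards.

```python
import itertools
import subprocess
import time


def runs_within_limit(command, input_path, max_lines=100_000, limit_seconds=120.0):
    """Feed the first max_lines lines of input_path to command and time it.

    command is an argument list such as ["python3", "python/wordcount_py3.py"].
    Returns (finished_in_time, elapsed_seconds).
    """
    with open(input_path, "rb") as infile:
        sample = b"".join(itertools.islice(infile, max_lines))
    start = time.perf_counter()
    # The program reads from STDIN; its STDOUT is discarded.
    subprocess.run(command, input=sample, stdout=subprocess.DEVNULL, check=True)
    elapsed = time.perf_counter() - start
    return elapsed < limit_seconds, elapsed
```

For example, `runs_within_limit(["python3", "python/wordcount_py3.py"], "data/huwikisource-latest-pages-meta-current.xml")` would time the Python 3 version on the first 100,000 lines of the Wikisource dump.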