wordcount

Counting words in different programming languages.

See the article on this project: http://juditacs.github.io/2015/11/26/wordcount.html

or the follow-up article: http://juditacs.github.io/2016/03/19/wordcount2.html

Leaderboard

Full Hungarian Wikipedia

Updated: April 2, 2016

Only the programs that finish are listed. The rest run out of memory.

Rank	Experiment	CPU seconds	User time	Maximum memory	Contributor
1	rust/wordcount/wordcount	154.86	148.16	6867252	Joshua Holmer
2	java -Xmx6G -classpath java:java/trove-3.0.3.jar WordCountOptimized	217.75	262.39	3929384	Sam Van Oort
3	cpp/wordcount_clang	225.96	214.96	4373408	Dmitry Andreev, Matias Fontanini, Judit Acs
4	cpp/wordcount	230.23	217.0	4373432	Dmitry Andreev, Matias Fontanini, Judit Acs
5	d/wordcount	231.02	219.36	6294800	Pavel Chebotarev
6	c/wordcount	296.55	284.99	2424084	gaebor
7	python/wordcount_py2.py	302.02	289.65	3893812	Gabor Szabo
8	go/bin/wordcount	315.19	305.57	6084804	David Siklosi
9	php7.0 php/wordcount.php	429.57	345.3	4111168	Braun Patrik
10	python/wordcount_py2_baseline.py	511.12	490.27	8802472	Judit Acs
11	mono csharp/WordCount.exe	733.7	560.38	4783840	Tim Posey, Peter Szel
12	perl/wordcount.pl	778.48	758.74	7100124	Larion Garaczi, Judit Acs
13	mono csharp/WordCountList.exe	787.18	618.81	4787352
14	java -Xmx6G -classpath java WordCountBaseline	791.52	1077.33	6150272	Sam Van Oort, Rick Hendricksen, Dávid Márk Nemeskey
15	python/wordcount_py3.py	877.93	854.83	7672096	Judit Acs
16	php5.6 php/wordcount.php	1051.85	940.92	12682668	Braun Patrik
17	lua lua/wordcount.lua	1134.88	1029.69	7023604	daurnimator
18	julia julia/wordcount.jl	1554.77	1519.4	7393432	Attila Zseder, getzdan
19	bash/wordcount.sh	2432.64	2538.23	13772	Judit Acs
20	elixir/wordcount	2596.05	2552.12	13938560	Norbert Melzer
21	cpp/wordcount_baseline	3005.61	2909.42	5965548	Judit Acs

5 million lines from the Hungarian Wikipedia

Updated: March 28, 2016, 00:53 CET

Rank	Experiment	CPU seconds	User time	Maximum memory	Contributor
1	rust/wordcount/wordcount	21.23	20.28	990008	Joshua Holmer
2	d/wordcount	31.09	29.72	752776	Pavel Chebotarev
3	cpp/wordcount	41.98	39.92	775132	Matias Fontanini, Judit Acs
4	go/bin/wordcount	42.15	40.28	859260	David Siklosi
5	python/wordcount_py2.py	43.13	41.55	596396	Gabor Szabo
6	php7.0 php/wordcount.php	53.85	39.34	709908	Braun Patrik
7	python/wordcount_py2_baseline.py	71.46	69.34	1437328	Judit Acs
8	mono csharp/WordCountList.exe	104.0	72.42	899168	Peter Szel
9	python/wordcount_py3.py	110.37	107.15	1245036	Judit Acs
10	perl/wordcount.pl	120.85	117.86	1242060	Larion Garaczi, Judit Acs
11	lua lua/wordcount.lua	135.51	116.5	1210312	daurnimator
12	php5.6 php/wordcount.php	136.31	118.12	2126484	Braun Patrik
13	java -classpath java WordCount	143.01	128.67	1828536	Dávid Márk Nemeskey
14	julia julia/wordcount.jl	149.2	147.09	2541028	Attila Zseder, getzdan
15	scala -J-Xmx2g -classpath scala Wordcount	179.96	234.61	1423256	Hans van den Bogert
16	bash/wordcount.sh	285.97	301.47	13612	Judit Acs
17	haskell/WordCount	332.53	320.72	4217432	Larion Garaczi
18	cpp/wordcount_baseline	362.35	344.3	983244	Judit Acs
19	elixir/wordcount	397.52	387.1	2862204	Norbert Melzer
20	nodejs javascript/wordcount.js	582.64	580.87	974596	Laci Kundra
21	nodejs typescript/wordcount.js	636.28	609.03	921444	Braun Patrik

The task

The task is to split a text and count each word's frequency, then print the list sorted by frequency in decreasing order. Ties are printed in alphabetical order.

Rules

the input is read from STDIN
the input is always encoded in UTF-8
output is printed to STDOUT
break only on space, tab and newline (do not break on non-breaking space)
do not write anything to STDERR
the output is tab-separated
sort by frequency AND secondary sort in alphabetical order
try to write simple code with few dependencies, preferably only the standard library
single-thread is preferred but you can add multi-threaded or multicore versions too

The output should contain lines like this:

word <tab> freq

Example

$ echo "apple pear apple art" | python2 python/wordcount.py
apple   2
art     1
pear    1
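The rules and the example above can be sketched in a few lines of Python. This is a minimal illustration, not one of the programs in the leaderboard. The names `count_words` and `format_lines` are made up for this sketch, and it assumes that "alphabetical order" means ordering by raw UTF-8 bytes (which matches code point order). It works on bytes and splits with an explicit regular expression because `str.split()` without arguments would also break on the non-breaking space.

```python
import re
from collections import Counter

# Break only on space, tab and newline, as the rules require.
SEPARATORS = re.compile(rb"[ \t\n]+")


def count_words(data):
    """Return (word, frequency) pairs for a bytes input.

    Pairs are sorted by decreasing frequency; ties are sorted by the
    raw bytes of the word (an assumption about "alphabetical order").
    """
    words = (w for w in SEPARATORS.split(data) if w)  # drop empty pieces
    counts = Counter(words)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def format_lines(pairs):
    """Render the pairs as tab-separated lines, one word per line."""
    return b"".join(
        word + b"\t" + str(freq).encode("ascii") + b"\n"
        for word, freq in pairs
    )
```

To use it as a filter, read `sys.stdin.buffer`, pass the bytes to `count_words`, and write the result of `format_lines` to `sys.stdout.buffer`. Reading the whole input at once keeps the sketch short but will not scale to the full Wikipedia dump.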

Test corpus: Hungarian Wikisource and Wikipedia

scripts/create_input.sh downloads and unpacks the latest Hungarian Wikisource XML dump. Why Wikisource? It's neither too small nor too large and, more importantly, it's valid UTF-8. Why Hungarian? There are many non-ASCII characters and the number of different word types is high.

scripts/create_large_input.sh downloads and unpacks the latest Hungarian Wikipedia. This is the largest input used for comparison; see the first leaderboard.

Usage

To test on a small sample:

time cat data/huwikisource-latest-pages-meta-current.xml | head -10000 | python3 python/wordcount_py3.py > python_out

Installation with Docker

I strongly recommend building the Docker image rather than installing every package manually. If you prefer a manual install, see the installation commands in the Dockerfile.

Docker image

You can run the experiment in a Docker container. A Dockerfile is provided; to build the image, run:

docker build -t wordcount --rm .

This might take a while.

Start a container from the image:

docker run -it wordcount bash

You should see the cloned directory in /root:

cd wordcount

Downloading the dataset

bash scripts/create_input.sh

or the full dataset:

bash scripts/create_large_input.sh

Compile/build/whatever the wordcount scripts

bash scripts/build.sh

Run tests on one language

scripts/test.sh runs all tests for one language, or more precisely, for a single command.

bash scripts/test.sh "python2 python/wordcount_py2.py"

Or

bash scripts/test.sh python/wordcount_py2.py

if the file is executable and has a valid shebang line.

The script either prints OK or the list of failed tests and a final FAIL.

Run tests on all languages

All commands are listed in the file run_commands.txt and the script scripts/test_all.sh runs test.sh with each command:

bash scripts/test_all.sh

Run the actual experiment on a larger dataset

If all tests pass, the scripts work reasonably well. This does not mean that all outputs will be identical; see the full test later. For now, we consider them good enough for testing.

The following command runs each test twice and appends the results to results.txt. It's possible to add a comment at the end of each line.

bash scripts/compare.sh data/huwikisource-latest-pages-meta-current.xml 2 "full huwikisource"

Or test it on a part of huwikisource:

bash scripts/compare.sh <( head -10000 data/huwikisource-latest-pages-meta-current.xml) 1

results.txt is a tab-separated file that can be formatted as a Markdown table with this command:

cat results.txt | python2 scripts/evaluate_results.py

This script prints the fastest run for each command in a Markdown table like this:

Rank	Experiment	CPU seconds	User time	Maximum memory	Contributor
1	rust/wordcount/wordcount	20.57	19.79	990008	Joshua Holmer
2	cpp/wc_vector	33.3	31.93	775952	Matias Fontanini, Judit Acs
3	python/wordcount_py2gabor.py	40.13	38.71	594800	Gabor Szabo
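Picking the fastest run per command can be sketched as follows. This is not the code of scripts/evaluate_results.py. The column layout is an assumption (command in the first column, CPU seconds in the second), since this README does not describe the format of results.txt, and the function names `fastest_runs` and `ranked` are hypothetical.

```python
def fastest_runs(lines):
    """Keep, for each command, the row with the lowest CPU time.

    Assumed layout (not confirmed by the README): tab-separated rows
    with the command in column 0 and CPU seconds in column 1.
    """
    best = {}
    for line in lines:
        row = line.rstrip("\n").split("\t")
        if len(row) < 2:
            continue  # blank or malformed line
        try:
            cpu_seconds = float(row[1])
        except ValueError:
            continue  # e.g. a header row
        command = row[0]
        if command not in best or cpu_seconds < float(best[command][1]):
            best[command] = row
    return best


def ranked(best):
    """Order the surviving rows by CPU seconds, fastest first."""
    return sorted(best.values(), key=lambda row: float(row[1]))
```

Calling `ranked(fastest_runs(open("results.txt")))` would then give the rows in leaderboard order, provided the assumed layout holds.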

Adding a new program

Adding a new programming language or a new version for an existing programming language consists of the following steps:

Add dependencies to the Dockerfile. Basically add the package to the existing apt-get package list.
If it needs compiling or any other setup method, add it to scripts/build.sh
Add the actual invoke command to run_commands.txt
If your executable differs from the source file, add the executable - source code mapping to binary_mapping.txt. This is used by scripts/evaluate_results.py for finding out the contributors of each program. The file is tab-separated.
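As an illustration of the last step, a tab-separated mapping file could be read like this. The column order (executable first, source file second) is an assumption based on the phrase "executable - source code mapping", and `load_binary_mapping` and `source_for` are hypothetical helper names, not functions from scripts/evaluate_results.py.

```python
def load_binary_mapping(lines):
    """Parse tab-separated 'executable<TAB>source' lines into a dict.

    The column order is an assumption; check binary_mapping.txt itself.
    """
    mapping = {}
    for line in lines:
        executable, sep, source = line.rstrip("\n").partition("\t")
        if not sep or not executable:
            continue  # skip blank lines and lines without a tab
        mapping[executable] = source
    return mapping


def source_for(executable, mapping):
    """Fall back to the executable itself when no mapping exists,
    which covers scripts that are run directly from their source."""
    return mapping.get(executable, executable)
```

With a hypothetical entry mapping `cpp/wordcount` to `cpp/wordcount.cpp`, `source_for("cpp/wordcount", mapping)` returns the source path, while an unmapped script such as `python/wordcount_py2.py` is returned unchanged.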

Adding your program to this experiment

Make sure all dependencies are installed via standard packages and your code compiles.
Make sure your code passes all the tests.
Make sure it runs for less than two minutes for 100,000 lines of text. If it is slower, it doesn't make much sense to add it.
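One way to check the two-minute limit before submitting is to time your command on the first 100,000 lines of the input. The sketch below is only a rough check under stated assumptions: `time_command` is a hypothetical helper, it measures wall-clock time rather than the CPU seconds reported in the leaderboards, and it loads the sample into memory.

```python
import itertools
import shlex
import subprocess
import time


def time_command(command, input_path, max_lines=100_000):
    """Run `command` on the first `max_lines` lines of `input_path`
    and return the elapsed wall-clock time in seconds."""
    with open(input_path, "rb") as f:
        # Iterating a binary file yields newline-terminated lines.
        sample = b"".join(itertools.islice(f, max_lines))
    start = time.perf_counter()
    subprocess.run(
        shlex.split(command),
        input=sample,                # fed to the program's STDIN
        stdout=subprocess.DEVNULL,   # the output is not needed here
        check=True,                  # raise if the program fails
    )
    return time.perf_counter() - start
```

For example, `time_command("python2 python/wordcount_py2.py", "data/huwikisource-latest-pages-meta-current.xml")` should return a value well below 120.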
