This word analyser takes a file as input, analyses the words it contains and produces some metrics as output.
For example, given the following text:
Hello world & good morning. The date is 18/05/2016
the following output will be produced:
Word count = 9
Average word length = 4.556
Number of words of length 1 is 1
Number of words of length 2 is 1
Number of words of length 3 is 1
Number of words of length 4 is 2
Number of words of length 5 is 2
Number of words of length 7 is 1
Number of words of length 10 is 1
The most frequently occurring word length is 2, for word lengths of 4 & 5
The client provided no guidance on what constitutes a word, so in the current implementation, StandardWordAnalyser, a word is one or more characters separated by one or more whitespace characters. These include spaces, tabs, new lines, etc., and can be denoted by the following regular expression:
[ \t\n\x0B\f\r]
Note that formatted numbers are not counted as words.
The client provided no indication of any punctuation that should be removed. There is one clue in the example, though: the word morning. (including the period) is 8 characters long, yet the output reports no 8-character word. We can infer that the period is stripped prior to analysis.
Given this, the period and the following other punctuation characters are removed (via a regex) prior to analysis:
[-+.^:?,=()]
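The two rules above (split on whitespace, then strip punctuation) can be sketched in a few lines of Java. This is an illustrative sketch only, not the project's StandardWordAnalyser: the class name WordMetricsSketch and its method names are hypothetical, and the separator is assumed to include the space character.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical sketch; names are illustrative and not taken from the project.
class WordMetricsSketch {

    // Assumed separator: one or more whitespace characters (space included).
    static final String SEPARATOR = "[ \\t\\n\\x0B\\f\\r]+";

    // Punctuation stripped before analysis, as listed in this README.
    static final String PUNCTUATION = "[-+.^:?,=()]";

    // Splits on whitespace, strips punctuation and drops tokens left empty.
    static List<String> words(String text) {
        return Arrays.stream(text.trim().split(SEPARATOR))
                .map(w -> w.replaceAll(PUNCTUATION, ""))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.toList());
    }

    // Maps each word length to the number of words of that length, sorted by length.
    static Map<Integer, Long> lengthCounts(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(String::length, TreeMap::new, Collectors.counting()));
    }

    // Mean word length, or 0.0 when there are no words.
    static double averageLength(List<String> words) {
        return words.stream().mapToInt(String::length).average().orElse(0.0);
    }

    // Every length that shares the highest count, in ascending order.
    static List<Integer> mostFrequentLengths(Map<Integer, Long> counts) {
        long max = counts.values().stream().mapToLong(Long::longValue).max().orElse(0L);
        return counts.entrySet().stream()
                .filter(e -> e.getValue() == max)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> words = words("Hello world & good morning. The date is 18/05/2016");
        Map<Integer, Long> counts = lengthCounts(words);
        System.out.println("Word count = " + words.size());
        System.out.println(String.format(Locale.ROOT, "Average word length = %.3f", averageLength(words)));
        counts.forEach((len, n) -> System.out.println("Number of words of length " + len + " is " + n));
        System.out.println("Most frequent word lengths: " + mostFrequentLengths(counts));
    }
}
```

Under these assumptions the example sentence yields 9 words with a total of 41 characters, so the average is 41 / 9, which rounds to 4.556. The date 18/05/2016 survives as a single 10-character word because the slash is not in the punctuation set.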
The Maven build uses the Maven Wrapper, so the build is always executed with the correct version of Maven, which is installed first if necessary. This simplifies setup and makes builds more consistent across machines.
First clone this repository and cd into the project directory, then build using (Linux/Mac):
./mvnw clean install
or (Windows):
mvnw.cmd clean install
A normal Maven build will be executed, with one important change: if the user doesn't have the version of Maven specified in .mvn/wrapper/maven-wrapper.properties, it will first be downloaded and installed, and then used.
The client did not specify guidance on how the program should be run.
The Analyser takes only a file as input. Below is the usage (shown by passing --help):
$ java -jar target/words-1.0.0-SNAPSHOT.jar --help
Usage: words [-hV] <path>
read the contents of a plain text file and enable the display of the total
number of words, the average word length, the most frequently occurring word
length, and a list of the number of words of each length.
<path> The file containing the words to count.
-h, --help Show this help message and exit.
-V, --version Print version information and exit.
The build produces an "uber" jar using the Shade plugin, which can be run with the following command (Linux/Mac):
java -jar target/words-1.0.0-SNAPSHOT.jar src/test/resources/short_1.txt
or (Windows):
java -jar target\words-1.0.0-SNAPSHOT.jar src\test\resources\short_1.txt
This should give exactly the same output as the example above.
There are a couple of extra test files located in the src/test/resources directory.
Provided in the root directory is a binary (words) that will run the analysis. This is a natively compiled Linux binary that does not require a JVM. It provides much faster startup, lower memory use and, in some cases, faster performance.
This can be run via (Linux only):
./words src/test/resources/short_1.txt
To get help on usage, run:
./words --help
You will note that this runs faster, as no JVM needs to be started.
Currently, native image compilation is disabled in the POM, as the GraalVM JDK may not be installed. If you want to enable it, install GraalVM (the version should match that in the POM) including the native image extension, uncomment the plugin in the POM and then compile as normal (clean install).
Tip: Use SDKMAN! to manage multiple SDKs and various other libraries.
XML comments cannot contain a double dash, so when uncommenting the GraalVM plugin you need to add the missing dashes back to the entries in the buildArgs section. It should look like:
<buildArgs>
--no-fallback
--allow-incomplete-classpath
</buildArgs>
Note the double dash prefixes.
At the moment, the file is read into memory and then processed. One improvement would be to stream the file from disk or a URL as it is processed. This would allow the analysis to run on memory-constrained machines and to handle huge text files.
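A minimal sketch of the streaming idea, assuming the same separator and punctuation rules as described earlier. The class StreamingCountSketch and its method are hypothetical and not part of the project. It reads one line at a time and keeps only the per-length counts in memory, so a file made up of a single enormous line would still need a different approach.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; not the project's implementation.
class StreamingCountSketch {

    // Counts words per length while reading the source line by line.
    static Map<Integer, Long> lengthCounts(Reader source) {
        Map<Integer, Long> counts = new TreeMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // readLine() has already removed the line terminator.
                for (String token : line.split("[ \\t\\x0B\\f]+")) {
                    String word = token.replaceAll("[-+.^:?,=()]", "");
                    if (!word.isEmpty()) {
                        counts.merge(word.length(), 1L, Long::sum);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // With a path argument the file is streamed; otherwise a small demo text is used.
        Reader source = args.length == 1
                ? Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)
                : new StringReader("Hello world & good morning.\nThe date is 18/05/2016\n");
        Map<Integer, Long> counts = lengthCounts(source);
        long wordCount = counts.values().stream().mapToLong(Long::longValue).sum();
        long totalChars = counts.entrySet().stream().mapToLong(e -> e.getKey() * e.getValue()).sum();
        System.out.println("Word count = " + wordCount);
        System.out.println("Total characters = " + totalChars);
        counts.forEach((len, n) -> System.out.println("Number of words of length " + len + " is " + n));
    }
}
```

The word count and the average word length can both be derived from the per-length counts, so no list of words needs to be retained.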
Gather more information from the client about what constitutes a word, what punctuation should be stripped and at what point. Also, should dates be included as words? (They currently are.)