LWWordCounter is an application that displays the frequency of words from a text file. It also summarizes several statistics for that file, such as the most popular word and the total number of words. Finally, LWWordCounter serializes this summary into either UBJSON or BSON.
The Deserializer
Python class is provided to deserialize this summary data. For more information
on using the Deserializer
, refer to this page.
git clone https://github.com/leeway64/LWWordCounter.git
cd LWWordCounter
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_MAKE_PROGRAM=ninja -G Ninja -S . -B ./cmake-build-release
cmake --build ./cmake-build-release --target LWWordCounter
📝 Note |
---|
if the CMake output says that a certain package is locked by another concurrent conan process, wait... , then run conan remove -f <package_name>/<package_version> in the terminal. |
Input is entered using a JSON file. Let's assume that the JSON file looks like this:
{
"input_file_name": "../src/tests/test_files/moby_dick.txt",
"statistics":
{
"minimum_occurrences": 900,
"k_most_frequent_words": 5,
"word_length_to_find": 4
},
"serialization_format": "UBJSON"
}
This version of Moby Dick (by Herman Melville) was found on Project Gutenberg.
This JSON file is provided as an example in the bin
folder.
Next, run LWWordCounter:
cd bin
LWWordCounter
After running the program, the console output will be:
Text file selected: ../src/tests/test_files/moby_dick.txt
File summary:
Most popular word: the
Longest word: uninterpenetratingly
Shortest word: 0
Number of words with length 4: 1359
highest_frequency 14727
unique_words 17342
total_words 222673
Top 5 words:
a 4805
and 6515
of 6747
the 14727
to 4709
Minimum number of occurrences for printing: 900
Word frequencies:
14727 the
1069 him
1770 with
1644 for
1545 all
1066 so
1822 s
3100 that
2532 his
1064 be
1443 this
1333 at
969 you
6747 of
6515 and
1747 is
4245 in
1244 whale
1105 from
4805 a
4709 to
1752 as
1180 not
1822 but
1232 by
925 one
2537 it
1900 he
2127 i
1647 was
1073 on
A summary of this text file has been serialized into moby_dick_serialized_summary.ubj
File has been analyzed and a summary has been serialized successfully
A summary of the file will also be serialized; the serialized file will be called
[input text file]_serialized_summary.[serialization format]
.
To run the C++ unit tests, make sure that the BuildTests
option in CMakeLists.txt
is set to ON
.
Then, run the following commands:
cmake --build ./cmake-build-release --target WordCounter_Tests
cd bin
WordCounter_Tests
Run the following commands in the root directory of this project.
python3 -m venv .venv
source .venv/bin/activate
pip install -U pip wheel setuptools
pip install -r requirements.txt
pytest
py -3 -m venv .venv
.venv/Scripts/activate
pip install -U pip wheel setuptools
pip install -r requirements.txt
pytest