This app takes one or more text files as user input (or text from standard input) and returns the 100 most frequent three-word phrases, along with their counts.
Ruby 2.7.5
- Clone the repo
git clone https://github.com/bkiggen/top-100.git
- Install Gems
bundle install
Run the rake task, followed by a relative path to a .txt file:
rake run <your-txt-file-location>
For example:
rake run ./spec/test_texts/moby-dick.txt
You can also send multiple .txt files like so:
rake run ./spec/test_texts/moby-dick.txt ./spec/test_texts/brothers-karamazov.txt
Or, to run the script with standard input, use:
cat spec/test_texts/moby-dick.txt | rake run
For ease of experimentation, I've also added a few rake commands that run against the included .txt files. Try:
rake moby
rake brothers
rake both
To run the tests with RSpec, enter the following command:
rake test
- The program accepts a list of one or more file paths (e.g. ruby solution.rb texts/moby-dick.txt brothers-karamazov.txt ...).
- The program also accepts input via stdin (e.g. cat texts/*.txt | java solution.java).
- The program outputs the first 100 most common three-word sequences.
- The program ignores punctuation and line endings, and is case insensitive (e.g. “I love\nsandwiches.” should be treated the same as "(I LOVE SANDWICHES!!)").
- Contractions shouldn't be changed into two words (e.g. can't should not become can t).
- Hyphens on line endings can be treated as continuations OR punctuation.
- Unicode may also be handled in any way you like (including replacing it with a space).
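The normalization and counting rules above can be sketched roughly as follows. This is a minimal illustration, not the code in this repo: the method names normalize and top_phrases are hypothetical, non-ASCII characters are simply replaced with spaces, and a hyphen at a line ending is treated as punctuation.

```ruby
# Hypothetical helpers for illustration; not taken from the repo.

# Lowercase the text, keep apostrophes that sit inside a word, and turn
# every other non-alphanumeric character into a space.
def normalize(text)
  text.downcase
      .tr("’", "'")                                # curly apostrophes become straight ones
      .gsub(/[^a-z0-9'\s]/, " ")                   # punctuation and non-ASCII become spaces
      .gsub(/(?<![a-z0-9])'|'(?![a-z0-9])/, " ")   # drop apostrophes not inside a word
end

# Count every run of three consecutive words and return the most frequent
# phrases as [phrase, count] pairs, with ties broken alphabetically.
def top_phrases(text, limit = 100)
  words = normalize(text).split
  counts = Hash.new(0)
  words.each_cons(3) { |triple| counts[triple.join(" ")] += 1 }
  counts.sort_by { |phrase, count| [-count, phrase] }.first(limit)
end
```

Because split breaks on any whitespace, line endings need no special handling, and contractions such as can't survive as a single word because the apostrophe is only removed when it is not surrounded by letters or digits.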
- Dockerize script
- Improve handling of non-ASCII characters
- Could be cool to output to CSV
- I'd like to learn more about stdin/stdout in CLIs and improve on its implementation here
- In order to satisfy the stdin requirement and ensure that type of input was tested, I had to add some branching logic to the cli class that I wasn't crazy about. Everything works but it feels hacky.
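One possible way to reduce that branching is Ruby's built-in ARGF, which reads the files named in ARGV one after another and falls back to standard input when ARGV is empty. The sketch below assumes the script is invoked directly with only file paths as arguments (not through rake, whose own arguments would also appear in ARGV); read_input is a hypothetical name, and the stream is passed as a parameter so that tests can substitute a StringIO.

```ruby
# Hypothetical helper for illustration; not taken from the repo.
# With no argument it uses ARGF, which concatenates the files listed in
# ARGV, or reads standard input if ARGV is empty.
def read_input(stream = ARGF)
  stream.read.to_s # to_s guards against a nil result on an empty stream
end

# Possible usage from a script run as `ruby script.rb a.txt b.txt`
# or `cat a.txt | ruby script.rb`:
#   text = read_input
```

In a spec, the stdin path could then be exercised by calling read_input with a StringIO instead of stubbing the real standard input.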
Distributed under the MIT License.
Ben Kiggen - benkiggen@gmail.com
Project Link: https://github.com/bkiggen/top-100