bkiggen / top-100

Ahab's Phrase Frequency Counter

A Ruby script that counts the frequency of every three-word phrase in a text or group of texts.

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Requirements
  5. Roadmap
  6. Known Issues
  7. License
  8. Contact

About The Project

This app takes one or more text files as user input and returns the 100 most frequent three-word phrases, along with their counts.
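
As a rough illustration of the counting step, here is a minimal sketch that assumes the text has already been split into lowercase words. The `top_phrases` helper is a hypothetical name and is not taken from the project's source; the actual implementation may differ.

```ruby
# Hypothetical helper (not from the project source): counts every run of
# three consecutive words and returns the most frequent ones.
def top_phrases(words, limit = 100)
  words.each_cons(3)                                   # sliding window of 3 words
       .map { |triple| triple.join(" ") }              # ["a", "b", "c"] -> "a b c"
       .tally                                          # phrase => count (Ruby 2.7+)
       .sort_by { |phrase, count| [-count, phrase] }   # highest count first, ties A-Z
       .first(limit)
end

words = %w[the white whale the white whale the sea]
top_phrases(words, 2).each { |phrase, count| puts "#{phrase} - #{count}" }
# the white whale - 2
# white whale the - 2
```

Breaking ties alphabetically is a choice made here only to keep the output deterministic.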

Built With

Ruby

Getting Started

Prerequisites

Ruby 2.7.5

Installation

  1. Clone the repo
    git clone https://github.com/bkiggen/top-100.git
  2. Install Gems
    bundle install

Usage

Running the app

Run the rake task, followed by a relative path to a .txt file:

rake run <your-txt-file-location>

For example:

rake run ./spec/test_texts/moby-dick.txt

You can also send multiple .txt files like so:

  rake run ./spec/test_texts/moby-dick.txt ./spec/test_texts/brothers-karamazov.txt

To feed the script through standard input instead, use:

  cat spec/test_texts/moby-dick.txt | rake run
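
How the rake task switches between file arguments and piped input is not shown in this README. One way a standalone Ruby script could handle both cases is sketched below. `read_input` is a hypothetical helper, and joining files with a newline is an assumption.

```ruby
require "stringio"

# Hypothetical helper (not from the project source): returns the combined
# text of all given files, or everything on the input stream when no
# paths are given.
def read_input(paths, stdin = $stdin)
  return stdin.read if paths.empty?

  paths.map { |path| File.read(path) }.join("\n")
end

# A standalone script would call read_input(ARGV).
# Here a StringIO stands in for piped input:
puts read_input([], StringIO.new("Call me Ishmael."))
```

Because the files are joined into one string, a phrase can span the boundary between two files. Counting each file separately would avoid that.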

To make experimenting easier, I've also added a few rake tasks that run against the included .txt files. Try:

  rake moby
  rake brothers
  rake both

Testing

To run the tests with RSpec, enter the following command:

rake test

Requirements

  • The program accepts a list of one or more file paths (e.g. ruby solution.rb texts/moby-dick.txt brothers-karamazov.txt ...).
  • The program also accepts input via stdin (e.g. cat texts/*.txt | java solution.java).
  • The program outputs the 100 most common three-word sequences.
  • The program ignores punctuation and line endings, and is case-insensitive (e.g. “I love\nsandwiches.” should be treated the same as "(I LOVE SANDWICHES!!)").
  • Contractions shouldn't be split into two words (e.g. can't should not become can t).
  • Hyphens on line-endings can be treated as continuations OR punctuation.
  • Unicode may also be handled in any way you like (including replacing it with a space).
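
The normalization rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: `normalize` is a hypothetical helper, line-ending hyphens are treated as continuations, and any non-ASCII character other than a curly apostrophe is replaced with a space.

```ruby
# Hypothetical helper (not from the project source): turns raw text into
# a list of lowercase words according to the requirements listed above.
def normalize(text)
  text.downcase
      .gsub(/[\u2018\u2019]/, "'")         # curly apostrophes -> straight
      .gsub(/-\r?\n/, "")                  # line-ending hyphen: join the halves
      .gsub(/[^a-z0-9'\s]/, " ")           # punctuation and other Unicode -> space
      .split                               # split on any whitespace, incl. newlines
      .map { |word| word.gsub(/\A'+|'+\z/, "") }  # drop quotes wrapping a word
      .reject(&:empty?)
end

p normalize("I love\nsandwiches.")    # ["i", "love", "sandwiches"]
p normalize("(I LOVE SANDWICHES!!)")  # ["i", "love", "sandwiches"]
p normalize("I can't")                # ["i", "can't"]
```

Keeping the apostrophe in the allowed character set is what preserves contractions. The later `map` step only strips apostrophes at the edges of a word, where they act as quotation marks.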

Roadmap

  • Dockerize script
  • Improve handling of non-ASCII characters
  • Could be cool to output to CSV
  • I'd like to learn more about stdin/stdout in CLIs and improve on its implementation here
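
For the CSV idea, Ruby's standard `csv` library would probably be enough. The sketch below assumes the results are available as phrase/count pairs. `phrases_to_csv` is a hypothetical name, and the column headers are an assumption.

```ruby
require "csv"

# Hypothetical helper (not part of the project): renders phrase/count
# pairs as CSV text with a header row.
def phrases_to_csv(phrase_counts)
  CSV.generate do |csv|
    csv << %w[phrase count]
    phrase_counts.each { |phrase, count| csv << [phrase, count] }
  end
end

puts phrases_to_csv([["the white whale", 2], ["white whale the", 1]])
```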

Known Issues

  • To satisfy the stdin requirement and make sure that type of input was tested, I had to add some branching logic to the CLI class that I'm not crazy about. Everything works, but it feels hacky.

License

Distributed under the MIT License.

Contact

Ben Kiggen - benkiggen@gmail.com

Project Link: https://github.com/bkiggen/top-100
