pizzamig / freq

[WIP] 🗼 A CLI term frequency analyzer. Counts the number of occurrences of each word in an input and creates formatted output or a histogram.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

freq

A commandline tool that counts the number of word occurences in an input.

James Munns on Twitter

This is just a placeholder repository for now. Please create issues for feature request and collaboration.

TODO

  • Ignore words (regex pattern)
  • Filter stopwords (similar to NLTK's stopwords)
  • Different output formats (plaintext, JSON)
  • Performance (SIMD support, async execution)
  • Usage as library
  • Recursion support
  • Allow skipping files
  • Allow specifying ignored words in a separate file
  • Generate "heat bars" for words like shell-hist does
  • Split report by file/folder (sort of like sloc does for code)
  • Choose language for stopwords
  • Format output (e.g. justify counts a la uniq -c)
  • Calculate TF-IDF score in a multi-file scenario

Idea contributors:

  • @jamesmunns
  • Want to see your name here? Create an issue!

Similar tools

uniq

A basic version would be

curl -L 'https://github.com/mre/freq/raw/main/README.md' | tr -cs '[:alnum:]' "\n" | grep -vEx 'and|or|for|a|of|to|an|in' | sort | uniq -c | sort

This works, but it's not very extensible by normal users. It would also lack most of the features listed above.

Lucene

Has all the bells and whistles, but there is no official CLI interface and requires a full Java installation.

wordcount

freqword <tab> freq

Nice and simple. Doesn't exclude stopwords and no regex support, though. https://github.com/juditacs/wordcount

word-frequency

Haskell-based approach: Includes features like min length for words, or min occurrences of words in a text. https://github.com/cbzehner/word-frequency

What else?

There must be more tools out there. Can you help me find them?

About

[WIP] 🗼 A CLI term frequency analyzer. Counts the number of occurrences of each word in an input and creates formatted output or a histogram.

License:Apache License 2.0


Languages

Language:Rust 100.0%