prosegrinder / python-prosegrinder

A relatively fast, functional prose text counter with readability scoring.

Add support for detecting English phonemes

yvlcmb opened this issue

Adding a feature to prosegrinder to enable the detection of English phonemes might be really useful.

There are over 40 phonemes in English despite only 26 letters in its alphabet - here's one source: https://www.dyslexia-reading-well.com/44-phonemes-in-english.html.

I don't know exactly how this would work. Perhaps it could be implemented as a new 'get_phoneme' method on the Dictionary class in the dictionary.py module. It would be nice if the feature could not only count the total number of phonemes but also output an iterable containing the specific phonemes that occur in a section of text/prose.
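
One way to picture the request is a small lookup over a pronunciation lexicon. The sketch below is hypothetical: `TINY_LEXICON` and `get_phonemes` are illustrative names rather than anything in prosegrinder, and the three hand-written entries use ARPAbet-style symbols purely for demonstration.

```python
import re

# Hypothetical three-word lexicon; a real one would hold many thousands of entries.
TINY_LEXICON = {
    "is": ["IH", "Z"],
    "not": ["N", "AA", "T"],
    "gold": ["G", "OW", "L", "D"],
}


def get_phonemes(text, lexicon=TINY_LEXICON):
    """Return the phonemes of every known word in text, in reading order."""
    phonemes = []
    for word in re.findall(r"[a-z']+", text.lower()):
        # Words missing from the lexicon are silently skipped in this sketch.
        phonemes.extend(lexicon.get(word, []))
    return phonemes


phonemes = get_phonemes("Is not gold.")
print(len(phonemes), phonemes)
```

Returning a list covers both halves of the request: `len()` gives the total count, and the list itself is the iterable of specific phonemes.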

This is doable but might take me a little while to get to it.

Here are some possible existing packages for phonemes (quick search for 1.x+ versioned packages):

  • phonemizer
  • gruut

I'll probably wind up incorporating one of these to do the underlying work, so if you test them out or have an opinion on which one works well for your use cases, let me know here.

Thanks for the recommendations. I had never heard of either of those, but they both look pretty capable, so I'll try them out.

Turns out CMUdict is still probably the best source of phones.

  • phonemizer has external (i.e. non-python) dependencies, which I want to avoid at all costs.
  • gruut is self-contained, but a quick query on its lexicon.db shows only 128870 entries. The CMUdict contains 135115 entries.

gruut does apparently have the ability to guess the pronunciation of words not in its lexicon, but I might be able to add something similar later. For now, I'm going to start with just cmudict and see how far that goes.
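
To make the cmudict-first approach concrete, here is a minimal sketch of turning CMUdict-style lines into a word-to-phones table. It assumes the plain-text layout of a word followed by space-separated ARPAbet phones, with stress digits on vowels and alternate pronunciations marked like `word(2)`. The sample lines and the `parse_entries` helper are illustrative only and are not prosegrinder code.

```python
import re

# Sample lines written in the assumed CMUdict plain-text layout.
SAMPLE = """\
gold G OW1 L D
the DH AH0
the(2) DH IY0
crown K R AW1 N
"""


def parse_entries(text):
    """Map each word to a list of pronunciations, with stress digits removed."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        head, *phones = line.split()
        word = re.sub(r"\(\d+\)$", "", head)           # "the(2)" -> "the"
        stripped = [p.rstrip("012") for p in phones]   # "OW1" -> "OW"
        table.setdefault(word, []).append(stripped)
    return table


table = parse_entries(SAMPLE)
print(table["the"])                  # both pronunciations of "the"
print(table.get("glitterati", []))   # unknown word: nothing to return
```

The empty result for an unknown word is exactly the gap that pronunciation guessing would have to fill later.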

@slingload - I have a branch called phones that adds minimal support for this. If you get some time, would you test it out and let me know if it's close to what you need, please?

I cloned phones and tested it out, and it seems to work nicely! Here's what I did. Let me know if I was using it incorrectly:

>>> from prosegrinder import Prose
>>> quotes = [
...     "All that glitters is not gold.",
...     "Hell is empty and all the devils are here.",
...     "Uneasy lies the head that wears a crown.",
... ]
>>> text = ' '.join(quotes)
>>> p = Prose(text)
>>> p.phone_count
73
>>> p.phone_frequency
{'AO': 2,
 'L': 7,
 'DH': 4,
 'AE': 2,
 'T': 5,
 'G': 2,
 'IH': 3,
 'ER': 1,
 'Z': 7,
 'N': 4,
 'AA': 2,
 'OW': 1,
 'D': 4,
 'HH': 3,
 'EH': 5,
 'M': 1,
 'P': 1,
 'IY': 4,
 'AH': 6,
 'V': 1,
 'R': 4,
 'AY': 1,
 'W': 1,
 'K': 1,
 'AW': 1}

This is exactly the kind of easy interface I was hoping for, well done!
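
Since the frequency output in the session above is a plain mapping of phone to count, it can be summarised with the standard library. A small sketch, with the values copied from that session; it uses only `collections.Counter` and does not assume anything further about prosegrinder's API.

```python
from collections import Counter

# Values copied from the p.phone_frequency output in the session above.
phone_frequency = Counter({
    'AO': 2, 'L': 7, 'DH': 4, 'AE': 2, 'T': 5, 'G': 2, 'IH': 3, 'ER': 1,
    'Z': 7, 'N': 4, 'AA': 2, 'OW': 1, 'D': 4, 'HH': 3, 'EH': 5, 'M': 1,
    'P': 1, 'IY': 4, 'AH': 6, 'V': 1, 'R': 4, 'AY': 1, 'W': 1, 'K': 1,
    'AW': 1,
})

total = sum(phone_frequency.values())        # should agree with p.phone_count
distinct = len(phone_frequency)              # number of different phones seen
top_three = phone_frequency.most_common(3)   # most frequent phones first
print(total, distinct, top_three)
```

The total works out to 73, which agrees with the `phone_count` reported in the session.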

Closed by #14