Add support for detecting English phonemes

yvlcmb opened this issue 3 years ago · comments

Adding a feature to prosegrinder to enable the detection of English phonemes might be really useful.

There are over 40 phonemes in English despite only 26 letters in its alphabet - here's one source: https://www.dyslexia-reading-well.com/44-phonemes-in-english.html.

I'm not sure how this would work; perhaps it could be implemented as a new 'get_phoneme' method on the Dictionary class in the dictionary.py module. It would be nice if the feature could not only count the total number of phonemes but also return an iterable of the specific phonemes that occur in a section of text or prose.

David L. Day · Answer 1 · Fri Aug 13 2021 18:09:27 GMT+0800 (China Standard Time)

This is doable but might take me a little while to get to it.

Here are some possible existing packages for phonemes (quick search for 1.x+ versioned packages):

phonemizer
gruut

I'll probably wind up incorporating one of these to do the underlying work, so if you test them out or have an opinion on which one works well for your use cases, let me know here.

yvlcmb · Answer 2 · Sat Aug 14 2021 18:30:47 GMT+0800 (China Standard Time)

Thanks for the recommendations. I had never heard of either of those; they both look pretty capable, so I'll try them out.

David L. Day · Answer 3 · Sun Aug 15 2021 19:53:19 GMT+0800 (China Standard Time)

Turns out CMUdict is still probably the best source of phones.

phonemizer has external (i.e. non-Python) dependencies, which I want to avoid at all costs.
gruut is self-contained, but a quick query on its lexicon.db shows only 128870 entries. The CMUdict contains 135115 entries.

gruut does apparently have the ability to guess the pronunciation of words not in its lexicon, but I might be able to add something similar later. For now, I'm going to start with just cmudict and see how far that goes.
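
As a rough sketch of the cmudict-only approach described above: CMUdict lists each word's pronunciation as ARPAbet phones, with a stress digit on the vowels (for example OW1). The snippet below is not the code from prosegrinder. SAMPLE_PRONUNCIATIONS, strip_stress, and phones_in_text are hypothetical names, and the tiny dictionary is hand-typed in CMUdict's style rather than loaded from the real data. Words missing from the lexicon are simply skipped, which is the gap that pronunciation guessing would fill.

```python
import re
from collections import Counter

# Hand-typed sample in CMUdict's style (assumption: not read from the real file).
# Each word maps to a list of pronunciations; each pronunciation is a list of phones.
SAMPLE_PRONUNCIATIONS = {
    "all": [["AO1", "L"]],
    "that": [["DH", "AE1", "T"]],
    "glitters": [["G", "L", "IH1", "T", "ER0", "Z"]],
    "is": [["IH1", "Z"]],
    "not": [["N", "AA1", "T"]],
    "gold": [["G", "OW1", "L", "D"]],
}


def strip_stress(phone):
    """Drop the trailing stress digit, e.g. 'OW1' -> 'OW'."""
    return phone.rstrip("012")


def phones_in_text(text, lexicon):
    """Return the phones of every known word, in order; unknown words are skipped."""
    phones = []
    for word in re.findall(r"[a-z']+", text.lower()):
        pronunciations = lexicon.get(word)
        if pronunciations:
            # Use the first listed pronunciation when a word has several.
            phones.extend(strip_stress(p) for p in pronunciations[0])
    return phones


phones = phones_in_text("All that glitters is not gold.", SAMPLE_PRONUNCIATIONS)
phone_count = len(phones)          # total number of phones found
phone_frequency = Counter(phones)  # how often each phone occurs
```

Taking only the first pronunciation keeps the sketch simple; a word with several listed pronunciations would need a policy for choosing among them.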

David L. Day · Answer 4 · Sun Aug 15 2021 23:24:01 GMT+0800 (China Standard Time)

@slingload - I have a branch called phones that adds minimal support for this. If you get some time, would you test it out and let me know if it's close to what you need, please?

yvlcmb · Answer 5 · Mon Aug 16 2021 13:07:49 GMT+0800 (China Standard Time)

I cloned the phones branch and tested it out, and it seems to work nicely! Here's what I did; let me know if I was using it incorrectly:

>>> from prosegrinder import Prose
>>> quotes = [
...     "All that glitters is not gold.",
...     "Hell is empty and all the devils are here.",
...     "Uneasy lies the head that wears a crown.",
... ]
>>> text = ' '.join(quotes)
>>> p = Prose(text)
>>> p.phone_count
73
>>> p.phone_frequency
{'AO': 2,
 'L': 7,
 'DH': 4,
 'AE': 2,
 'T': 5,
 'G': 2,
 'IH': 3,
 'ER': 1,
 'Z': 7,
 'N': 4,
 'AA': 2,
 'OW': 1,
 'D': 4,
 'HH': 3,
 'EH': 5,
 'M': 1,
 'P': 1,
 'IY': 4,
 'AH': 6,
 'V': 1,
 'R': 4,
 'AY': 1,
 'W': 1,
 'K': 1,
 'AW': 1}

This is exactly the kind of easy interface I was hoping for, well done!
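
Since phone_frequency in the session above prints as a plain mapping of phone to count, it can be summarised with the standard library. This is a small sketch that assumes the value behaves like a dict (the session shows a dict-style repr, but the branch may return a different mapping type). The counts are copied from the output above, and most_common_phones is a hypothetical helper, not part of prosegrinder.

```python
from collections import Counter

# Counts copied from the p.phone_frequency output in the session above.
phone_frequency = {
    "AO": 2, "L": 7, "DH": 4, "AE": 2, "T": 5, "G": 2, "IH": 3, "ER": 1,
    "Z": 7, "N": 4, "AA": 2, "OW": 1, "D": 4, "HH": 3, "EH": 5, "M": 1,
    "P": 1, "IY": 4, "AH": 6, "V": 1, "R": 4, "AY": 1, "W": 1, "K": 1,
    "AW": 1,
}


def most_common_phones(frequency, n=3):
    """Return the n most frequent phones as (phone, count) pairs."""
    return Counter(frequency).most_common(n)


total = sum(phone_frequency.values())  # 73, the same value as p.phone_count
distinct = len(phone_frequency)        # 25 distinct phones in the three quotes
top = most_common_phones(phone_frequency)
```

The per-phone counts add up to the reported phone_count, which is a quick sanity check that the two properties agree.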

David L. Day · Answer 6 · Mon Aug 16 2021 19:35:40 GMT+0800 (China Standard Time)

Closed by #14