bgutter / cl-phonetic

Phonetic pattern matching library for Common Lisp

cl-phonetic

A phonetic pattern matching library for Common Lisp.

This library is intended to replace the Sylvia library for Python.

The Short Version

(let
    ((dict (from-cmudict "./cmudict")))

  (pronounce-word dict "tomato")
  ;; (#<PRONUNCIATION T AH M AA T OW> #<PRONUNCIATION T AH M EY T OW>)

  (regex-search dict "P .* EY T OW #+")
  ;; Words starting with the P sound, followed by anything, and ending with the
  ;; EY T OW sequence, plus at least one more consonant.
  ;; (("potatoes" #<PRONUNCIATION P AH T EY T OW Z>)
  ;;  ("plato's" #<PRONUNCIATION P L EY T OW Z>))

  (regex-search dict "#<v,,f> %% NG")
  ;; Two syllable words beginning with a voiced fricative and ending with NG
  ;; (("zooming" #<PRONUNCIATION Z UW M IH NG>)
  ;;  ("zapping" #<PRONUNCIATION Z AE P IH NG>)
  ;;  ("voicing" #<PRONUNCIATION V OY S IH NG>)
  ;;  ("vesting" #<PRONUNCIATION V EH S T IH NG>)
  ;;  ...

  (generate-regex 'rhyme (first (pronounce-word dict "roses")))
  ;; ".* OW Z IH Z "

  (generate-regex 'rhyme (first (pronounce-word dict "roses")) :loose t)
  ;; ".* OW #* Z #* IH #* Z #* "

  (generate-regex 'rhyme (first (pronounce-word dict "roses")) :near t)
  ;; ".* OW Z [  AE EH IH IY  ] Z "

  (the-words (find-metapattern dict 'rhyme "roses"))
  ;; Words that perfectly rhyme with "roses"
  ;; ("supposes" "roses" "rose's" "proposes" "primroses" "presupposes" "poses"
  ;;  "overexposes" "opposes" "noses" "juxtaposes" "imposes" "hoses" "forecloses"
  ;;  "exposes" "dozes" "disposes" "discloses" "decomposes" "composes" "closes"
  ;;  "bulldozes")

  (the-words (find-metapattern dict 'rhyme "roses" :loose t))
  ;; Words that loosely or perfectly rhyme with "roses"
  ;; ("supposes" "roses" "rose's" "rolls's" "proposes" "primroses" "presupposes"
  ;;  "poses" "overexposes" "opposes" "noses" "juxtaposes" "joneses" "jones's"
  ;;  "imposes" "hoses" "holmes's" "forecloses" "exposes" "dozes" "doles's"
  ;;  "disposes" "discloses" "decomposes" "composes" "closings" "closes" "bulldozes")

  (the-words (find-metapattern dict 'rhyme "roses" :loose t :near t))
  ;; ("supposes" "rosie's" "roses" "rosecrans" "roseanne's" "rose's" "rolls's"
  ;;  "proposes" "primroses" "presupposes" "poses" "overexposes" "opposes" "noses"
  ;;  "juxtaposes" "joneses" "jones's" "imposes" "hoses" "holmes's" "grozny's"
  ;;  "forecloses" "exposes" "dozes" "doles's" "disposes" "discloses" "decomposes"
  ;;  "composes" "closings" "closes" "bulldozes" "bozell's")

  (the-words (find-metapattern dict 'consonance "babble"))
  ;; Words with consonance against "babble"
  ;; ("burble" "bubel" "bubbly" "bubble" "bobble" "biebel" "bibler" "bible" "bauble"
  ;;  "babula" "babler" "babel" "babbler" "babble")

  (the-words (find-metapattern dict 'consonance "babble" :loose t))
  ;; Words with a loose consonance against "babble"
  ;; ("obtainable" "observable" "objectionable" "butterball" "burnable" "burble"
  ;;  "bumbly" "bumble" "bumbalough" "buildable" "bubel" "bubbly" "bubble"
  ;;  "brumbelow" "brightbill" "brechbill" "breakable" "bramble" "brambila"
  ;;  "brakebill" "brackbill" "botshabelo" "bookmobile" "bonnibelle" "bonnibel"
  ;;  "bobble" "bobadilla" "bluebottle" "bluebell" "blankenbeckler" "blackball"
  ;;  "biodegradable" "billable" "biebel" "biddable" "biblical" "bibler" "bible"
  ;;  "berkebile" "believable" "bearably" "bearable" "beachball" "bauble"
  ;;  "basketball" "baseball" "barboursville" "barbella" "barbell" "barbanel"
  ;;  "barbagallo" "bankable" "babula" "babler" "babel" "babbler" "babble"
  ;;  "abominable")

  (the-words (find-metapattern dict 'assonance "ivanhoe"))
  ;; Words with assonance against "ivanhoe"
  ;; ("zydeco" "xylophone" "virazole" "styrofoam" "microscopes" "microscope"
  ;;  "microphone" "kayapo" "ivanhoe" "ivaco" "isotopes" "isotope" "isentrope"
  ;;  "idaho's" "idaho" "gyroscopes" "gyroscope" "dynamo" "dialtone" "diagnosed"
  ;;  "diagnose" "cyclostomes" "cyclostome")

  (the-words (find-metapattern dict 'alliteration "phone"))
  ;; Words with alliteration against "phone"
  ;; ( ... "foam" "phoning" "fairway" "philistine" ...)

  (pronounce-utterance dict "Eat it!")
  ;; #<PRONUNCIATION IY T IH T>

  (the-words (find-metapattern dict 'rhyme (pronounce-utterance dict "Eat it!") :loose t))
  ;; ("restricts" "restrict" "elitists" "elitist" "defeatist" "betwixt")
  )

Features

This library is a work in progress, and its features are at varying degrees of readiness.

| Status Symbol | Meaning |
|---|---|
| 💡 | Planning stage. |
| ⛏ | Initial groundwork started. Usually, this just means that it’s already done in Sylvia’s codebase. |
| 🚧 | Currently under active implementation. |
| ✅ | Done. |

✅ Phonetic Pattern Matching via Regular Expressions

cl-phonetic can search a phonetic dictionary for words whose pronunciations match a phonetic regular expression. These regex are similar to Perl's in syntax, but have nothing to do with ASCII or Unicode character sets. Instead, the alphabet of the language tested by these regex consists only of phonemes.

Phoneme Literals

In a phonetic regex, phoneme literals are defined according to the ARPABET, as used by cmudict. The full list of ARPABET phoneme encodings used by cmudict is reproduced here.

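| Phoneme | Example | Translation |
|---|---|---|
| AA | odd | AA D |
| AE | at | AE T |
| AH | hut | HH AH T |
| AO | ought | AO T |
| AW | cow | K AW |
| AY | hide | HH AY D |
| B | be | B IY |
| CH | cheese | CH IY Z |
| D | dee | D IY |
| DH | thee | DH IY |
| EH | Ed | EH D |
| ER | hurt | HH ER T |
| EY | ate | EY T |
| F | fee | F IY |
| G | green | G R IY N |
| HH | he | HH IY |
| IH | it | IH T |
| IY | eat | IY T |
| JH | gee | JH IY |
| K | key | K IY |
| L | lee | L IY |
| M | me | M IY |
| N | knee | N IY |
| NG | ping | P IH NG |
| OW | oat | OW T |
| OY | toy | T OY |
| P | pee | P IY |
| R | read | R IY D |
| S | sea | S IY |
| SH | she | SH IY |
| T | tea | T IY |
| TH | theta | TH EY T AH |
| UH | hood | HH UH D |
| UW | two | T UW |
| V | vee | V IY |
| W | we | W IY |
| Y | yield | Y IY L D |
| Z | zee | Z IY |
| ZH | seizure | S IY ZH ER |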

When they occur in a phonetic regex, these phoneme literals should be space delimited. For example, K AE T is a phonetic regex which matches the English word “cat”.

Since these regex are Perl-like, K AE .* is also a valid phonetic regex, and matches words like “cat”, “Canberra”, “cathode”, etc.
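To make the idea of a phoneme alphabet concrete, here is a minimal sketch of whole-sequence matching over space-delimited phonemes. It is not how cl-phonetic is implemented; the function names split-phonemes, match-phonemes, and phonetic-match-p are hypothetical, and only literals, . and .* are handled.

```lisp
;; Toy matcher over phoneme tokens. All names here are hypothetical and
;; only phoneme literals, "." and ".*" are supported.

(defun split-phonemes (string)
  "Split a space-delimited phoneme string into a list of phoneme strings."
  (let ((tokens '())
        (start nil))
    (dotimes (i (length string))
      (if (char= (char string i) #\Space)
          (when start
            (push (subseq string start i) tokens)
            (setf start nil))
          (unless start (setf start i))))
    (when start (push (subseq string start) tokens))
    (nreverse tokens)))

(defun match-phonemes (pattern phonemes)
  "Return true when PATTERN (a list of tokens) matches all of PHONEMES."
  (cond ((null pattern) (null phonemes))
        ((string= (first pattern) ".*")
         ;; .* consumes zero phonemes, or one phoneme and then tries again.
         (or (match-phonemes (rest pattern) phonemes)
             (and phonemes (match-phonemes pattern (rest phonemes)))))
        ((null phonemes) nil)
        ((or (string= (first pattern) ".")
             (string= (first pattern) (first phonemes)))
         (match-phonemes (rest pattern) (rest phonemes)))
        (t nil)))

(defun phonetic-match-p (regex pronunciation)
  "Match a space-delimited REGEX against a space-delimited PRONUNCIATION."
  (match-phonemes (split-phonemes regex) (split-phonemes pronunciation)))
```

Because the tokens are whole phonemes rather than characters, K AE .* can never match halfway through a two-letter phoneme such as AE or TH.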

Phoneme Class Expressions

cl-phonetic further extends Perl syntax by introducing a new facility for defining classes and sequences of phonemes. To start:

  • # matches any single consonant phoneme
  • @ matches any single vowel phoneme
  • % matches any single syllable

Both the # and @ class symbols may optionally accept arguments which further constrain matches. These arguments consist of comma-delimited characters within angle brackets. For example, #<v,,f> matches only voiced, fricative consonants.

You need only supply as many arguments as desired, and can leave fields empty as needed. For example, the following class definitions are all valid, and all compile to the same phoneme set: @, @<>, @<,>, and @<,,>.
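As a rough illustration of why those four spellings are equivalent, the sketch below parses a class expression into its class character plus three argument fields, padding missing fields with empty strings. The helper names split-on-comma and parse-class-spec are hypothetical and are not part of cl-phonetic.

```lisp
;; Hypothetical parser for class expressions such as "#<v,,f>" or "@<,>".

(defun split-on-comma (string)
  "Split STRING on commas, keeping empty fields."
  (loop with start = 0
        for pos = (position #\, string :start start)
        collect (subseq string start pos)
        while pos
        do (setf start (1+ pos))))

(defun parse-class-spec (spec)
  "Parse SPEC into (CLASS-CHAR ARG1 ARG2 ARG3); an empty ARG means no restriction."
  (let* ((lt (position #\< spec))
         (gt (position #\> spec))
         (body (if (and lt gt) (subseq spec (1+ lt) gt) ""))
         (fields (split-on-comma body)))
    ;; Missing trailing fields are treated the same as empty ones.
    (list (char spec 0)
          (or (first fields) "")
          (or (second fields) "")
          (or (third fields) ""))))
```

Under this reading, @, @<>, @<,>, and @<,,> all parse to the same value, while #<v,,f> parses to the # class with "v" in the first field and "f" in the third.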

Consonant Class Options

For consonant classes (the #<,,> pattern), up to three arguments can be specified:

  • First, a single character which restricts matches based on voicing.
  • Second, a sequence of characters which restricts matches based on place of articulation.
  • Third, a sequence of characters which restricts matches based on manner of articulation.

When multiple characters are supplied for a single parameter, the resulting matches are a union over those characters. That is, there’s an implicit OR over your arguments.

Consonant voicing arguments:

| Character | Restricts Matches To |
|---|---|
| v | Voiced |
| u | Unvoiced |

Consonant place-of-articulation arguments:

| Character | Restricts Matches To |
|---|---|
| a | Alveolar |
| b | Bilabial |
| d | Dental |
| g | Glottal |
| l | Labio-dental |
| p | Post-alveolar |
| t | Palatal |
| v | Velar |

Consonant manner-of-articulation arguments:

| Character | Restricts Matches To |
|---|---|
| a | Affricate |
| f | Fricative |
| l | Lateral |
| n | Nasal |
| p | Plosive |
| x | Approximant |

Examples:

| Phoneme Class Definition | What It Matches |
|---|---|
| # | All consonants |
| #<,,> | All consonants |
| #<v> | All voiced consonants |
| #<v,,> | All voiced consonants |
| #<,,p> | All plosive consonants |
| #<v,,p> | All consonants which are both voiced and plosive |
| #<,bd,> | All consonants which are either bilabial or dental |
| #<,,fa> | All consonants which are either fricative or affricate |
| #<u,bd,fa> | All consonants which are unvoiced, and also either bilabial or dental, and also either fricative or affricate |
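The combination rule (OR within a field, AND across fields) can be sketched as a filter over a feature table. The table below covers only a subset of the ARPABET consonants, and its feature assignments follow standard phonetics rather than cl-phonetic's own data, so treat both the table and the names *consonant-features*, field-allows-p, and consonants-matching as assumptions.

```lisp
;; Hypothetical feature table: (phoneme voicing place manner), using the
;; argument characters from the tables above. Subset only; the library's
;; own classification may differ.
(defparameter *consonant-features*
  '(("P"  #\u #\b #\p)
    ("B"  #\v #\b #\p)
    ("M"  #\v #\b #\n)
    ("F"  #\u #\l #\f)
    ("V"  #\v #\l #\f)
    ("TH" #\u #\d #\f)
    ("DH" #\v #\d #\f)
    ("S"  #\u #\a #\f)
    ("Z"  #\v #\a #\f)
    ("CH" #\u #\p #\a)
    ("JH" #\v #\p #\a)))

(defun field-allows-p (field feature)
  "An empty FIELD allows anything; otherwise FEATURE must be one of its characters."
  (or (zerop (length field))
      (find feature field)))

(defun consonants-matching (voicing place manner)
  "Return the phonemes allowed by all three argument strings."
  (loop for (phoneme v p m) in *consonant-features*
        when (and (field-allows-p voicing v)
                  (field-allows-p place p)
                  (field-allows-p manner m))
          collect phoneme))
```

For instance, (consonants-matching "u" "bd" "fa") corresponds to #<u,bd,fa>, and with this subset table only TH survives all three restrictions.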

Vowel Class Options

For vowel classes (the @<,,> pattern), three parameters may also be specified:

  • First, height
  • Second, backness
  • Third, roundedness

The first two of these categories are fairly fluid, and so are encoded as numbers. As with consonants, when multiple characters are supplied for a single parameter, the resulting matches are a union over those characters. That is, there’s an implicit OR over your arguments.

Vowel height arguments:

| Character | Restricts Matches To |
|---|---|
| 1 | Open |
| 2 | Near Open |
| 3 | Open Mid |
| 4 | Mid |
| 5 | Close Mid |
| 6 | Near Close |
| 7 | Close |

Vowel backness arguments:

| Character | Restricts Matches To |
|---|---|
| 1 | Front |
| 2 | Central |
| 3 | Back |

Vowel roundedness arguments:

| Character | Restricts Matches To |
|---|---|
| r | Rounded |
| u | Unrounded |

Examples:

| Phoneme Class Definition | What It Matches |
|---|---|
| @ | All vowels |
| @<,,> | All vowels |
| @<,,r> | All rounded vowels |
| @<12,,u> | All vowels which are unrounded and either open or near open in height |
| @<,23> | All vowels with either central or back backness |

Diphthongs and the r-colored phoneme are, for now, excluded whenever any restrictions are applied. They will only match a plain @ or their associated phoneme literals.

✅ Phonetic Metapatterns via Regular Expression Generators

cl-phonetic can function as a rhyming dictionary by way of phonetic metapatterns. Other literary devices, like assonance, consonance, and alliteration, can also be queried.

A phonetic metapattern is a function which transforms a pronunciation (the phoneme sequence associated with a word) into a regular expression. This resulting regular expression implements the given metapattern over the given word.

rhyme

The 'rhyme metapattern, applied to a target word, produces a regular expression which matches words that rhyme with it. A rhyming word is defined here as any phoneme sequence whose phonemes match the target exactly from its first vowel phoneme onward; for example, “roses” yields .* OW Z IH Z. With the :loose option, additional consonant phonemes may be interspersed.
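A minimal sketch of such a generator follows, working on a plain list of phoneme strings rather than the library's pronunciation objects. The names *vowels*, vowel-p, and sketch-rhyme-regex are hypothetical; the vowel list is taken from the ARPABET table earlier in this document, and the output omits the trailing space seen in some of the library's results.

```lisp
;; Hypothetical re-creation of the rhyme metapattern on phoneme lists.
(defparameter *vowels*
  '("AA" "AE" "AH" "AO" "AW" "AY" "EH" "ER" "EY"
    "IH" "IY" "OW" "OY" "UH" "UW"))

(defun vowel-p (phoneme)
  "True when PHONEME is one of the ARPABET vowels."
  (member phoneme *vowels* :test #'string=))

(defun sketch-rhyme-regex (phonemes &key loose)
  "Build a rhyme regex string from PHONEMES, a list of phoneme strings."
  ;; Keep everything from the first vowel onward.
  (let ((tail (member-if #'vowel-p phonemes)))
    (format nil ".*~{ ~A~}"
            (if loose
                ;; Allow extra consonants after every kept phoneme.
                (loop for p in tail append (list p "#*"))
                tail))))
```

With the pronunciation R OW Z IH Z, the strict form gives ".* OW Z IH Z" and the loose form gives ".* OW #* Z #* IH #* Z #*", which lines up with the generate-regex output shown in The Short Version.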

consonance

The 'consonance metapattern produces a regular expression which matches all words containing the same sequence of consonant phonemes as the target word. Vowel phonemes are ignored. With the :loose option, additional consonants may be interspersed.
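One way to picture consonance is as a comparison of consonant skeletons. The sketch below is a predicate rather than a regex generator, and its loose mode is only a guess that is consistent with the example output in The Short Version: the first consonant must match, and extra consonants may follow each matched one. All function names are hypothetical.

```lisp
;; Hypothetical consonance check on lists of phoneme strings.

(defun consonant-p (phoneme)
  "True when PHONEME is not one of the ARPABET vowels."
  (not (member phoneme
               '("AA" "AE" "AH" "AO" "AW" "AY" "EH" "ER" "EY"
                 "IH" "IY" "OW" "OY" "UH" "UW")
               :test #'string=)))

(defun consonant-skeleton (phonemes)
  "Drop the vowels from PHONEMES."
  (remove-if-not #'consonant-p phonemes))

(defun subsequence-p (needle haystack)
  "True when the items of NEEDLE occur in HAYSTACK in order, possibly with gaps."
  (cond ((null needle) t)
        ((null haystack) nil)
        ((string= (first needle) (first haystack))
         (subsequence-p (rest needle) (rest haystack)))
        (t (subsequence-p needle (rest haystack)))))

(defun consonance-p (target candidate &key loose)
  "Compare the consonant skeletons of TARGET and CANDIDATE."
  (let ((a (consonant-skeleton target))
        (b (consonant-skeleton candidate)))
    (cond ((not loose) (equal a b))
          ((null a) t)
          ;; Assumed loose rule: same first consonant, extras allowed afterwards.
          (t (and b
                  (string= (first a) (first b))
                  (subsequence-p (rest a) (rest b)))))))
```

For “babble” (B AE B AH L) the skeleton is B B L, so “bubble” matches strictly, while “obtainable” (AH B T EY N AH B AH L) only matches in loose mode.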

assonance

The 'assonance metapattern produces a regular expression which matches all words containing the same sequence of vowel phonemes as the target word. Consonant phonemes are ignored. With the :loose option, additional vowels may occur before or after the matched sequence.

alliteration

The 'alliteration metapattern produces a regular expression which matches all words which begin with the same phoneme as the target word.

⛏ Pronunciation Inferencing

Arbitrary character sequence to phoneme sequence mapping. Sylvia has a quirky ruleset for this, which works fairly well. But it might be more fun to fit a transducer instead.

⛏ Popularity Filtering & Sorting

Allow searches to be applied in order of word popularity, and limit by either popularity threshold or total match count. Helps to prevent obscure words cluttering results.
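Since this feature is not implemented yet, the following is only a sketch of what popularity filtering could look like on a list of matched words. The function rank-by-popularity, its keyword arguments, and the frequency table are all hypothetical, and any counts passed in are the caller's own data.

```lisp
;; Hypothetical popularity filter; FREQUENCIES is an EQUAL hash table
;; mapping words to usage counts supplied by the caller.
(defun rank-by-popularity (words frequencies &key (min-count 0) limit)
  "Sort WORDS by descending count, drop rare words, and keep at most LIMIT."
  (let* ((kept (remove-if (lambda (w)
                            (< (gethash w frequencies 0) min-count))
                          words))
         ;; Copy before sorting so the caller's list is never modified.
         (sorted (stable-sort (copy-list kept) #'>
                              :key (lambda (w) (gethash w frequencies 0)))))
    (if (and limit (< limit (length sorted)))
        (subseq sorted 0 limit)
        sorted)))
```

Words missing from the table count as zero, so any positive :min-count drops them.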

💡 Corpus Statistics

Calculating phoneme N-grams, at the bare minimum. Basically a quick path for processing large corpora.
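As this is still at the planning stage, here is only a rough sketch of phoneme N-gram counting over pronunciations given as lists of phoneme strings. The function name count-phoneme-ngrams is hypothetical.

```lisp
;; Hypothetical N-gram counter over lists of phoneme strings.
(defun count-phoneme-ngrams (pronunciations n)
  "Return an EQUAL hash table mapping each length-N phoneme list to its count."
  (let ((counts (make-hash-table :test #'equal)))
    (dolist (phonemes pronunciations counts)
      ;; Slide a window of N phonemes along each pronunciation.
      (loop for tail on phonemes
            while (>= (length tail) n)
            do (incf (gethash (subseq tail 0 n) counts 0))))))
```

Counting bigrams over K AE T and B AE T, for example, records the pair AE T twice and the pairs K AE and B AE once each.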

User Manual

Reading a Phonetic Dictionary

Currently, only cmudict-like text files are supported.

(defparameter *dict* (from-cmudict #P"cmudict"))
*DICT*

Pronounce a word.

pronounce-word produces a list of pronunciation objects.

Sometimes, there’s just one pronunciation in it:

(pronounce-word *dict* "creepy")
(#<PRONUNCIATION (K R IY P IY)>)
T

Sometimes, there’s more:

(pronounce-word *dict* "tomato")
(#<PRONUNCIATION (T AH M AA T OW)> #<PRONUNCIATION (T AH M EY T OW)>)
T

Search for words matching a phonetic regular expression.

regex-search returns an alist of words (strings) and pronunciation lists.

(regex-search *dict* "K AE T")
(("katt" #<PRONUNCIATION (K AE T)>) ("kat" #<PRONUNCIATION (K AE T)>)
 ("catt" #<PRONUNCIATION (K AE T)>) ("cat" #<PRONUNCIATION (K AE T)>))

the-words takes an alist of that form and returns a list of words.

(the-words (regex-search *dict* "K AE T"))
("katt" "kat" "catt" "cat")
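Given the shape of the result above, where each entry is a word followed by its pronunciations, the behaviour of the-words can be approximated in one line. The name sketch-the-words is hypothetical, and this is a guess at the behaviour rather than the library's actual definition.

```lisp
;; Hypothetical stand-in for the-words: keep the first element of each entry.
(defun sketch-the-words (matches)
  "Return just the word strings from MATCHES, a list of (word pronunciation...)."
  (mapcar #'first matches))
```

The pronunciation objects are never inspected, so any placeholder value works when trying it out at the REPL.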

The regex are generally Perl-like. Searching is done as “matches”, meaning that the word’s pronunciation must match the entire regex. Add .* to both ends if you want a scanning behavior.

(the-words (regex-search *dict* ".* K AE T .*"))
("yekaterinburg" "wildcatting" "wildcatters" "wildcatter" "wildcats" "wildcat"
 "wicat" "tomcat" "thundercats" "thundercat" "scattershot" "scattering"
 "scattergory" "scattergories" "scattergood" "scattered" "scatter" "scatology"
 "scatological" "scat" "pussycats" "pussycat" "polecats" "polecat" "piscataway"
 "muscat" "metlakatla" "mchatton" "mcatee" "kotsonis's" "kotsonis'" "kotsonis"
 "kitcat" "kikatte" "katzman" "katzin" "katzer" "katzenstein" "katzenberger"
 "katzenberg's" "katzenberg" "katzenbach" "katzen" "katz" "kattner" "katt"
 "katsushi" "katsaros" "katsanos" "kats" "katmandu" "katashiba" "kat"
 "copycatting" "copycats" "copycat" "concatenation" "concatenating"
 "concatenates" "concatenated" "concatenate" "catwoman" "catwalk" "catty"
 "catton" "catto" "cattlemen's" "cattlemen" "cattle" "catterton" "catterson"
 "catterall" "cattanach" "catt" "catskills" "catskill" "cats" "catron"
 "catrett" "catrambone" "caton" "catoe" "catnip" "catnap" "catlin" "catlike"
 "catlett" "catledge" "catkins" "catfish" "caterwaul" "caterpiller's"
 "caterpiller" "caterpillars" "caterpillar's" "caterpillar" "category"
 "categorizing" "categorizes" "categorized" "categorize" "categorization"
 "categories" "categorically" "categorical" "catechism" "catcalls" "catcall"
 "catbird" "catatonic" "catastrophic" "cataracts" "cataract" "catapults"
 "catapulting" "catapulted" "catapult" "catamount" "catalyzed" "catalyze"
 "catalytic" "catalysts" "catalyst's" "catalyst" "catalonian" "catalonia"
 "cataloguing" "catalogues" "catalogued" "catalogue" "catalogs" "cataloging"
 "catalogers" "cataloger" "cataloged" "catalog" "catalina" "catalans" "catalan"
 "catala" "catain" "catacombs" "catacomb" "cataclysmic" "cataclysm"
 "cat-o-nine-tails" "cat-6" "cat-4" "cat-3" "cat-2" "cat-1" "cat's" "cat"
 "bobcats" "bobcat" "bacot")

Again, anything that works with Perl should work here. .? translates to “optionally, a single phoneme of any kind”.

(the-words (regex-search *dict* ".? AE T"))
("vat" "that" "tat" "shatt" "schadt" "sat" "ratte" "rat" "patt" "pat" "nat"
 "matte" "matt" "mat" "lat" "katt" "kat" "jagt" "hatt" "hat" "gnat" "gatt"
 "gat" "fat" "dat" "chat" "catt" "cat" "bhatt" "batte" "batt" "bat" "at")

And so on.

Consonants are encoded with # symbols.

(the-words (regex-search *dict* "# AE T"))
("vat" "that" "tat" "shatt" "schadt" "sat" "ratte" "rat" "patt" "pat" "nat"
 "matte" "matt" "mat" "lat" "katt" "kat" "jagt" "hatt" "hat" "gnat" "gatt"
 "gat" "fat" "dat" "chat" "catt" "cat" "bhatt" "batte" "batt" "bat")

They can be further restricted by voicing, place of articulation, and manner of articulation.

For example, here are the words ending with “AE T” that begin with a voiced, fricative consonant:

(the-words (regex-search *dict* "#<v,,f> AE T"))
("vat" "that")

And the words ending with “AE T” that begin with a bilabial, plosive consonant:

(the-words (regex-search *dict* "#<,b,p> AE T"))
("patt" "pat" "bhatt" "batte" "batt" "bat")

And the words ending with “AE T” that begin with a bilabial or labio-dental consonant:

(the-words (regex-search *dict* "#<,bl,> AE T"))
("vat" "patt" "pat" "matte" "matt" "mat" "fat" "bhatt" "batte" "batt" "bat")

All single-syllable words consisting of a “B” phoneme, a single vowel, and a “D” phoneme:

(the-words (regex-search *dict* "B @ D"))
("byrd" "burd" "budde" "budd" "bud" "boyde" "boyd" "bowed" "booed" "bode"
 "bird" "bide" "bid" "beede" "bede" "bed" "bead" "bayed" "bawd" "baud" "bade"
 "bad" "baade")

The previous expression, restricted to vowels with a height between open and mid, inclusive.

(the-words (regex-search *dict* "B @<1234,,> D"))
("budde" "budd" "bud" "bed" "bawd" "baud" "bad" "baade")

Generating a phonetic regular expression

generate-regex creates a phonetic regular expression from a predefined metapattern and a word.

(generate-regex 'rhyme (first (pronounce-word *dict* "Candor")))
.* AE N D ER

Searching for this regex yields words that perfectly rhyme with “Candor”.

(the-words (regex-search *dict*
                         (generate-regex 'rhyme
                                         (first (pronounce-word *dict* "Candor")))))
("zander" "wicklander" "vandevander" "vander" "telander" "swartzlander"
 "subcommander" "standre" "stander" "stadtlander" "slander" "skenandore"
 "sjolander" "scalamandre" "santander" "sandor" "sander" "salamander"
 "rosander" "rander" "philander" "pander" "oleander" "nederlander" "meander"
 "mcalexander" "mander" "mainlander" "lysander" "leander" "landor" "lander"
 "highlander" "hander" "grander" "glander" "gerrymander" "gander" "evander"
 "coriander" "commander" "candor" "calamander" "bystander" "brander" "blander"
 "bander" "aulander" "ander" "alexander" "aleksandr" "aleksander")

But if all you’re going to do is search for the generated regex, just use find-metapattern.

Searching for rhymes, and other metapatterns

find-metapattern wraps the process of generating a regular expression & searching it:

(the-words (find-metapattern *dict* 'rhyme "Turkey" :loose t))
("yerkey" "yerkes" "yaworski" "xerxes" "workweeks" "workweek" "worksheets"
 "worksheet" "tyburski" "twersky" "turski" "turnkey" "turkeys" "turkey's"
 "turkey" "swirsky" "swiderski" "sturkie" "stachurski" "sircy" "shirkey"
 "quirky" "purkey" "podgurski" "pirkey" "persky" "perky" "perkey" "pearcy"
 "murky" "mirsky" "merkley" "merkey" "kuberski" "koperski" "kirksey" "kirkley"
 "kirkey" "kirkby" "kasperski" "jerky" "hirschfield" "gursky" "gurski" "girsky"
 "gerski" "gerke" "figurski" "dworsky" "durkee" "burkley" "burkey" "burkeen"
 "birky" "birkey" "bertke" "berkley" "berklee" "berkey" "berkeley's" "berkeley"
 "anarchy" "aldercy" "albuquerque")

test-metapattern just tests whether a metapattern holds over two words.

Here, it does:

(test-metapattern *dict* 'alliteration "Xenon" "Czar")
(("Czar" #<PRONUNCIATION (Z AA R)>))

And here, it does not:

(test-metapattern *dict* 'rhyme "Wallet" "Stanford")
NIL

About

License: MIT License


Languages

Language:Common Lisp 100.0%