QueraltSM / Soundex

TDD Kata for encoding last names into a 4-character string

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Soundex

TDD Kata for encoding last names into a 4-character string

Soundex is a known algorithm for encoding last names into a 4-character string.
The goal is to encode similar-sounding names to the same representation,
so that searches with slightly misspelled names will still find appropriate matches.

The rules for Soundex encoding:

1. Retain the first letter of the name and drop all other occurrences of a, e, i, o, u, y, h, w.

2. Replace consonants with digits as follows (after the first letter):

b, f, p, v → 1
c, g, j, k, q, s, x, z → 2
d, t → 3
l → 4
m, n → 5
r → 6

3. If two or more letters with the same number are adjacent in the original name (before step 1), only retain the first letter; also two letters with the same number separated by ‘h’ or ‘w’ are coded as a single number, whereas such letters separated by a vowel are coded twice. This rule also applies to the first letter.

4. If you have too few letters in your word that you can’t assign three numbers, append with zeros until there are three numbers. If you have more than 3 letters, just retain the first 3 numbers.


Using this algorithm:

  • both "Robert" and "Rupert" return the same string "R163"
  • "Rubin" yields "R150"
  • "Ashcraft" and "Ashcroft" both yield "A261"
  • "Tymczak" yields "T522" not "T520", because the chars 'z' and 'k' in the name are coded as 2 twice since a vowel lies in between them
  • "Pfister" yields "P236" not "P123", because the first two letters have the same number and are coded once as 'P'
  • "Honeyman" yields "H555"

About

TDD Kata for encoding last names into a 4-character string


Languages

Language:Swift 100.0%