Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling. A homophone is a word that is pronounced the same as another word but differs in meaning and spelling. For example-carat, caret, and carrot, or to, two, and too.
The algorithm mainly encodes consonants; a vowel will not be encoded unless it is the first letter. Soundex is the most widely known of all phonetic algorithms .
The concept can be used to provide alternative spelling for incorrect words or something like Google’s “Did you mean ?” functionality ( not as advance though )! Soundex codes can be used to list words with the same code in case the original word is not found in the known list of valid words. Something similar to spell check feature found in most word-processors. This method is very old (first one was patented in 1918) and more advanced variation are available now. However, it’s still good enough to be asked in computer practical examination!
Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling. A homophone is a word that is pronounced the same as another word but differs in meaning and spelling. For example-carat, caret, and carrot, or to, two, and too.
The algorithm mainly encodes consonants; a vowel will not be encoded unless it is the first letter. Soundex is the most widely known of all phonetic algorithms .
American Soundex
The Soundex code for a name consists of a letter followed by three numerical digits: the letter is the first letter of the name, and the digits encode the remaining consonants. Similar sounding consonants share the same digit so, for example, the labial consonants B, F, P, and V are each encoded as the number 1.
The correct value can be found as follows:
- Retain the first letter of the name and drop all other occurrences of a, e, i, o, u, y, h, w.
- Replace consonants with digits as follows (after the first letter):
- b, f, p, v => 1
- c, g, j, k, q, s, x, z => 2
- d, t => 3
- l => 4
- m, n => 5
- r => 6
- If two or more letters with the same number are adjacent in the original name (before step 1), only retain the first letter; also two letters with the same number separated by ‘h’ or ‘w’ are coded as a single number, whereas such letters separated by a vowel are coded twice. This rule also applies to the first letter.
- Iterate the previous step until you have one letter and three numbers. If you have too few letters in your word that you can’t assign three numbers, append with zeros until there are three numbers. If you have more than 3 letters, just retain the first 3 numbers.
Using this algorithm, both “Robert” and “Rupert” return the same string “R163” while “Rubin” yields “R150”. “Ashcraft” and “Ashcroft” both yield “A261” and not “A226” (the chars ‘s’ and ‘c’ in the name would receive a single number of 2 and not 22 since an ‘h’ lies in between them). “Tymczak” yields “T522” not “T520” (the chars ‘z’ and ‘k’ in the name are coded as 2 twice since a vowel lies in between them). “Pfister” yields “P236” not “P123” (the first two letters have the same number and are coded once as ‘P’).
The soundex implementation in Java programming language is as follows:
class Soundex{ private static int getConsonantCode( char ch ){ String codeList[] = { "BFPV", "CGJKQSXZ","DT","L","MN","R" }; int code = 0; for( int i = 0 ; i < codeList.length ; i++ ){ if( codeList[i].indexOf(ch) >= 0 ) { code = i+1; } } return code; } private static boolean isVowel( char ch ){ return (new String("AEIOUaeiou")).indexOf(ch) >= 0 ; } public static String getSoundexCode( String str ){ str=str.toUpperCase(); String soundexCode = "" + str.charAt(0), temp=""; int length = str.length(); char curr, prev, next;{ } String dropList = "AEIOUYHW"; for( int i=1 ; i< length ; i++ ){ curr = str.charAt(i); prev = str.charAt( i-1 ); if( ( curr=='H' || curr == 'W') && i != length-1 ){ if( temp.length() >= 2) temp=temp.substring(1); next=str.charAt( i+1 ); if( isVowel(curr) && getConsonantCode( prev ) == getConsonantCode( next ) ){ temp += prev+prev; i=i+1; }else if( getConsonantCode( prev ) == getConsonantCode( next ) ){ temp += prev; i=i+1; } }else if( getConsonantCode( curr ) != getConsonantCode( prev ) ){ if( dropList.indexOf( curr ) == -1 ){ temp += curr; } } } temp = ( temp + "0000" ).substring( 0, 3 ); for( int i = 0; i<=2 ; i++ ){ soundexCode += getConsonantCode( temp.charAt( i ) ); } return soundexCode; } public static void main( String []args ){ System.out.println( getSoundexCode( "Ashcraft" )+" "+"A261" ); System.out.println( getSoundexCode( "Ashcroft" )+" "+"A261" ); System.out.println( getSoundexCode( "Gauss" )+" "+"G200" ); System.out.println( getSoundexCode( "Ghosh" )+" "+"G200" ); System.out.println( getSoundexCode( "Hilbert" )+" "+"H416" ); System.out.println( getSoundexCode( "Heilbronn" )+" "+"H416" ); System.out.println( getSoundexCode( "Lee" )+" "+"L000" ); System.out.println( getSoundexCode( "Lloyd" )+" "+"L300" ); System.out.println( getSoundexCode( "Moses" )+" "+"M220" ); System.out.println( getSoundexCode( "Pfister" )+" "+"P236" ); System.out.println( getSoundexCode( "Robert" )+" "+"R163" ); System.out.println( getSoundexCode( "Rupert" )+" "+"R163" ); System.out.println( getSoundexCode( "Rubin" )+" "+"R150" ); System.out.println( getSoundexCode( "Tymczak" )+" "+"T522" ); System.out.println( getSoundexCode( "Soundex" )+" "+"S532" ); System.out.println( getSoundexCode( "Example" )+" "+"E251" ); } }
The above code is pretty self-explanatory but the use of indexOf() function at line number deserves some explanation. The function indexOf( ) returns the position of the argument within the string against which it’s called, it returns -1 if the character passed as an argument doesn’t exist in the string. If the value returned is zero or more it indicates the presence of the character passed as argument and if the value returned is negative one it indicates the absence of the character passed as an argument. This little trick allows us to write compact code which would would otherwise have taken a lot of code. The main function contains a lot of test cases. The main() function displays the soundex code along with the correct code, so that the output can be verified visually.