Genealogical Computing
4/1/1998 - Archive

Spring 1998 Vol. 17, No. 4

Cracking the Soundex Code

Soundex, the phonetic coding system so common in genealogy today, has been used since the early 1900s. Though applied in many fields, it is best known in genealogy as the method used to group similar-sounding surnames in the government-produced indexes of the 1880, 1900, 1910, and 1920 federal censuses. Aside from these indexes, the Soundex code continues to be used in the genealogical community, often to expand search functionality in a computer database.

Much has changed, however, in the decades since Soundex’s creation. It is much less accurate than its present-day competitors and should be replaced as today’s algorithm of choice for computer-assisted phonetic searching in the genealogical community.

Where Soundex Breaks Down
Soundex instructions are sparse and vague. Although Soundex has been around for almost 80 years, little has been written to help people use it. Bob Velke pointed out in a recent GENTECH lecture that this lack of documentation has given rise to a number of “variants” of the Soundex code, which can prevent researchers from finding the references they are looking for. What little has been written is often incomplete or conflicts with other instructions.

References to the coding of surname prefixes such as Van, De, La, or Le state that historical Soundex indexing may or may not have taken such prefixes into account. Such instructions do not, however, refer to any reasoning behind the inconsistency or to a recommended course for indexing in the future. A National Archives leaflet on the subject states that “Mc” and “Mac” are not considered prefixes as described above, despite the fact that their removal from a surname between generations or during immigration was not altogether uncommon. Other surname prefixes (e.g., the Irish O’ in O’Brian) are not mentioned.

The Soundex code is self-limiting. The format of any Soundex code is a letter followed by three numbers. The first letter is taken from the beginning of the coded name, so surnames which sound alike but begin with different letters of the alphabet are not Soundex equivalents. Moreover, the procedure stops coding a name after the third number, so longer names are less likely to be uniquely identified than shorter ones.
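The letter-plus-three-numbers format described above can be sketched in Python. The article does not list the coding rules, so the letter groups and the treatment of H and W below follow the rules as they are commonly published; given the variants already noted, treat them as one assumption among several. The function name `soundex` is illustrative only.

```python
# Letter groups as commonly published for Soundex; the article itself does
# not list them, so this table is an assumption.
GROUPS = ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"]
CODES = {ch: str(d) for d, letters in enumerate(GROUPS, start=1) for ch in letters}


def soundex(name: str) -> str:
    """Return a letter followed by three digits, or "" if no letters are given."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]                  # the first letter is kept as-is
    prev = CODES.get(letters[0], "")     # its code still suppresses a repeat
    for ch in letters[1:]:
        code = CODES.get(ch, "")         # vowels, H, W, and Y have no code
        if code and code != prev:
            result += code
            if len(result) == 4:         # coding stops at the third digit
                break
        if ch not in "HW":               # H and W do not separate like codes
            prev = code
    return result.ljust(4, "0")          # short names are padded with zeros


for surname in ["Kramer", "Cramer", "Hendricks", "Hendrickson"]:
    print(surname, soundex(surname))
```

Under these assumed rules, Kramer and Cramer receive K656 and C656 and so are never matched, while Hendricks and Hendrickson both stop at H536 because nothing after the third number is examined.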

Soundex’s inability to fully encode longer words is indicative of another weakness in the algorithm: the system works best with common Anglo-Saxon names. Languages with markedly different phonology tend to invalidate the rules upon which Soundex was based. This is particularly true of languages which must first be transliterated to the Roman alphabet, a process which provides only an approximation of the original.

How the Computer Can Help
Arguably, the Soundex code’s simplicity has advantages: it is easy to understand and easy to use—even with paper and pencil. In a digital medium, however, other more-complicated and more-accurate systems will improve results without confusing the user. The phonetic aspect of a given search can be completely automatic. Ideally, a researcher should not be forced to understand the detailed underpinnings of such an algorithm; he or she should only have to understand that it works.

Whatever system is used, it is important to inform end-users about the methods employed in a program or database. No code can fully compensate for all surname variations, and researchers should know when to exercise their own initiative to search “blind spots” unaccounted for by the programmer. A section in the online documentation or accompanying user’s manual should address these issues and explain potential problems.

Interestingly enough, the growth of the Internet has been partially responsible for the recent attention given to phonetic codes by programmers. Some of the mammoth indexing and retrieval programs we call search engines are adding phonetic approximation to their list of features to help compensate for misspellings from users.

Getting a Smarter Code
A simple change to the Soundex code can solve some of its problems. Surnames which are considered equivalent but differ in their initial letter become equivalents in the Extended Soundex, where the first letter is Soundex coded like the remainder of the name. Unfortunately, this scheme can also introduce additional error by increasing the number of unrelated names that share a code.
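The effect of coding the first letter can be sketched as follows. The article gives no specification for Extended Soundex, so everything here is an assumption made for illustration: the four-digit output, the use of 0 for an initial letter that has no code, the letter groups themselves, and the hypothetical function name `coded_initial_soundex`.

```python
# Letter groups as commonly published for Soundex (an assumption; the
# article does not list them).
GROUPS = ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"]
CODES = {ch: str(d) for d, letters in enumerate(GROUPS, start=1) for ch in letters}


def coded_initial_soundex(name: str) -> str:
    """Hypothetical variant: the first letter is coded like every other letter."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    # Assumption: an initial vowel, H, W, or Y is written as 0.
    digits = CODES.get(letters[0], "0")
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:        # skip repeats of the same code
            digits += code
            if len(digits) == 4:         # assumed length: four digits
                break
        if ch not in "HW":               # H and W do not separate like codes
            prev = code
    return digits.ljust(4, "0")


for surname in ["Kramer", "Cramer", "Bender", "Painter"]:
    print(surname, coded_initial_soundex(surname))
```

In this sketch Kramer and Cramer now share the code 2656, which is the intended gain. Bender and Painter also share a code, 1536, although a code that keeps the first letter would have separated them. This is the kind of added error the scheme can introduce.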

Other phonetic codes are not so crude. Guth Coding and the Daitch-Mokotoff system, for example, are more complex in their construction and were created to address the particular needs of French and Eastern European phonology, respectively. Although quite successful in their spheres (especially compared to the Soundex code), they, too, are limited by the very standards which spurred their creation; their added abilities apply only to the languages which defined them.

Other commercial systems are at least as universal as the Soundex code while achieving significantly better accuracy. PhDbase III™, used in Wholly Genes’ The Master Genealogist, and Metaphone™, used by Progeny Software, are two examples. The method used by the LDS Church in such databases as the International Genealogical Index™ relies on no conversion code at all. Instead, surname equivalents are determined by consulting a lengthy table compiled using a complicated set of linguistic principles and over 40 years of experience in processing genealogical data.

Soundex is certainly a great help to those researching federal census records. It was perhaps the best tool available sixty years ago for the creation of the indexes we use today. It is certainly a better code than no code at all. For today’s use, however, and especially in view of the processing abilities of the personal computer, genealogical programmers should look elsewhere for a system to identify phonetic equivalents. Doing so will help genealogists find what they’re looking for more easily.

Jake Gehring is presently employed with Ancestry Incorporated in Utah. He graduated from Brigham Young University with a degree in genealogy/family history. Jake is a genealogical lecturer, editor, and author, and can be reached at RootsSeekr@aol.com.

