Genealogical Computing
7/1/2001 - Archive

Summer 2001 Vol. 21.1

GEDCOM and the GenTech Testbook Project

Congratulations! The big promotion is yours. The company will arrange for a moving firm to transfer all your possessions from your home in Suburbia to the company-owned condo in Metropolis. This is a dream come true, and you need not lift a finger.

On the appointed day, a small army of men in white coveralls appears at your home with a large van. As you watch, they carefully catalogue every item with its exact location before packing it and carrying it to the van. Assured of their competence, you know this will go well.

A week later you are met by the concierge at your new apartment building and escorted to the penthouse suite that is your new home. At first glance, everything appears exactly as it was in the old house. Even the pictures are in the right place. Then you see a note from the movers:

Dear Sir or Madam,
Due to our failure to understand some of the instructions for unpacking your goods, we were unable to unpack some items. There were also a few items for which an equivalent storage area could not be found. These items have been placed in the storage room for your personal attention.

You quickly locate the storage room. Throwing open the door, you find it jammed full of packing boxes. Now fully aware of the extent of the problem, you start checking for other problems. Where are the linens? In the old house they were in the hall linen closet. The condo has no hall linen closet. You find them in the master bedroom closet. Then you notice that your son’s bedroom furniture is in your daughter’s bedroom and her furniture is in his room. Suddenly, you realize that the garden furniture and shop tools are missing. A call to the movers does nothing to alleviate your concern. They have no idea where the missing items could be, but they assure you that they will do their best to find them.

The Problem
The above scenario is analogous to a GEDCOM transfer. When a GEDCOM file is created, all the data, or hopefully most of it, contained in the program’s database is itemized, tagged with an identifying marker, and placed in a structure as defined by the GEDCOM standard. This information is then recorded in a text file that can be interpreted by any genealogical program supporting the GEDCOM standard. At least, that is the theory.
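
To make the "itemized, tagged, and placed in a structure" idea concrete, here is a minimal sketch in Python of how one line of a GEDCOM file might be split into its parts. The sample record, the GedcomLine type, and the parse_line function are illustrative assumptions, not part of the Testbook Project or of any program tested; a real importer must also handle character sets, continuation lines, and malformed input.

```python
from typing import NamedTuple, Optional


class GedcomLine(NamedTuple):
    """One line of a GEDCOM file (hypothetical helper type)."""
    level: int            # nesting depth: 0 for a record, 1+ for details
    xref: Optional[str]   # cross-reference ID such as @I1@, if present
    tag: str              # identifying marker, e.g. NAME, BIRT, DATE
    value: str            # the data itself; may be empty


def parse_line(raw: str) -> GedcomLine:
    """Split 'level [@xref@] TAG [value]' into its parts."""
    parts = raw.strip().split(" ", 2)
    level = int(parts[0])
    if len(parts) > 1 and parts[1].startswith("@"):
        # The second token is a cross-reference ID, so the tag follows it.
        rest = parts[2].split(" ", 1) if len(parts) > 2 else [""]
        return GedcomLine(level, parts[1], rest[0],
                          rest[1] if len(rest) > 1 else "")
    value = parts[2] if len(parts) > 2 else ""
    return GedcomLine(level, None, parts[1], value)


# Invented sample data for illustration only.
SAMPLE = """0 @I1@ INDI
1 NAME John /Smith/
1 BIRT
2 DATE 1 JAN 1900
2 PLAC Calgary, Alberta"""

records = [parse_line(line) for line in SAMPLE.splitlines()]
for rec in records:
    print(rec)
```

The level number is what carries the structure: the DATE and PLAC lines belong to the birth event only because they sit one level below BIRT.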

In practice, we often find that we are trying to move data from one program to a very dissimilar one. The receiving program might not have a designated place for the data. It may also misinterpret or fail to understand the identifying tags. In these situations, each program might take a different approach to solve the problem. One program might inform you that it can’t read the data. Another program may place all the unidentified information into a note field, and yet another might create a storage location but leave it up to you to properly identify it. For the user expecting a complete and accurate transfer of data, the exercise will likely be a disaster.
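
The three approaches described above can be sketched as import policies. This is only an illustration in Python: the Policy names, the import_items function, and the set of supported tags are hypothetical and do not describe how any particular program behaves.

```python
from enum import Enum


class Policy(Enum):
    """Hypothetical ways a receiving program might treat unknown tags."""
    REPORT = "report"   # tell the user the data cannot be read
    NOTE = "note"       # place the unidentified data in a note field
    CUSTOM = "custom"   # create a storage location the user must identify


def import_items(items, supported, policy):
    """Sort (tag, value) pairs according to the chosen policy.

    Returns four containers: recognized fields, notes, user-defined
    fields, and an exception list.
    """
    fields, notes, custom, exceptions = {}, [], {}, []
    for tag, value in items:
        if tag in supported:
            fields[tag] = value
        elif policy is Policy.REPORT:
            exceptions.append(f"cannot read {tag}")
        elif policy is Policy.NOTE:
            notes.append(f"{tag}: {value}")
        else:
            custom[tag] = value
    return fields, notes, custom, exceptions


# Invented data: the receiving program is assumed to know only NAME.
incoming = [("NAME", "John /Smith/"), ("OCCU", "Farmer")]
for policy in Policy:
    print(policy.name, import_items(incoming, {"NAME"}, policy))
```

In all three cases the occupation arrives somewhere other than an occupation field, which is why the user sees the transfer as incomplete.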

The GENTECH organization, recognizing the problem, initiated the GEDCOM Testbook Project a number of years ago. A story was developed that encompassed many of the situations the average genealogist would encounter in the course of a search. Volunteers then extracted the basic information from the story and recorded it in a number of popular genealogical programs. Once the data was entered, a series of reports was generated from each program. A GEDCOM file was then created and imported into the other programs participating in the test. The same reports were generated there and compared to the originals, and the differences were noted. These tests confirmed that problems existed and indicated their extent.

In the summer of 2000, a new approach was initiated with the intention of defining as precisely as possible the information being lost and the reasons for this loss. To accomplish this, it was first necessary to create a standard GEDCOM file that would serve as the basis for the necessary comparisons. Because most of the GEDCOM difficulties lie in how developers interpret the standard, it was decided to use a grammar file prepared by Jed Allen of the GEDCOM Coordinator’s Office for use with his GEDCHK program. A new story line was adapted to this grammar. The story was based on existing research, but events were added to make use of all but four of the legitimate GEDCOM tags. Name changes, multiple marriages, and an adoption were included. After some tweaking, the control file passed the GEDCHK test.

The Test
Four programs were selected for the initial test. Family Tree Maker 7.5 and Generations Grande Suite 8 represented the best-selling programs. Ultimate Family Tree 3 and The Master Genealogist 4 were selected to represent the more advanced programs. This turned out to be a serendipitous choice when Genealogy.com announced that UFT was to be discontinued. Its many users would be faced with a monumental task should they decide to transfer their data to another program.

Volunteers were recruited to enter the story line information into the four programs as well as into a number of others. To ensure that data entry was consistent, the volunteers were given a list of items to be entered. Any data they were unable to record, they were instructed to place in notes with an appropriate comment. Source information was kept to a basic level that was well within the transfer capabilities of GEDCOM.

Once a volunteer completed data entry in their assigned program, a GEDCOM file was created and forwarded to the project leader for evaluation. Evaluation was straightforward but time consuming. In order to obtain consistency in the evaluation process, the project leader, who maintained copies of each program used in the test, prepared all evaluations. If necessary, the volunteer’s original database could be loaded to answer any questions regarding data entry.

The first step in the process was to check each of the test GEDCOMs using the GEDCHK program. This identified all tag exceptions, syntax errors, cross-reference errors, and extensions to the grammar. The control file was then imported into the program to be tested. In most programs, the import produced an exception list of the items the program failed to recognize. These lists were often incorrect: they reported self-generated faults as errors, and they failed to identify the tags the program did not support. The full extent of the data transfer was then determined by comparing the data actually transferred with the original data contained in the control file.
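
The kind of check described above, finding tags that are out of context and tags that are extensions to the grammar, can be sketched in a few lines of Python. The GRAMMAR table below is a tiny invented subset, not Jed Allen's grammar file, and check_context is a hypothetical function that does not reproduce GEDCHK. The sketch assumes levels never increase by more than one from line to line, and it treats any tag beginning with an underscore as a developer-supplied extension.

```python
# Invented miniature grammar: parent tag -> tags allowed directly beneath it.
# None stands for the top (record) level.
GRAMMAR = {
    None: {"HEAD", "INDI", "FAM", "TRLR"},
    "INDI": {"NAME", "BIRT", "DEAT", "NOTE"},
    "BIRT": {"DATE", "PLAC", "NOTE"},
    "DEAT": {"DATE", "PLAC", "NOTE"},
}


def check_context(lines):
    """Return (line number, tag, problem) for each questionable line."""
    problems = []
    stack = []  # stack[i] holds the tag currently open at level i
    for number, raw in enumerate(lines, start=1):
        parts = raw.split()
        level = int(parts[0])
        # Skip a leading cross-reference ID such as @I1@ if present.
        has_xref = parts[1].startswith("@") and len(parts) > 2
        tag = parts[2] if has_xref else parts[1]
        del stack[level:]                 # close deeper or sibling levels
        parent = stack[-1] if stack else None
        if tag.startswith("_"):
            problems.append((number, tag, "extension"))
        elif tag not in GRAMMAR.get(parent, set()):
            problems.append((number, tag, "out of context"))
        stack.append(tag)
    return problems


# Invented test input: a DATE placed directly under INDI, and an extension.
test_file = [
    "0 @I1@ INDI",
    "1 BIRT",
    "2 DATE 1 JAN 1900",
    "1 DATE 2 FEB 1950",
    "1 _NICK Jack",
]
print(check_context(test_file))
```

The same DATE tag is accepted on line 3 and rejected on line 4. The tag is legitimate, but only in the right place.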

The GEDCOM files from each of the other programs in the test were then imported into the program being evaluated. Again the exception lists, if available, were checked for accuracy. As with the control file, the imported data was checked item by item for both accuracy and location. When discrepancies were found, the GEDCOM file being imported was checked to verify that the data was included and to identify any possible problems due to the grammatical structure.
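
An item-by-item check for both accuracy and location might be pictured as follows. In the project this comparison was prepared by the project leader; the Python sketch below only illustrates the idea, and the path notation, the compare function, and the sample data are all assumptions.

```python
def compare(control, imported):
    """Classify each control item by what happened to it on import.

    Both arguments map a location (here a made-up path such as
    'I1/BIRT/DATE') to the value stored there.
    """
    report = {"ok": [], "changed": [], "moved": [], "lost": []}
    for path, value in control.items():
        if imported.get(path) == value:
            report["ok"].append(path)        # right value, right place
        elif path in imported:
            report["changed"].append(path)   # right place, altered value
        elif value in imported.values():
            report["moved"].append(path)     # value survived elsewhere
        else:
            report["lost"].append(path)      # no trace of the data
    return report


# Invented data for illustration only.
control = {
    "I1/BIRT/DATE": "1 JAN 1900",
    "I1/BIRT/PLAC": "Calgary",
    "I1/OCCU": "Farmer",
    "I1/NAME": "John /Smith/",
}
imported = {
    "I1/BIRT/DATE": "1 JAN 1900",
    "I1/BIRT/PLAC": "Calgary, Alberta",
    "I1/NOTE": "Farmer",
}
print(compare(control, imported))
```

Matching on exact values is a simplification. Data that has been both moved and reworded would be counted as lost here and would need a manual check.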

The Results
Initial test results indicate that the principal causes of poor GEDCOM transfers are tags out of context and developer-supplied tag extensions. Developers not using the most current version of the GEDCOM standard might explain some of these problems. Other errors are caused by the failure of program developers to interpret the standard as intended by its authors. Some developers, following a draft version of an earlier GEDCOM specification, have created a series of non-standard tags. Some of these tags are used in situations beyond the user’s control, but many others are supplied for use with various events. Unaware that these tags are not supported, the researcher will use them only to discover later that the information is invariably lost in a GEDCOM transfer.

In evaluating the initial four programs, the most commonly found fault leading to data loss was the failure to read the NOTE tag at all the possible levels at which it may appear. The ADDRess substructure as defined in the specifications was rarely properly interpreted. A source substructure intended for those situations not using the author/publisher tags fell into the same category. The adoption information transferred in all cases, but the correct linkages to the adoptive and natural parents were not made. Other problems occurring sporadically were instances where the DATE and PLACe tags were out of context and the QUAlitY tag was found at the wrong levels. Preliminary examination of the six other programs currently awaiting a full evaluation indicates similar problems exist. The initial reports can be found at the GENTECH Web site.
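
The NOTE problem is easy to picture with a small example. The Python sketch below contrasts a reader that looks for NOTE at every level with one that looks only at level 1. The sample lines and the notes_at_levels function are invented for illustration and do not represent any of the programs tested.

```python
# Invented sample: notes attached to an individual, an event, and a source.
SAMPLE = [
    "0 @I1@ INDI",
    "1 NOTE Note on the individual",
    "1 BIRT",
    "2 NOTE Note on the birth event",
    "2 SOUR @S1@",
    "3 NOTE Note on the source citation",
]


def notes_at_levels(lines, levels=None):
    """Collect (level, text) for NOTE lines.

    levels=None reads notes wherever they occur; otherwise only the
    listed levels are read, as a limited importer might do.
    """
    found = []
    for raw in lines:
        level_text, tag, *rest = raw.split(" ", 2)
        level = int(level_text)
        if tag == "NOTE" and (levels is None or level in levels):
            found.append((level, rest[0] if rest else ""))
    return found


all_notes = notes_at_levels(SAMPLE)
level_one_only = notes_at_levels(SAMPLE, levels={1})
print(len(all_notes), "notes in the file")
print(len(level_one_only), "notes read by a level-1-only importer")
```

Under these assumptions, a program that reads notes only at the individual level silently drops the notes attached to the event and to the source.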

Another interesting observation is that users of the same program, working from identical data, can create GEDCOM files that differ. This is because the user decides how the data is entered. Also noted was the fact that the more comprehensive programs did not place all of their recorded information in their GEDCOM files. This is often deliberate. Some information must be transferred in context in order to maintain its value, and simply dumping it into a note field could destroy that value. In the case of Ultimate Family Tree, the GEDCOM procedure is so inadequate that the user wishing to transfer his or her data as intact as possible has no choice but to use The Master Genealogist and its GenBridge feature.

At GENTECH 2001, Randy Bryson of The Church of Jesus Christ of Latter-day Saints announced a new version of GEDCOM. As anticipated, this new version will use XML coding and the Unicode character set, thus supporting the use of 90 percent of the world’s languages, including Arabic, Japanese, Chinese, Korean, and those languages using the Cyrillic alphabet. It is not expected that this new standard will solve all the problems associated with the transfer of genealogical data. GENTECH’s GEDCOM Testbook Project will continue to monitor data transfers as the developers incorporate this new standard.

Bill Mumford is the project leader of the GEDCOM Testbook Project and a director of GENTECH. He is a contributing editor (software) for the NGS Newsmagazine and past chair of The Alberta Family Histories Society Computer Group. He is the developer of the Genealogical Software Report Card, <www.mumford.ab.ca/reportcard/>, and can be reached at mumford@mumford.ca.
