Ancestry Daily News
1/23/2003 - Archive

"RootsWorks: OCR"
A picture may be worth a thousand words, but you can't sort it by zip code. If the picture is a scanned list of addresses, you might prefer the information in a text or database format, so that you can use it to send out those blind inquiry letters.

A page of text, scanned at 200 dots per inch for an 8-1/2 by 11-inch document, is about 470,000 bytes in black and white (one bit per pixel) and about 3.7 million bytes in 8-bit grayscale, before compression. The text might only contain 5,000 text characters, which are about 5,000 bytes. Creating a text version might allow you to get the information from the Web in a small fraction of the time, and to take less storage space besides.
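
The arithmetic behind sizes like these is easy to check. The short Python sketch below is an illustration only; the function name and the bits-per-pixel values are assumptions made for the example, not figures from any scanner's documentation. It multiplies out the pixels on a letter-size page at 200 dots per inch and converts the total to bytes.

```python
def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of a scanned page (hypothetical helper)."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8

# Letter-size page at 200 dpi.
black_and_white = raw_scan_bytes(8.5, 11, 200, 1)  # 1 bit per pixel
grayscale = raw_scan_bytes(8.5, 11, 200, 8)        # 8 bits per pixel (assumed)
text_bytes = 5000                                  # about one byte per character

print(black_and_white, grayscale)
print(black_and_white // text_bytes)  # how many times larger the image is
```

At one bit per pixel the page comes to 467,500 bytes; at eight bits per pixel it is eight times that. Either way, 5,000 bytes of text is tiny by comparison.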

Today, we'll talk about the process of converting images to text. It's known as "optical character recognition" or OCR for short.

What Is It?
"Under normal circumstances, a scanned image can't be changed once it's in electronic form. The scanner reads a page as one whole picture, not as individual letters, words, and paragraphs. OCR software, true to its name, recognizes each individual character of scanned text, then transfers them into an editable environment such as a word processing or spreadsheet program."
- Mie-Yun Lee, BuyerZone.com, 31 August 1999

OCR is a computer reading. I used to worry that my little brother would be the first guy who could never learn how to read. I was relieved when he got through the first grade like a normal kid. (They named an elementary school after his teacher.) After that, seeing a computer learn to read doesn't surprise me.

OCR programs read images and output text in formats that include word processing documents, spreadsheets, and databases. I use OmniPage Pro 10, but there are others. They fall into two groups - the cheap ones and the good ones. The cheap ones come free with scanners, and cost $150 or less stand-alone. The good ones can cost $300 or more.

Until relatively recently, you were lucky to get straight text converted accurately. Today's OCR programs not only read at accuracy levels above 99%, but they also preserve the fonts, images, and page layout. They can save straight to Web pages. They read newspaper, magazine, dot-matrix, laser print, you name it.

They don't read handwriting very well. I shouldn't make a blanket statement like that. If you have handwriting that looks just like a laser printer, it'll OCR right up.

A scanned copy of a typed court record or will, such as those found in many 20th-century cases, can't be reformatted unless you first convert it to text. This is your basic b-flat transcription of typewritten records, only the transcriber uses a computer to save time.

How Does it Work?
The process of converting a paper document to a text file or Web page has three steps: scanning, recognition, and output. You can automate the whole process into a single click. These programs are so clever that if you have an automatic document feeder on your scanner, and you indicate that the documents are double-sided, you will be prompted to turn the pages over after the fronts are scanned. The program will then render the output in the original page order.
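
The page-reordering trick for double-sided originals can be sketched in a few lines of Python. This is only an illustration of the idea, not how OmniPage or any other product does it. It assumes the fronts come off the feeder in order (pages 1, 3, 5, and so on) and that flipping the whole stack feeds the backs in reverse, last back first; the function name is hypothetical.

```python
def restore_page_order(fronts, backs_as_scanned):
    """Interleave front and back scans into original page order.

    Assumes the stack was flipped as a whole, so the backs arrive
    in reverse order (the back of the last sheet is scanned first).
    """
    if len(fronts) != len(backs_as_scanned):
        raise ValueError("need one back for every front")
    backs = list(reversed(backs_as_scanned))
    ordered = []
    for front, back in zip(fronts, backs):
        ordered.append(front)
        ordered.append(back)
    return ordered

# Three double-sided sheets: fronts are pages 1, 3, 5; backs arrive 6, 4, 2.
print(restore_page_order(["p1", "p3", "p5"], ["p6", "p4", "p2"]))
```

If your feeder delivers the backs in the same order as the fronts, drop the reversal step.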

If you're going to work with the text, to change it in any way, don't try to preserve the formatting. The program puts the text into text boxes in Word - and taking the text out of the box so you can paste it into a paragraph is time-consuming. There is a joke about thinking outside the box here somewhere, but I've misplaced it.

Several applications for genealogists come to mind. Indexing a list, or an article, or a magazine, or a book is tedious. If you've converted it to text, indexing is much easier. A tax list can be OCR'd and then searched.
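
Once a list is text, searching and indexing it takes very little code. The Python sketch below is a minimal illustration, with made-up sample lines standing in for an OCR'd tax list; the function name and the data are hypothetical. It builds an index from each capitalized word to the line numbers where it appears.

```python
import re
from collections import defaultdict

def build_index(text):
    """Map each capitalized word to the 1-based line numbers it appears on."""
    index = defaultdict(list)
    for line_no, line in enumerate(text.splitlines(), start=1):
        for word in re.findall(r"\b[A-Z][a-z]+\b", line):
            if line_no not in index[word]:
                index[word].append(line_no)
    return dict(index)

# Invented sample lines, standing in for OCR output.
sample = """Adams, John  2 horses  40 acres
Baker, Mary  1 horse  15 acres
Adams, Samuel  3 cattle  60 acres"""

index = build_index(sample)
for name in sorted(index):
    print(name, index[name])
```

A misread name ends up as its own index entry, so it pays to proofread the text before you index it.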

Name Two of Them
In the October 2002 PC Labs review, OmniPage and FineReader were the two programs reviewed. OmniPage Pro 12 Office is almost $600, and will scan, recognize, file, and read the document to you. If you don't need all the automation, there is a SOHO (small office/home office) version, OmniPage Pro 12, available for just under $150. FineReader is $300. Both companies offer competitive upgrades for much less than the full package prices.

When you consider buying a new scanner, check to see what software is included. You may be able to save a lot of money by purchasing a scanner with a "Lite" version of one of these programs, and be able to upgrade to the full version for much less.

Also, if you use Microsoft Office XP, you can upgrade to OmniPage Pro 12 Office for only $150.

What's the Down Side?
1) It's not perfect. The best programs miss between 2 and 10 characters out of 1,000, depending on the type and quality of the source. While that sounds low, on a page of 5,000 characters even the best rate works out to 10 errors on each laser-printed page. That means you have to check your work - you can't assume it's right. Treat it as if you had sent it out to an expensive typing service, one where they drink coffee with hazelnut creamer, and check what comes back carefully.

2) It's expensive. Three hundred dollars is a lot of money.

3) It takes some real horsepower. Next to full-motion video and CD-quality audio, it's about the toughest thing a computer does. Well, I might have omitted gaming, which is more complicated than rocketry. OmniPage Pro 12 requires 150 megabytes of disk space and a Pentium CPU.
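
To put the first point in numbers, here is a small Python sketch. It assumes a page of about 5,000 characters, the figure used earlier in this article, and treats the error rate as a simple average; the function name is hypothetical.

```python
def expected_errors(chars_per_page, errors_per_thousand):
    """Average number of misread characters on one page."""
    return chars_per_page * errors_per_thousand / 1000

# The range quoted in point 1: 2 to 10 misses per 1,000 characters.
for rate in (2, 10):
    print(rate, "per 1,000 ->", expected_errors(5000, rate), "errors per page")
```

That is somewhere between 10 and 50 corrections on every page, which is why proofreading is not optional.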

Link Me Up

PC Mag - OCR: The Best Yet
15 Oct 2002

A report of one user's tests.
An article from India dated Sept 2002.

Supporting material on the RootsWorks site.

What Else?
In a 1998 Usenet discussion, one user wrote, "I cannot imagine (for genealogy) why one would need to manipulate the data from a Marriage License, Birth or Death Certificate, Newspaper articles, Degrees & Diplomas, Obits or even Xeroxed copies from books . . . This is research data that should be left as we find it. We should never change dates, names of misspellings on any of the research data." It's important to point out that the purpose of transcribing documents is not to change them, but to convert them to a different format for different purposes - such as communication, compilation, and indexing.

If you are responsible for society membership records, or for a membership drive, you might find yourself needing to convert paper address lists into databases. In years past, I have OCR'd some address lists of the attendees at genealogical conferences. I learned that it's a lot easier if I send the output from the OCR program to a spreadsheet before I put it into a database. Place names such as streets, towns, and states aren't entered consistently in each case, and it's very easy to sort, copy, and paste the spellings that I prefer if I use a spreadsheet. After the data is in a standard form, it's easier to work with in a database.
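
The same clean-up can be sketched in Python for readers who would rather script it than use a spreadsheet. This is an illustration only: the sample rows, the column layout (name, street, town, state, zip), and the table of preferred spellings are all invented for the example.

```python
import csv
import io

# Preferred spellings (hypothetical); keys are lowercased variants.
PREFERRED = {
    "st.": "Street", "st": "Street", "str": "Street",
    "ave": "Avenue", "ave.": "Avenue",
    "calif.": "CA", "california": "CA",
}

def standardize(field):
    """Replace each word that has a preferred spelling; leave the rest alone."""
    return " ".join(PREFERRED.get(word.lower(), word) for word in field.split())

def clean_rows(csv_text):
    """Read name,street,town,state,zip rows, standardize them, sort by zip."""
    rows = []
    for name, street, town, state, zip_code in csv.reader(io.StringIO(csv_text)):
        rows.append([name.strip(), standardize(street), town.strip(),
                     standardize(state), zip_code.strip()])
    return sorted(rows, key=lambda row: row[4])

# Invented rows standing in for OCR output.
sample = """Mary Baker,12 Elm St.,Salem,Calif.,95002
John Adams,4 Oak Ave,Dover,California,90210
"""

for row in clean_rows(sample):
    print(row)
```

The zip codes are kept as text, not numbers, so that codes beginning with zero survive the sort.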

RootsWorks Forums:
Do you have questions about using technology for genealogy? Or are you a techno-guru who would like to share your expertise with others? Check out the RootsWorks Discussion forum.



Editor's Note:
The newspaper pages in the Ancestry.com Historical Newspapers Collection were scanned and then converted with OCR software. View a free sample issue by clicking on "Historical Newspaper Archives", then "View a Free Sample Issue" from the homepage at Ancestry.com. Or, subscribe to the Newspaper collection:
www.ancestry.com/rd/redir.asp?sourceid=2116&targetid=3505


The RootsWorks series of articles focuses on genealogical applications for generic technologies. Visit the RootsWorks website at www.rootsworks.com to discuss this or previous articles, and your genealogical computing experiences.

Copyright 2003, MyFamily.com Inc.

