Ancestry Daily News
1/23/2003 - Archive
"RootsWorks: OCR"
A picture may be worth a thousand words, but you can't sort it by zip code. If the picture is a scanned list of addresses, you might prefer the information
in a text or database format, so that you can use it to send out those blind
inquiry letters.
A page of text, scanned at 200 dots per inch for an 8-1/2 by 11-inch document,
in black and white (one bit per pixel), is about 470,000 bytes before compression;
in 8-bit grayscale it is closer to 3.7 million. The text might only
contain 5,000 characters, which take about 5,000 bytes. Creating a text
version might allow you to get the information from the Web in a small fraction
of the time, and to take less storage space besides.
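If you want to check the arithmetic yourself, here is a rough sketch in Python. It assumes an uncompressed image, at one bit per pixel for black and white and eight bits per pixel for grayscale. Those bit depths are assumptions, and real scanners and file formats add their own overhead.

```python
# Rough size of an uncompressed scan of a letter-size page at 200 dpi.
WIDTH_IN, HEIGHT_IN, DPI = 8.5, 11, 200

def scan_bytes(bits_per_pixel):
    """Uncompressed image size in bytes for the page above."""
    pixels = int(WIDTH_IN * DPI) * int(HEIGHT_IN * DPI)
    return pixels * bits_per_pixel // 8

text_bytes = 5_000  # about one byte per character of plain text

# Bit depths below are assumptions, not figures from any particular scanner.
for label, bits in [("black and white", 1), ("8-bit grayscale", 8)]:
    size = scan_bytes(bits)
    print(f"{label}: {size:,} bytes, about {size // text_bytes}x the text")
```

Under those assumptions, black and white comes to 467,500 bytes and 8-bit grayscale to 3,740,000, against roughly 5,000 bytes for the text itself.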
Today, we'll talk about the process of converting images to text. It's known
as "optical character recognition" or OCR for short.
What is it?
"Under normal circumstances, a scanned image can't be changed once it's
in electronic form. The scanner reads a page as one whole picture, not as individual
letters, words, and paragraphs. OCR software, true to its name, recognizes each
individual character of scanned text, then transfers them into an editable environment
such as a word processing or spreadsheet program."
- Mie-Yun Lee, BuyerZone.com, 31 August 1999
OCR is a computer reading. I used to worry that my little brother would be the
first guy who could never learn how to read. I was relieved when he got through
the first grade like a normal kid. (They named an elementary school after his
teacher.) After that, seeing a computer learn to read doesn't surprise me.
OCR programs read images and output text in formats that include word processing
documents, spreadsheets, and databases. I use OmniPage Pro 10, but there are others. They
fall into two groups - the cheap ones and the good ones. The cheap ones come
free with scanners, and cost $150 or less stand-alone. The good ones can cost
$300 or more.
Until relatively recently, you were lucky to get straight text converted accurately.
Today's OCR programs not only read at accuracy levels above 99%, but they also
preserve the fonts, images, and page layout. They can save straight to Web pages.
They read newspaper, magazine, dot-matrix, laser print, you name it.
They don't read handwriting very well. I shouldn't make a blanket statement
like that. If you have handwriting that looks just like a laser printer, it'll
OCR right up.
A scanned copy of a typed court record or will, such as are found in many 20th
century cases, can't be reformatted unless you first convert it to text. This
is your basic b-flat transcription of typewritten records, only a computer is
used by the transcriber to save time.
How Does it Work?
The process of converting a paper document to a text file or Web page has three
steps: scanning, recognition, and output. You can automate the whole process
into a single click. These programs are so clever that if you have an automatic
document feeder on your scanner, and you indicate that the documents are double-sided,
you will be prompted to turn the pages over after the fronts are scanned. The
program will then render the output in the original page order.
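The double-sided trick is easy to picture in code. This is only a sketch of the idea, not how OmniPage or any other program actually does it. It assumes the whole stack is flipped over before the second pass, so the backs arrive last sheet first. The function name and page labels are made up for illustration.

```python
def restore_page_order(fronts, backs_as_scanned):
    """Interleave front and back scans into original page order.

    Assumes the stack was flipped as a whole before the second pass,
    so the backs arrive last sheet first.
    """
    if len(fronts) != len(backs_as_scanned):
        raise ValueError("each sheet needs one front and one back")
    backs = list(reversed(backs_as_scanned))
    pages = []
    for front, back in zip(fronts, backs):
        pages.extend([front, back])
    return pages

# Three double-sided sheets: fronts first, then the flipped stack.
fronts = ["p1", "p3", "p5"]
backs_as_scanned = ["p6", "p4", "p2"]
print(restore_page_order(fronts, backs_as_scanned))
```

If your feeder hands the backs over in the same order as the fronts, drop the `reversed` call.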
If you're going to work with the text, to change it in any way, don't try to
preserve the formatting. Preserved formatting comes out as text boxes in Word, and
taking the text out of each box so you can paste it into a paragraph is time-consuming.
There is a joke about thinking outside the box here somewhere, but I've misplaced it.
Several applications for genealogists come to mind. Reformatting a typed
court record or will, as mentioned above, is one. Indexing is another: indexing
a list, or an article, or a magazine, or a book is tedious, but if you've
converted it to text, indexing is much easier. A tax list can be OCR'd and
then searched.
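Once a list is plain text, a few lines of Python will index it. The sketch below builds a simple word index over some made-up lines standing in for OCR output. The sample names and the `build_index` helper are invented for illustration, and a real index would also have to cope with OCR misreadings.

```python
from collections import defaultdict

def build_index(lines):
    """Map each lowercase word to the line numbers where it appears."""
    index = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        # set() keeps a word from being listed twice for the same line
        for word in set(line.lower().replace(",", " ").split()):
            index[word].append(number)
    return index

# Made-up lines standing in for OCR output from a tax list.
tax_list = [
    "Smith, John  2 horses  1 cow",
    "Jones, Mary  1 horse",
    "Smith, Peter  3 cows",
]
index = build_index(tax_list)
print(index["smith"])
```

Looking up "smith" returns lines 1 and 3, which is all a basic surname index needs to do.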
Name Two of Them
In the October 2002 PC Labs review, OmniPage and FineReader were the two programs
reviewed. OmniPage Pro 12 Office is almost $600, and will scan, recognize, file,
and read the document to you. If you don't need all the automation, there is
a SOHO (small office/home office) version, OmniPage Pro 12, available for just
under $150. FineReader is $300. Both companies offer competitive upgrades for
much less than the full package prices.
When you consider buying a new scanner, check to see what software is included.
You may be able to save a lot of money by purchasing a scanner with a "Lite"
version of one of these programs, and be able to upgrade to the full version
for much less.
Also, if you use Microsoft Office XP, you can upgrade to OmniPage Pro 12 Office
for only $150.
What's the Down Side?
1) It's not perfect. The best programs miss between 2 and 10 characters out
of 1,000, depending on the type and quality of the source. While that sounds
low, even the low end works out to about 10 errors on a laser-printed page of
5,000 characters. That means you have to check
your work - you can't assume it's right. Treat it as if you had sent it out
to an expensive typing service, one where they drink coffee with hazelnut creamer,
and check what comes back carefully.
2) It's expensive. Three hundred dollars is a lot of money.
3) It takes some real horsepower. Next to full-motion video and CD-quality
audio, it's about the toughest thing a computer does. Well, I might have omitted
gaming, which is more complicated than rocketry. OmniPage Pro 12 requires 150
megabytes of disk space and a Pentium CPU.
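To put the error rate in the first item in perspective, here is the arithmetic as a short Python sketch. It assumes a page of about 5,000 characters, the figure used near the top of this article. Your pages may be denser or sparser.

```python
CHARS_PER_PAGE = 5_000  # assumed page size, from earlier in the article

def expected_errors(errors_per_thousand, chars=CHARS_PER_PAGE):
    """Expected number of misread characters on one page."""
    return errors_per_thousand * chars / 1_000

for rate in (2, 10):
    print(f"{rate} per 1,000 -> about {expected_errors(rate):.0f} errors per page")
```

That is 10 errors a page at the good end and 50 at the bad end, which is why the proofreading step isn't optional.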
Link Me Up
PC Mag - OCR: The Best Yet, 15 Oct 2002
A report of one user's tests.
An article from India dated Sept 2002.
Supporting material on the RootsWorks site.
What Else?
In a 1998 Usenet discussion, one user wrote, "I cannot imagine (for genealogy)
why one would need to manipulate the data from a Marriage License, Birth or
Death Certificate, Newspaper articles, Degrees & Diplomas, Obits or even
Xeroxed copies from books . . . This is research data that should be left as
we find it. We should never change dates, names of misspellings on any of the
research data." It's important to point out that the purpose of transcribing
documents is not to change them, but to convert them to a different format for
different purposes - such as communication, compilation, and indexing.
If you are responsible for society membership records, or for a membership drive,
you might find yourself needing to convert paper address lists into databases.
In years past, I have OCR'd some address lists of the attendees at genealogical
conferences. I learned that it's a lot easier if I send the output from the
OCR program to a spreadsheet before I put it into a database. Place names such
as streets, towns, and states aren't entered consistently in each case, and
it's very easy to sort, copy, and paste the spellings that I prefer if I use
a spreadsheet. After the data is in a standard form, it's easier to work with
in a database.
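The same cleanup can be scripted if the list is long. This is a minimal sketch in Python, assuming the OCR output has been saved as comma-separated text. The column names, the sample rows, and the table of preferred spellings are all made up for illustration, so build your own table from whatever variants turn up in your list.

```python
import csv
import io

# Hypothetical preferred spellings; extend this from your own sorted list.
PREFERRED = {
    "st": "Street", "st.": "Street", "street": "Street",
    "ave": "Avenue", "ave.": "Avenue",
    "calif.": "CA", "california": "CA",
}

def normalize(field):
    """Replace each word that has a preferred spelling; leave the rest alone."""
    return " ".join(PREFERRED.get(word.lower(), word) for word in field.split())

# Made-up rows standing in for an OCR'd address list.
raw = (
    "name,street,state\n"
    "Ann Lee,12 Oak St.,Calif.\n"
    "Bob Ray,40 Elm street,California\n"
)
rows = list(csv.DictReader(io.StringIO(raw)))
for row in rows:
    row["street"] = normalize(row["street"])
    row["state"] = normalize(row["state"])
    print(row["name"], "|", row["street"], "|", row["state"])
```

A lookup table like this only catches the variants you have already seen, so sorting the column and eyeballing it, as in the spreadsheet approach, is still worth doing first.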
RootsWorks Forums:
Do you have questions about using technology for genealogy? Or are you a techno-guru
who would like to share your expertise with others? Check out the RootsWorks
Discussion forum.
Editor's Note:
The newspaper pages in the Ancestry.com Historical Newspapers Collection
were scanned with OCR software. View a free sample issue by clicking on "Historical
Newspaper Archives", then "View a Free Sample Issue" from the
homepage at Ancestry.com. Or, subscribe to the Newspaper collection:
www.ancestry.com/rd/redir.asp?sourceid=2116&targetid=3505
The RootsWorks series of articles focuses on genealogical applications for
generic technologies. Visit the RootsWorks website at www.rootsworks.com
to discuss this or previous articles, and your genealogical computing experiences.
Copyright 2003, MyFamily.com Inc.