|
OCR (optical character recognition) is the process of converting document
images to text that can be manipulated in an editor or word processor. The
image is typically created with a scanner. Once the document exists as a
computer image file, the OCR program analyzes the image and extracts the
text. The resulting text file takes up only a small fraction of the disk
space the image would, and can then be loaded into the word processor of
your choice for editing or inclusion in other documents.
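
To give a rough feel for the storage difference, here is a back-of-the-envelope
sketch. The page size, scan resolution, and characters-per-page figure are all
assumptions chosen for illustration, and the image size is for an uncompressed
scan; real image files are usually compressed, so the actual ratio will be
smaller.

```python
def raw_image_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Size of an uncompressed scan in bytes (no file header, no compression)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumed values: US letter page, 300 dpi, black-and-white (1 bit per pixel).
image_bytes = raw_image_bytes(8.5, 11, 300, 1)

# Assumed value: about 3,000 characters on a full page, one byte each.
text_bytes = 3000

print(f"image: {image_bytes} bytes, text: {text_bytes} bytes")
print(f"the raw image is roughly {image_bytes // text_bytes} times larger")
```
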
Accuracy of OCR conversion varies widely. With good-quality text pages that
use only standard fonts, it can be around 99%. For a poor-quality page (a fax
or a poor photocopy), it can be much lower. A page that mixes many fonts in
different sizes will also reduce accuracy.
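
One common way to put a number on accuracy is to compare the OCR output with
a hand-checked copy of the same text and count the character edits needed to
turn one into the other. The sketch below assumes that definition (character
accuracy based on edit distance); an OCR vendor may measure it differently.
Note that even 99% leaves about 30 errors on a page of 3,000 characters (an
assumed page length).

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(reference, ocr_output):
    """Fraction of reference characters recognized correctly (0.0 to 1.0)."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


# A classic OCR confusion: "rn" read as "m".
print(character_accuracy("modern", "modem"))
```
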
A professional-level OCR program will mark every character about which it is
not certain. An operator can then bring the document up on the screen and
step automatically from mark to mark, verifying or correcting as needed.
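
The mark-and-review workflow can be sketched as follows. This is only an
illustration: the `RecognizedChar` record, the 0.0 to 1.0 confidence scale,
and the 0.80 threshold are hypothetical and do not describe any particular
OCR product.

```python
from dataclasses import dataclass


@dataclass
class RecognizedChar:
    char: str
    confidence: float  # hypothetical scale: 0.0 (a guess) to 1.0 (certain)


def uncertain_positions(chars, threshold=0.80):
    """Positions the operator should visit, in reading order."""
    return [i for i, c in enumerate(chars) if c.confidence < threshold]


def apply_corrections(chars, corrections):
    """Build the final text; corrections maps position -> replacement."""
    text = [c.char for c in chars]
    for position, replacement in corrections.items():
        text[position] = replacement
    return "".join(text)


# Made-up recognizer output for the word "hello" with a doubtful third letter.
page = [
    RecognizedChar("h", 0.99),
    RecognizedChar("e", 0.98),
    RecognizedChar("1", 0.45),
    RecognizedChar("l", 0.97),
    RecognizedChar("o", 0.96),
]

marks = uncertain_positions(page)
# The operator jumps from mark to mark; here the only fix is "1" to "l".
fixed = apply_corrections(page, {marks[0]: "l"})
print(marks, fixed)
```
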
There are specialty OCR programs that can read a printed mailing list and
enter the names and addresses into a database so that the list can be
processed by other programs.
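
As a minimal sketch of the step after recognition, the code below splits OCR
text into address records and loads them into a database. It assumes a very
simple layout (entries separated by blank lines, with the name first, the
street second, and the city line last), and the table and column names are
made up for the example. Real mailing lists are messier, and specialty
programs handle many more cases.

```python
import re
import sqlite3


def parse_address_blocks(ocr_text):
    """Split blank-line-separated entries into (name, street, locality)."""
    records = []
    for block in re.split(r"\n\s*\n", ocr_text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if len(lines) < 3:
            continue  # too short to be a full address; leave for review
        records.append((lines[0], lines[1], lines[-1]))
    return records


def load_into_database(records):
    """Store the records in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE contacts (name TEXT, street TEXT, locality TEXT)")
    conn.executemany("INSERT INTO contacts VALUES (?, ?, ?)", records)
    conn.commit()
    return conn


# Made-up sample entries standing in for recognized text.
sample = """Jane Example
12 Sample St
Springfield, IL 62701

John Placeholder
99 Test Ave
Portland, OR 97201
"""

conn = load_into_database(parse_address_blocks(sample))
for row in conn.execute("SELECT name, locality FROM contacts ORDER BY name"):
    print(row)
```
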
Some document imaging programs can highlight selected words or phrases and
OCR them for inclusion in a keyword index, which can then be searched to
retrieve documents.
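
A keyword index of this kind can be pictured as a table that maps each keyword
to the documents it was taken from. The sketch below is one simple way to
build it; the `KeywordIndex` class and the document identifiers are
hypothetical, not taken from any specific imaging product.

```python
from collections import defaultdict


class KeywordIndex:
    """Maps each keyword (case-insensitive) to the documents containing it."""

    def __init__(self):
        self._index = defaultdict(set)

    def add(self, doc_id, keywords):
        """Register the words or phrases that were OCR'd from one document."""
        for keyword in keywords:
            self._index[keyword.strip().lower()].add(doc_id)

    def search(self, keyword):
        """Return the matching document ids in sorted order."""
        return sorted(self._index.get(keyword.strip().lower(), set()))


index = KeywordIndex()
index.add("invoice-0001.tif", ["Invoice", "Acme Supply"])
index.add("contract-0007.tif", ["Contract", "Acme Supply"])

print(index.search("acme supply"))
```

Lowercasing the keywords is a design choice made here so that a search does
not depend on how the operator typed the term.
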
|