Best Practices for Optical Character Recognition

From AUCWiki


Scope

The Rare Books and Special Collections Library uses OmniPage Professional 18 to convert captured still images of typewritten or printed text into machine-encoded text.

Optical Character Recognition

Settings

The first time you use OmniPage Professional 18, you will need to adjust the settings. It's advisable to check the settings every time you plan to OCR a large number of documents, to be sure that you are using the settings below.

  1. Click the little button that looks like a gear in the tools menu. This opens the Options window.
  2. In the OCR tab, uncheck verify language choices. This will save headaches later.
  3. Select the Process tab.
    1. Uncheck Deskew.
    2. Under Despeckle select None.
    3. Under Page rotation select None.
  4. Click OK
  5. Open Tools -> Saving Preferences
    1. Under Text Converters, select PDF Searchable Image and click Options
    2. Under File Options, select Create one file per page.
    3. Under PDF Compatibility, choose Optimize for quality.
    4. Under Compression Methods, uncheck ALL boxes except Compress using JPEG 2000.
  6. Click OK.
  7. Under Image converters, select JP2 - JPEG 2000 Bitmap and click Options.
  8. Under "Compression", choose Lossless compression (Best quality).
  9. Click OK.
  10. Close Saving Preferences.
  11. Close and re-open OmniPage before you start OCRing to make sure that your new settings have been enabled.

OCR Process

Single or Multiple files

  1. Select Nuance OmniPage Professional 18 from All Programs.
  2. Click 1-2-3.
  3. Select file(s)
  4. When the OCR Proofreader dialog box appears, click Document Ready.
  5. If the save screen doesn't pop up on its own, click Save to File.
  6. Files of type: Save as a PDF Searchable Image. (True page)
  7. File options: Select one file per page
  8. Naming options: Use input file names

Multiple Folders

  1. Select Nuance OmniPage Professional 18 from All Programs.
  2. Click the 1-2-3 icon.
  3. Click Advanced >> in the lower right corner.
  4. Browse to the parent folder and select Add with Subfolders.
  5. Click Add All Images (a change won't appear on screen, but this step is needed.)
  6. Hit OK. At this point, the software will begin the OCR process.
  7. When the OCR Proofreader dialog box appears, click Document Ready.
  8. A Save as File dialog box will appear when the software is ready for you to save your work.
  9. Files of type: Save as a PDF Searchable Image. (True page)
  10. File options: Select one file per page
  11. Naming options: Use input file names with subfolders
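
After a multi-folder run it can be worth checking that every input image produced a page-level PDF. The sketch below is only an illustration of the "one file per page, input file names with subfolders" naming: it lists the images under a parent folder and the PDF path each one would be expected to map to. The function name, the list of image extensions, and the assumption that the output mirrors the input subfolders are all hypothetical; compare the result against what OmniPage actually wrote.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to the formats you actually scan to.
IMAGE_EXTENSIONS = {".tif", ".tiff", ".jpg", ".jpeg", ".jp2", ".png"}


def plan_outputs(parent, output_root):
    """Pair each image under `parent` with a per-page PDF path that keeps its subfolder."""
    parent = Path(parent)
    output_root = Path(output_root)
    plan = []
    # rglob walks the parent folder and every subfolder beneath it.
    for image in sorted(parent.rglob("*")):
        if image.is_file() and image.suffix.lower() in IMAGE_EXTENSIONS:
            relative = image.relative_to(parent)
            # Same subfolder and base name as the input, with a .pdf extension.
            plan.append((image, output_root / relative.with_suffix(".pdf")))
    return plan
```

Calling `plan_outputs` on the parent folder you added in step 4 gives a list you can loop over to spot any expected PDF that does not exist.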

Batch Processing Settings

  1. Load Files
    1. Turn Rotation to none
    2. Uncheck despeckle and deskew
    3. Check Keep original image resolution
  2. Recognize images
    1. Layout Automatic
    2. Optimize for Accuracy
    3. Uncheck verify language choices
    4. Check all in retain features
  3. Save
    1. Save as text
    2. Output
      1. Create one file per page
      2. PDF searchable image
      3. Using input file names w/ subfolders

Optimizing PDFs

Settings

  1. Go to Advanced -> PDF Optimizer (you must have a document open in order to access this window)
  2. Adjust the Make compatible with setting to Retain existing.
  3. In the Image Settings section:
    1. Adjust Bicubic Downsampling to 150 ppi for images above 225 ppi.
    2. Toggle Compression to JPEG
    3. Change the Quality to Maximum.
  4. In the Grayscale Images section:
    1. Adjust Bicubic Downsampling to 150 ppi for images above 225 ppi.
    2. Toggle Compression to JPEG
    3. Change the Quality to Maximum.
  5. In the Monochrome Images section:
    1. Adjust Bicubic Downsampling to 300 ppi for images above 450 ppi.
    2. Toggle Compression to CCITT Group 4
  6. Check Optimize images only if there is a reduction in size.
  7. Uncheck Transparency
  8. Adjust the Discard Objects settings:
    1. Check everything.
  9. Adjust Discard User Data settings:
    1. Check everything.
  10. Adjust Clean Up settings:
    1. Compress document structure.
    2. Check everything.
  11. Save your settings in a preset you will remember, e.g. PDF Optimizer.

You can use these settings for a batch job to optimize multiple files.
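
The downsampling rules above follow one pattern: an image is reduced to the target resolution only when its current resolution is above the threshold. As a rough sketch of that rule (not of how Acrobat implements it), the function below returns the resolution you would expect an image to have after optimizing. The names are hypothetical, and the "color" key assumes the Image Settings section applies to color images.

```python
# Thresholds copied from the settings above: (target ppi, applies above this ppi).
# The "color" key is an assumption about what the Image Settings section covers.
DOWNSAMPLE_RULES = {
    "color": (150, 225),
    "grayscale": (150, 225),
    "monochrome": (300, 450),
}


def resolution_after_optimizing(kind, ppi):
    """Return the expected resolution of an image of type `kind` after optimizing."""
    target, threshold = DOWNSAMPLE_RULES[kind]
    # Images at or below the threshold are left at their original resolution.
    return target if ppi > threshold else ppi
```

For example, a 300 ppi grayscale scan would be expected to come out at 150 ppi, while a 200 ppi grayscale scan would be left alone.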

Batch Optimizing in Acrobat

  1. Go to "Advanced" menu and select "Document Processing" then "Batch Processing"
  2. Choose new sequence and name it "Optimize" or something similar
  3. Item 1: Leave sequence of commands empty
  4. Item 2: Leave as "Ask when sequence is run." (Alternatively, you can assign an entire folder here. This is good for huge batches with subfolders.)
  5. Item 3: Select output location "Same folder as original."
  6. Click output options
    • Under File Naming, choose add original to base name, add -opt to insert after, and check don't overwrite existing files.
    • Under Output Format, uncheck fast web view.
    • Click PDF Optimizer, select your preset setting and hit ok twice.
  7. Finally, go back into the batch sequences window, select your sequence, and run it.
    • Hit ok on the first screen.
    • Select all files you want optimized, and that's it!
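
Because the sequence writes each optimized file next to its original with -opt added to the name, a large batch can be checked afterwards by looking for originals that have no -opt copy. The sketch below assumes the output is named exactly `name-opt.pdf` and that extensions are lowercase; both function names are hypothetical.

```python
from pathlib import Path


def optimized_name(original, suffix="-opt"):
    """Return the path the optimized copy is assumed to have (name-opt.pdf)."""
    original = Path(original)
    return original.with_name(original.stem + suffix + original.suffix)


def missing_optimized(folder, suffix="-opt"):
    """List PDFs under `folder` (including subfolders) with no optimized copy yet."""
    folder = Path(folder)
    missing = []
    for pdf in sorted(folder.rglob("*.pdf")):
        # Skip files that are themselves optimized copies.
        if pdf.stem.endswith(suffix):
            continue
        if not optimized_name(pdf, suffix).exists():
            missing.append(pdf)
    return missing
```

An empty list from `missing_optimized` suggests the batch covered every file; anything it returns can be selected and run through the sequence again.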