How can I perform optical character recognition (OCR) on my scanned document?

First, scan your document with a scanning program (e.g. XSane). Make sure that the image is high-contrast and free of specks, as specks will confuse the OCR program.

Xsane Settings
- XSane->Setup->Filetype
Check:
Reduce 16 bit image to 8 bit
Set:
TIFF zip compression rate to 1
TIFF 16 bit image compression to no compression
TIFF 8 bit image compression to no compression
TIFF lineart image compression to no compression

There are several ways to use Tesseract with XSane:

- Scan to *.tif, then run tesseract on the command line to OCR the file

tesseract inputimage.tif outputtext -l eng

- Scan to PDF, then use pdf2tif, then tesseract.

pdf2tif filename.pdf (creates tif images of each page)

- ocr.sh takes all PDF files in the current directory and converts them to text files
Get pdf2tif and ocr.sh from: http://www.groklaw.net/articlebasic.php?story=20061210115516438

ocr.sh filename-01.tif

- Configure XSane->Setup->OCR to use the tesseract wrapper script xsane2tess (it requires a tmp directory in the user's home directory; you can edit the script to point TEMP_DIR somewhere else)

OCR Command = xsane2tess -l eng
Inputfile option: -i
Outputfile option: -o
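
As a rough sketch of the scan-to-PDF route listed above, the following shell functions split a PDF into one TIFF per page and then run tesseract on each page. This is not the pdf2tif or ocr.sh script from the Groklaw article: Ghostscript (gs) is assumed here as a stand-in for pdf2tif, and the function names pdf_to_tifs and ocr_pdf, the 300 dpi resolution, and the tiffg3 output device are illustrative choices only.

```shell
#!/bin/sh
# Sketch only: Ghostscript stands in for pdf2tif, which is not reproduced here.

# Render every page of a PDF to numbered TIFF files (name-01.tif, name-02.tif, ...).
pdf_to_tifs() {
    pdf=$1
    base=${pdf%.pdf}
    # tiffg3 writes 1-bit G3-compressed TIFF; 300 dpi is an assumed resolution.
    gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffg3 -r300 \
       -sOutputFile="${base}-%02d.tif" "$pdf"
}

# Convert a PDF to page images, then OCR each page into name-NN.txt.
ocr_pdf() {
    pdf=$1
    lang=${2:-eng}   # language code for -l; eng is only this sketch's default
    pdf_to_tifs "$pdf" || return 1
    for tif in "${pdf%.pdf}"-*.tif; do
        [ -e "$tif" ] || continue          # skip if the glob matched nothing
        tesseract "$tif" "${tif%.tif}" -l "$lang"
    done
}
```

After sourcing these functions, ocr_pdf filename.pdf should leave filename-01.txt, filename-02.txt, and so on next to the page images.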

Tesseract with Russian

tesseract <file>.tif <output_file> -l rus

Tesseract appends .txt to <output_file>, so the recognized text ends up in <output_file>.txt.

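The naming rule is easy to get wrong in scripts, so here is a small sketch of a helper that takes an image and a language code, passes the output name without an extension, and prints the name of the text file tesseract is expected to write. The function name ocr_one and the eng default are assumptions made for illustration.

```shell
#!/bin/sh
# Run tesseract on one .tif and print the name of the resulting text file.
ocr_one() {
    image=$1
    lang=${2:-eng}           # language code for -l, e.g. rus; eng is an assumed default
    out_base=${image%.tif}   # output name without extension; tesseract adds .txt itself
    tesseract "$image" "$out_base" -l "$lang" || return 1
    printf '%s\n' "$out_base.txt"
}
```

For example, ocr_one page.tif rus runs tesseract page.tif page -l rus and reports page.txt.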
---------------------------------------------------------------------------------------

How to scan and OCR like a pro with open source tools
http://www.linux.com/feature/138511

NOTE: for tesseract to work, the TIFF file you run it on must have a name ending in .tif (not .tiff) AND must be an image without an alpha channel. If you've renamed the file and tesseract still fails, the alpha channel is probably the problem. Re-save the image with a conversion utility that can remove alpha channels. For bulk image conversion I recommend ImageMagick (it's free software and runs well on the Mac).
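
The two fixes in the note above (a .tif extension and no alpha channel) can be sketched with ImageMagick roughly as follows. The function names tif_name and prepare_tif are hypothetical. The convert command with -alpha off is ImageMagick's legacy command-line form (newer releases also provide a magick command), so it is worth trying on a single file before running it in bulk.

```shell
#!/bin/sh
# Derive the .tif name tesseract expects from a .tiff or .tif file name.
tif_name() {
    case $1 in
        *.tiff) printf '%s\n' "${1%.tiff}.tif" ;;
        *.tif)  printf '%s\n' "$1" ;;
        *)      echo "not a TIFF file name: $1" >&2; return 1 ;;
    esac
}

# Re-save an image under a .tif name with the alpha channel turned off.
prepare_tif() {
    src=$1
    dst=$(tif_name "$src") || return 1
    # -alpha off makes ImageMagick write the image without its alpha channel.
    convert "$src" -alpha off "$dst"
}
```

For a whole directory, something like for f in *.tiff; do prepare_tif "$f"; done would apply it to each file; the original .tiff files are left in place.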

To OCR your TIFF image, run:

tesseract inputimage.tif outputtext -l eng

and you should get a file called outputtext.txt.

Tesseract can also make use of the libtiff library (www.libtiff.org).
Without libtiff, Tesseract can only read uncompressed and G3 compressed TIFF files.
This code is a raw OCR engine. It has NO PAGE LAYOUT ANALYSIS, NO OUTPUT FORMATTING, and NO UI. It can only process an image of a single column and create text from it. It can detect fixed pitch vs proportional text. As of 2.0, Tesseract is fully Unicode (UTF-8) enabled, and can recognize 6 languages "out of the box."
