Theresa Arzadon-Labajo

Building Tesseract RPM
Posted by Theresa Arzadon-Labajo (tarzadon) on Jun 30 2009

Download tarball from http://code.google.com/p/tesseract-ocr/

Untar the package:

tar -xzvf tesseract-2.03.tar.gz

cd tesseract-2.03

The spec file, tesseract.spec, is included in the tarball. Copy it to your SPECS directory.
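
rpmbuild usually also expects the source tarball in the SOURCES directory that sits next to SPECS. Here is a minimal sketch of staging both files, assuming the standard SPECS/SOURCES layout and that the spec's Source entry points at the downloaded tarball. stage_tesseract is a made-up helper name, not part of rpm or Tesseract.

```shell
# stage_tesseract: hypothetical helper, not part of rpm or Tesseract.
#   $1 = RPM top directory (the one that holds SPECS and SOURCES)
#   $2 = path to the downloaded tarball, e.g. tesseract-2.03.tar.gz
#   $3 = path to tesseract.spec from the unpacked tarball
stage_tesseract() {
    topdir=$1
    tarball=$2
    spec=$3
    # Create the two directories if they are missing, then copy the files in.
    mkdir -p "$topdir/SPECS" "$topdir/SOURCES" || return 1
    cp "$spec" "$topdir/SPECS/" || return 1
    cp "$tarball" "$topdir/SOURCES/" || return 1
}
```

Example call, run from the directory holding the tarball (rpm --eval prints the top directory rpm is configured to use):

stage_tesseract "$(rpm --eval '%{_topdir}')" tesseract-2.03.tar.gz tesseract-2.03/tesseract.spec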

Before building the RPM, make sure libtiff-devel is installed:
yum install libtiff-devel

setarch i386 rpmbuild -ba tesseract.spec

Wrote: /scratch/tarzadon/rpm/SRPMS/tesseract-2.03-4.src.rpm
Wrote: /scratch/tarzadon/rpm/RPMS/i386/tesseract-2.03-4.i386.rpm
Wrote: /scratch/tarzadon/rpm/RPMS/i386/tesseract-devel-2.03-4.i386.rpm
Wrote: /scratch/tarzadon/rpm/RPMS/i386/tesseract-debuginfo-2.03-4.i386.
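
The libtiff-devel check and the build can be wrapped together so the build stops early when the package is missing. This is only a sketch: build_tesseract_rpm is a hypothetical helper name, and it assumes you pass the path of the spec file you copied into SPECS.

```shell
# build_tesseract_rpm: hypothetical wrapper around the two steps above.
#   $1 = path to tesseract.spec
build_tesseract_rpm() {
    # rpm -q exits non-zero when the package is not installed.
    if ! rpm -q libtiff-devel >/dev/null 2>&1; then
        echo "libtiff-devel is not installed (try: yum install libtiff-devel)" >&2
        return 1
    fi
    # Same build command as above, forcing an i386 build.
    setarch i386 rpmbuild -ba "$1"
}
```

Example call:

build_tesseract_rpm tesseract.spec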

Now you can install the newly created RPM.

rpm -ivh tesseract-2.03-4.i386.rpm

I edited my tesseract.spec so that it would include all the language files as well as pdf2tif, ocr.sh and xsane2tess.

---------------------------------------------------------------------------------------

How to scan and OCR like a pro with open source tools
http://www.linux.com/feature/138511

NOTE: for tesseract to work, the TIFF file you're running it on needs to be renamed to end in .tif (not .tiff) AND it needs to be an image without an alpha channel. If you've renamed the file and tesseract is still barfing, the alpha channel is probably the problem. Use an image conversion utility that can remove alpha channels to re-save your image. For bulk image conversion I recommend ImageMagick (it's open source and runs well on the Mac).

To OCR your TIFF image, run:

tesseract inputimage.tif outputtext -l eng

and you should get a file called outputtext.txt.
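
The renaming and alpha-channel cleanup from the NOTE above could be scripted roughly like this. It is a sketch, not a tested recipe: it assumes ImageMagick's convert is on your PATH and that your version supports the -alpha off option; tif_name and prepare_tif are hypothetical helper names.

```shell
# tif_name: print the path with its extension replaced by .tif
# (or with .tif appended when the file name has no extension).
tif_name() {
    case ${1##*/} in
        *.*) base=${1%.*} ;;   # strip the last extension
        *)   base=$1 ;;        # no extension to strip
    esac
    printf '%s.tif\n' "$base"
}

# prepare_tif: re-save an image as an uncompressed .tif with the
# alpha channel switched off, using ImageMagick's convert.
prepare_tif() {
    convert "$1" -alpha off -compress none "$(tif_name "$1")"
}
```

For example, prepare_tif scan.tiff would write scan.tif, which can then be passed to tesseract as shown above.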

Tesseract can also make use of the libtiff library (www.libtiff.org).
Without libtiff, Tesseract can only read uncompressed and G3 compressed TIFF files.
This code is a raw OCR engine. It has NO PAGE LAYOUT ANALYSIS, NO OUTPUT FORMATTING, and NO UI. It can only process an image of a single column and create text from it. It can detect fixed pitch vs proportional text. As of 2.0, Tesseract is fully Unicode (UTF-8) enabled, and can recognize 6 languages "out of the box."

----------------------------------------------------------------------------------------------------------------

Using Tesseract with Xsane

Xsane Settings
        - XSane->Setup->Filetype
                Check:
                        Reduce 16 bit image to 8 bit
                Set:
                        TIFF zip compression rate to 1
                        TIFF 16 bit image compression to no compression
                        TIFF 8 bit image compression to no compression
                        TIFF lineart image compression to no compression

There are several ways to use Tesseract with Xsane:

- Scan to *.tif, then use tesseract on the command line to OCR:

        tesseract inputimage.tif outputtext -l eng

- Scan to PDF, then use pdf2tif, then tesseract:

        pdf2tif filename.pdf (creates tif images of each page)

- Use ocr.sh, which will take all PDF files in the current directory and turn them into txt.
  Get pdf2tif and ocr.sh from: http://www.groklaw.net/articlebasic.php?story=20061210115516438

        ocr.sh filename-01.tif

- Configure Xsane->Setup->OCR to use the tesseract script xsane2tess. This requires a tmp directory in the user's home directory; you can edit the script to change TEMP_DIR to something else.

        OCR Command = xsane2tess -l eng
        Inputfile option: -i
        Outputfile option: -o
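
If you go with the first option and end up with a directory full of scans, the tesseract command can be looped over every .tif file. A minimal sketch, assuming the same tesseract command line shown above; ocr_all is a hypothetical helper name and is not the ocr.sh script from Groklaw.

```shell
# ocr_all: run tesseract on every .tif file in a directory.
#   $1 = directory to scan (default: current directory)
#   $2 = language code passed to -l (default: eng)
ocr_all() {
    dir=${1:-.}
    lang=${2:-eng}
    for img in "$dir"/*.tif; do
        # If the glob matched nothing, the literal pattern is skipped here.
        [ -e "$img" ] || continue
        # page.tif becomes page.txt, since tesseract appends .txt itself.
        tesseract "$img" "${img%.tif}" -l "$lang" || return 1
    done
}
```

Calling ocr_all scans would leave a .txt file next to each .tif in the scans directory.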

Google Groups

http://groups.google.com/group/tesseract-ocr/

Last changed: Jun 30 2009 at 3:22 PM
