
Gear Guide
Optical Character Recognition (OCR) isn't something I use much, but it's one of those nice-to-have tools to keep filed away in case you need it. Just such an occasion presented itself this week when someone gave me five pages of printed text which they wanted to edit and update, but for which they no longer had the source file. The options were simple: either sit down and laboriously re-type five dense pages, or OCR it. You can guess what I chose.

There are a number of OCR packages available for Linux. There are the KDE-based Kooka and GOCR, for example, but one tester who tried them concluded, "Don't underestimate the value of paying a typist to transcribe your text the old-fashioned way." A few months after that review, Tesseract was released, and he changed his views.

Tesseract was originally developed by HP between 1985 and 1995. In '95 it was one of the top three performers in an OCR accuracy shoot-out, but shortly afterwards HP ditched its OCR business and Tesseract was abandoned. Then, a couple of years ago, they dusted it off again and open sourced it.

You'll find Tesseract an optional add-on for most of the bigger Linux distributions, so it should just be a matter of installing it from your package manager. Note though that it's a dynamic and rapidly advancing project. The version available for Ubuntu is currently 1.02 (from March this year) but the latest version on the Tesseract source site is 2.01 (released on 30 August).

Tesseract requires scanned text to be in the .tif format (although here's a script that'll convert almost any image provided you have ImageMagick installed). From then on it's just a matter of a simple command-line conversion:

tesseract  input-file.tif  output-file

Tesseract generates three output files per input. There are the mysterious .RAW and .MAP files -- which you can discard -- and the .TXT file, which contains your output.
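If you have more than one page to get through, the convert-then-OCR routine is easy to wrap in a small script. What follows is only a rough sketch, not the conversion script mentioned above: the function names ocr_base and ocr_page are made up for illustration, and it assumes ImageMagick's convert and the tesseract command are both on your PATH and that your scans carry an ordinary file extension.

```shell
#!/bin/sh
# Sketch: OCR every scanned page named on the command line.
# Assumption: ImageMagick's `convert` and `tesseract` are installed.
# ocr_base and ocr_page are hypothetical helper names for this example.

# Print the output base name for an image (directory and extension removed).
ocr_base() {
    name=$(basename "$1")
    printf '%s\n' "${name%.*}"
}

# Convert one image to .tif if it isn't one already, then run Tesseract.
# The recognised text ends up in <base>.txt in the current directory.
ocr_page() {
    base=$(ocr_base "$1")
    case "$1" in
        *.tif) tif=$1 ;;
        *)     tif="$base.tif"
               convert "$1" "$tif" || return 1 ;;
    esac
    tesseract "$tif" "$base"
}

# Process each file given as an argument; does nothing if there are none.
for page in "$@"; do
    ocr_page "$page"
done
```

Saved as, say, ocr-pages.sh, you would run it as sh ocr-pages.sh page1.png page2.png and then look for page1.txt and page2.txt alongside whatever other files your version of Tesseract leaves behind.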

And it works like a charm! My five pages of admittedly crisp, clear text took just a couple of minutes to process. Tesseract doesn't (yet) handle multiple columns, but for basic OCR it sure beats hours of re-typing!




Comments

The initial release was English only, but I see since version 2 there's been provision for Dutch, German, Italian, Spanish and French, and since version 2.01 Fraktur and Portuguese too. (See http://code.google.com/p/tesseract-ocr/downloads/list)

How about international text? Does it work with accented characters and Cyrillic?
