Tesseract + Automator = DragnDrop OCR!

Necessity is the mother of invention, it is said. And sometimes it just takes an annoying, repetitive task to push someone to do something.

I’ve always been interested in AppleScript and Automator, Apple’s scripting, automation and batch-processing frameworks. AppleScript is basically a scripting language which lets you command many OS X apps. The amount of control you can exert over an app depends on how it was made (whether the developers put in the hooks for AppleScript or not), but most Apple apps are pretty ‘scriptable’. Automator is automation for noobs. Instead of writing a script, you just drag and drop “actions” to create a “workflow”, which passes the output of one action to the next for processing. It seems pretty lame at first, but once you start making your own droplets and workflows it’s great fun!

Picture 3

So, during one of my labs, the analyzer we were using was unable to save the data we captured during the experiment. It was an old analyzer which used 3.5″ floppy disks, but the disk drive had stopped working. So we decided to take photos of the analyzer’s small screen whenever it displayed the data, and then transcribe them later.

DSC_4634_23

When I saw the sheer number of files which needed to be transcribed (and my entire evening gone doing that), I thought of doing some OCR (Optical Character Recognition). Google helped me find Tesseract, a *nix utility which does OCR. Great. I managed to find a MacPort for it and got it to run on OS X.

OK. So far so good, but Tesseract only accepts one file as input and requires that file to be in .tiff format. Now, I could have written a bash (or perl :P) script to convert all the files to .tiff and then loop over them calling Tesseract, but that’s too much work and surely not the ‘Apple way’. So, I called on Automator.
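For comparison, the script route might look roughly like the sketch below. It is written in Python rather than bash or perl, and it is not what I actually used. It assumes that the macOS `sips` tool does the .jpg to .tif conversion and that `tesseract` is on the PATH; the function names `build_commands` and `ocr_folder` are made up for illustration.

```python
from pathlib import Path
import subprocess


def build_commands(jpg_path):
    """Return the (convert, ocr) command lists for one .jpg file.

    Assumption: 'sips' (macOS only) handles the conversion to TIFF.
    Tesseract is called as 'tesseract <image> <output base>' and
    appends .txt to the output base itself.
    """
    jpg = Path(jpg_path)
    tif = jpg.with_suffix(".tif")
    out_base = jpg.with_suffix("")  # tesseract adds the .txt suffix
    convert = ["sips", "-s", "format", "tiff", str(jpg), "--out", str(tif)]
    ocr = ["tesseract", str(tif), str(out_base)]
    return convert, ocr


def ocr_folder(folder, run=subprocess.run):
    """Convert and OCR every .jpg in 'folder', one file at a time.

    'run' is injectable so the loop can be tried without sips or
    tesseract installed.
    """
    for jpg in sorted(Path(folder).glob("*.jpg")):
        for cmd in build_commands(jpg):
            run(cmd, check=True)  # stop at the first failing command
```

Calling `ocr_folder("photos")` would then leave a .tif and a .txt next to each .jpg in a (hypothetical) `photos` folder.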

After a bit of tweaking and testing, here is my final workflow, which creates a droplet. Any .jpg file dropped on this droplet is duplicated, converted to .tif and OCRed through Tesseract, and the output is stored in a file with the suffix .txt.

Picture 2

The OCR output was not the best. I had to massage the images (crop, rotate, gray-scale, etc.) to get good output.
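To make the massaging concrete, here is a small Python sketch of what those three operations do to an image, treated as a grid of (r, g, b) pixels. It is only an illustration of the idea, not the tool I used: the function names are made up, the gray-scale weights are the common luma approximation (an assumption, other weightings exist), and a real image library would be the sensible choice for actual photos.

```python
def to_grayscale(pixels):
    """Turn rows of (r, g, b) tuples into rows of 0-255 gray values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in pixels
    ]


def crop(pixels, left, top, right, bottom):
    """Keep columns left..right-1 and rows top..bottom-1."""
    return [row[left:right] for row in pixels[top:bottom]]


def rotate_90_clockwise(pixels):
    """Rotate the grid a quarter turn clockwise."""
    # Reverse the row order, then read the columns off as new rows.
    return [list(column) for column in zip(*pixels[::-1])]


# A tiny 3x2 "photo": white, black, red / green, blue, dark gray.
photo = [
    [(255, 255, 255), (0, 0, 0), (255, 0, 0)],
    [(0, 255, 0), (0, 0, 255), (10, 10, 10)],
]

gray = to_grayscale(photo)
trimmed = crop(gray, 1, 0, 3, 2)        # drop the left-most column
upright = rotate_90_clockwise(trimmed)  # fix a sideways shot
```

Cropping away everything but the digits and removing colour gives the OCR engine less noise to trip over, which is why these steps tend to help.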

You can download it here, but you’ll need Tesseract to make it work. Yay!

Btw, if you’re interested in Automator, check out the videocast MacBreak Ep235-238, and also Ep4-13 of MacBreak Dev.


2 comments so far

  1. Bryan on

    Hey, thanks for posting, this looks cool but the dropbox link is 404ing! Would you mind fixing it, I’d love to try this.

    Thanks!

  2. NTT on

    @Bryan.

    Thanks. I fixed the link. Try again.. :)

