Craig Pickering - 1 year ago
iOS Question

Tesseract OCR won't recognize division symbol "÷"

I am using Tesseract on iOS 8 for an OCR-based app, but it incorrectly converts the division symbol "÷" in the image to a plus sign "+".

For example, this image

[Image: Simple arithmetic expression]

always converts to the text string "8+4+4". It should be "8+4÷4".

I've tried different trained data language files ("eng+equ", "ita"), adding "÷" to the whitelist, setting the ocr_engine variable to cube, converting the image to grayscale or black and white, and upsizing the image by 2 and 4 times.

Everything I've tried always returns a plus "+" sign instead of a division "÷" symbol.

I tried using only the "equ" trained data file and that DOES return the division symbol correctly - but all other characters are then garbage.

I've been looking into this (Google, Stack Overflow) for several days and cannot figure it out.

How do I get Tesseract to include and recognize the division "÷" symbol?


The best I have been able to do is to set the AVCaptureSession preset to high:

AVCaptureSession *session = [[AVCaptureSession alloc] init];
session.sessionPreset = AVCaptureSessionPresetHigh;

The dimensions of the captured image above are then 676 × 405 pixels. I then use the UIImage category from Tesseract OCR to binarize the image (the image is named 'source'):

// Binarize the source image to improve contrast (using the UIImage category provided by TesseractOCR)
UIImage *blackAndWhiteImage = [source blackAndWhite];
[self.tesseract setImage:blackAndWhiteImage];

With this, the division symbol is usually converted to the text "-1-", but I have also seen "-:-", as well as other digits and uppercase characters between the minus signs.

I can check for that pattern in the returned text, but then it is impossible to know whether a result such as "8-1-2" is a true subtraction or a misread division.
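A minimal sketch of that check, written in plain C so it also compiles inside an Objective-C file. The names rewrite_division and DivisionScan are hypothetical, and the rule "a non-digit between two minus signs means division" is only an assumption based on the outputs described above, not something Tesseract guarantees. It rewrites the cases that look unambiguous (such as "-:-") and merely counts the ones with a digit in the middle (such as "-1-").

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Counts collected while scanning the OCR output (hypothetical helper type). */
typedef struct {
    int replaced;   /* "-X-" with a non-digit X, rewritten to the division sign */
    int ambiguous;  /* "-X-" with a digit X, left unchanged */
} DivisionScan;

/*
 * Copies the NUL-terminated UTF-8 string `text` into `out`.
 * Every "-X-" whose middle byte is not an ASCII digit (for example "-:-")
 * is replaced by the division sign. A "-X-" with a digit in the middle
 * (for example "-1-") is copied unchanged and only counted, because it
 * may be a real subtraction. `out` needs room for strlen(text) + 1 bytes;
 * the output is truncated if the buffer is smaller.
 */
DivisionScan rewrite_division(const char *text, char *out, size_t out_size)
{
    static const char division_sign[] = "\xC3\xB7"; /* UTF-8 bytes of U+00F7 */
    DivisionScan scan = {0, 0};
    size_t len = strlen(text);
    size_t i = 0;
    size_t o = 0;

    if (out_size == 0) {
        return scan;
    }

    while (i < len && o + 1 < out_size) {
        int is_pattern = text[i] == '-' && i + 2 < len && text[i + 2] == '-';

        if (is_pattern && !isdigit((unsigned char)text[i + 1])) {
            if (o + 2 >= out_size) {
                break; /* no room for the two-byte sign plus the terminator */
            }
            out[o++] = division_sign[0];
            out[o++] = division_sign[1];
            i += 3; /* skip the whole "-X-" run */
            scan.replaced++;
            continue;
        }
        if (is_pattern) {
            scan.ambiguous++; /* digit between the minus signs: cannot decide */
        }
        out[o++] = text[i++];
    }
    out[o] = '\0';
    return scan;
}
```

From Objective-C, the input could come from NSString's UTF8String and the result could be turned back into a string with stringWithUTF8String:. A non-zero ambiguous count is the "8-1-2" situation: the sketch reports it but cannot resolve it, so the app would still need another signal (or better recognition) for those cases.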

Answer

Train the OCR engine with different fonts.

Here is the tool for training the engine. Have a look at this as well.

Or you can use jTessBoxEditor.