Raptor - 18 days ago
Android Question

Tess-Two (Tesseract OCR in Android) shows very inaccurate results

I use the following function to perform offline OCR with Tess-Two, the Android fork of Tesseract OCR:

    private String startOCR(Uri imgUri) {
        try {
            ExifInterface exif = new ExifInterface(imgUri.getPath());
            int exifOrientation = exif.getAttributeInt(ExifInterface.TAG_ORIENTATION, ExifInterface.ORIENTATION_NORMAL);
            int rotate = 0;
            switch (exifOrientation) {
                case ExifInterface.ORIENTATION_ROTATE_90:
                    rotate = 90;
                    break;
                case ExifInterface.ORIENTATION_ROTATE_180:
                    rotate = 180;
                    break;
                case ExifInterface.ORIENTATION_ROTATE_270:
                    rotate = 270;
                    break;
            }
            Log.d(TAG, "Rotation: " + rotate);

            BitmapFactory.Options options = new BitmapFactory.Options();
            options.inSampleSize = 4; // 1 - means max size. 4 - means maxsize/4 size. Don't use value <4, because you need more memory in the heap to store your data.
            // set to 300 dpi
            options.inTargetDensity = 300;
            Bitmap bitmap = BitmapFactory.decodeFile(imgUri.getPath(), options);

            // Change Orientation via EXIF
            if (rotate != 0) {
                // Getting width & height of the given image.
                int w = bitmap.getWidth();
                int h = bitmap.getHeight();

                // Setting pre rotate
                Matrix mtx = new Matrix();
                mtx.preRotate(rotate);

                // Rotating Bitmap
                bitmap = Bitmap.createBitmap(bitmap, 0, 0, w, h, mtx, false);
            }

            // To Grayscale
            bitmap = toGrayscale(bitmap);

            final Bitmap b = bitmap;

            final ImageView ivResult = (ImageView) findViewById(R.id.ivResult);
            if (ivResult != null) {
                runOnUiThread(new Runnable() {
                    @Override
                    public void run() {
                        ivResult.setImageBitmap(b);
                    }
                });
            }
            return extractText(bitmap);
        } catch (Exception e) {
            Log.e(TAG, e.getMessage());
            return "";
        }
    }

and here is the extractText() method:

    private String extractText(Bitmap bitmap) {
        //Log.d(TAG, "extractText");
        try {
            tessBaseApi = new TessBaseAPI();
        } catch (Exception e) {
            Log.e(TAG, e.getMessage());
            if (tessBaseApi == null) {
                Log.e(TAG, "TessBaseAPI is null. TessFactory not returning tess object.");
            }
        }

        tessBaseApi.init(DATA_PATH, lang);

        //EXTRA SETTINGS
        tessBaseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, "abcdefghijklmnopqrstuvwxyz1234567890',.?;/ ");

        Log.d(TAG, "Training file loaded");
        tessBaseApi.setDebug(true);
        tessBaseApi.setPageSegMode(TessBaseAPI.PageSegMode.PSM_AUTO_OSD);
        tessBaseApi.setImage(bitmap);
        String extractedText = "empty result";
        try {
            extractedText = tessBaseApi.getUTF8Text();
        } catch (Exception e) {
            Log.e(TAG, "Error in recognizing text.");
        }
        tessBaseApi.end();
        return extractedText;
    }

The value returned by extractText() is shown in the following screenshot:

[screenshot: inaccurate result]

Accuracy is very low, even though I convert the image to grayscale and upscale it to 300 dpi before performing OCR. How can I improve the results? Is the trained data not good enough?

Answer

I ran some tests and have a few observations that could improve your results.

  1. Try passing both lowercase and uppercase letters in your VAR_CHAR_WHITELIST parameter:

See my results for this input:

[input image]

a) Lowercase only:

Parameter:

baseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, "abcdefghijklmnopqrstuvwxyz1234567890',.?;/ ");

Result:

05 atenienses nnito, hdeleto e laicao, os principais acusadores de gocrates, nao defendiam apenas que o filosofo corrompia a juventude; eles lutavam tama bern pelas virtudes da tradigao poetica vinculada a liornero. nristofanes, um dos responsaveis, segundo socrates, dos preconceitos contra o filosofo, era outro grande defensor dessa virtude.

socrates, de certa forma, estava em guerra com a tradieao poetica grega. 0 metodo de socrates era o oposto a narrativa epica de tlornero. sua dialetica nao tinha nada de semideuses corn superpoderes 6

b) Uppercase and Lowercase letters:

Parameter:

baseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, "aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ1234567890',.?;/ ");

Result:

Os atenienses Anito, Meleto e Licao, os principais acusadores de Socrates, nao defendiam apenas que o filosofo corrompia a juventude; eles lutavam tama bern pelas virtudes da tradigao poetica vinculada a Homero. Aristofanes, um dos responsaveis, segundo socrates, dos preconceitos contra o filosofo, era outro grande defensor dessa virtude.

socrates, de certa forma, estava em guerra com a tradieao poetica grega. O metodo de socrates era o Oposto a narrativa epica de Homero. Sua dialetica nao tinha nada de semideuses corn superpoderes 6

PS: I ran this example on Portuguese text. Note that words needing accented characters such as 'é', 'ó' and 'ç' were not recognized correctly, because those characters were not included in the whitelist.

I also ran it on your picture; the result improved, though not by much:
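
To avoid typing both cases by hand, the whitelist can be generated. The following is only a sketch in plain Java: OcrWhitelist and build() are hypothetical names of my own, not part of Tess-Two, and whether accented characters are actually recognized still depends on the trained data for the language you load.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper (not part of Tess-Two): builds a whitelist string that
// contains every given letter in both lowercase and uppercase, plus extras.
final class OcrWhitelist {
    private OcrWhitelist() {}

    static String build(String lowercaseLetters, String extras) {
        // LinkedHashSet keeps insertion order and drops duplicate characters.
        Set<Character> chars = new LinkedHashSet<>();
        for (char c : lowercaseLetters.toCharArray()) {
            chars.add(c);
            chars.add(Character.toUpperCase(c));
        }
        // Digits, punctuation and the space character are added unchanged.
        for (char c : extras.toCharArray()) {
            chars.add(c);
        }
        StringBuilder sb = new StringBuilder(chars.size());
        for (char c : chars) {
            sb.append(c);
        }
        return sb.toString();
    }
}
```

The returned string would then be passed as the second argument of baseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, ...). For Portuguese, the first argument could include 'é', 'ó' and 'ç' in addition to a-z.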

Font 20; Which polrlrcran has caplured Ihe curve, summed up a growing mood. In a Ierocrous speech? 'Your iron industry is dead. dead as munon. Your coal yum mono greatly on the iron Vbur Ilk Mary is and. o Your woolen induslry is Why. Your canon Mr Wilding induslry. blmailf

So I checked how Tesseract binarized the image:

[image: thresholded image]
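
For context, binarizing means turning every pixel into pure black or pure white by comparing it against a threshold. As far as I know, Tesseract's built-in thresholder is based on Otsu's method. The sketch below is my own plain-Java illustration of that method, not Tesseract's code; the class name and the input format (grayscale values 0-255 in an int array) are assumptions.

```java
// Illustrative Otsu threshold (independent sketch, not Tesseract's implementation).
final class OtsuThreshold {
    private OtsuThreshold() {}

    // Returns t in [0, 255]; pixels <= t are treated as dark, pixels > t as light.
    static int compute(int[] gray) {
        int[] hist = new int[256];
        for (int v : gray) {
            hist[v]++;
        }
        long total = gray.length;
        long sumAll = 0;
        for (int i = 0; i < 256; i++) {
            sumAll += (long) i * hist[i];
        }

        long weightDark = 0;   // number of pixels with value <= t
        long sumDark = 0;      // sum of those pixel values
        double bestVariance = -1;
        int best = 0;
        for (int t = 0; t < 256; t++) {
            weightDark += hist[t];
            if (weightDark == 0) continue;     // no dark pixels yet
            long weightLight = total - weightDark;
            if (weightLight == 0) break;       // every pixel is already dark
            sumDark += (long) t * hist[t];
            double meanDark = (double) sumDark / weightDark;
            double meanLight = (double) (sumAll - sumDark) / weightLight;
            double diff = meanDark - meanLight;
            // Between-class variance (up to a constant factor); maximise it.
            double variance = (double) weightDark * weightLight * diff * diff;
            if (variance > bestVariance) {
                bestVariance = variance;
                best = t;
            }
        }
        return best;
    }
}
```

A single global threshold like this works well on clean, evenly lit scans, but noise and shadows shift pixels across the threshold, which is how whole regions of text can turn black or white.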

Your image has a lot of noise, so when the API binarizes it, a large part of the picture becomes illegible. I suggest running it again without converting to grayscale first, and looking into ways to reduce the noise in your image.

I hope this is useful. Thank you for sharing your issue!

Best regards!
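
As one possible starting point for noise reduction, a small median filter is a common way to remove isolated specks before OCR. This is only a sketch in plain Java under my own assumptions: MedianFilter is a hypothetical class, and the input is a row-major array of grayscale values (0-255). On Android you would first have to extract such an array from the Bitmap, for example via Bitmap.getPixels(), and I have not measured how much it helps on your image.

```java
import java.util.Arrays;

// Hypothetical helper: 3x3 median filter over a row-major grayscale image.
final class MedianFilter {
    private MedianFilter() {}

    // Returns a new array; the input array is left unchanged.
    static int[] apply3x3(int[] gray, int width, int height) {
        int[] out = new int[gray.length];
        int[] window = new int[9];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int n = 0;
                for (int dy = -1; dy <= 1; dy++) {
                    for (int dx = -1; dx <= 1; dx++) {
                        // Clamp coordinates so border pixels reuse their nearest neighbours.
                        int yy = Math.min(height - 1, Math.max(0, y + dy));
                        int xx = Math.min(width - 1, Math.max(0, x + dx));
                        window[n++] = gray[yy * width + xx];
                    }
                }
                // The median of the 9 samples replaces the centre pixel.
                Arrays.sort(window);
                out[y * width + x] = window[4];
            }
        }
        return out;
    }
}
```

A median filter removes single-pixel specks while keeping edges sharper than a blur would, but a window that is too large can also erase thin strokes of small text, so it is worth comparing the OCR output with and without it.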
