Harsh Wardhan - 6 days ago
Python Question

Tesseract OCR gives misaligned output text

I have an image that looks like this:

SOME STUFF HERE

DEPARTMENT OF PATHOLOGY

Name : MR. V. HUGO Age/Sex : 31 Y(s)/Male

Bill Date : 28-Apr-2016 08:48 AM UMR No : ODC61995

Sample Date : 28-Apr-2016 09:38 AM Bill No : BIL130579

Report Date : 28-Apr-2016 04:21 PM Result No : RES378704


AND SOME MORE STUFF HERE


The image above is rectangular and taller than it is wide. I crop it to the portion that I need to read, which looks like this:

Name : MR. V. HUGO Age/Sex : 31 Y(s)/Male

Bill Date : 28-Apr-2016 08:48 AM UMR No : ODC61995

Sample Date : 28-Apr-2016 09:38 AM Bill No : BIL130579

Report Date : 28-Apr-2016 04:21 PM Result No : RES378704


The cropped image is wider than it is tall. However, the output I get from it is misaligned: the labels and values come out grouped by column instead of line by line:

Name
Bill Date
Sample Date
Report Date

MR. V. HUGO
28-Apr-2016 08:48 AM
28-Apr-2016 09:38 AM
28-Apr-2016 04:21 PM

Age/Sex
UMR No
Bill No
Result No

31 Y(s)/Male
ODC61995
BIL130579
RES378704


Can anybody explain why this happens? Without cropping, the output is aligned properly, but it contains more recognition errors. My idea is to run Tesseract OCR only on the relevant portion of the image. I get the same result with and without the Python wrapper.

P.S. - I also get misaligned output like the above when I apply erosion/dilation to the uncropped image before passing it to Tesseract.

Answer

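The problem is caused by Tesseract's automatic page segmentation. In the default automatic mode, the layout analysis appears to treat the labels and the values as separate columns and to emit each column in turn, which matches the output you are seeing. Set the page segmentation mode to 4 (PSM_SINGLE_COLUMN), which assumes a single column of text of variable sizes, so the text is read line by line. Note that Tesseract 4 and later spell this option --psm rather than -psm.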

tesseract example.jpg out -l eng -psm 4
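
Since the question is about Python, here is a minimal sketch of running the same command from a script through the standard library, without relying on any particular wrapper. The helper names build_tesseract_command and run_ocr, and the legacy_flag parameter, are hypothetical and only for illustration; the sketch assumes the tesseract executable is on your PATH and that it writes its result to <output base>.txt.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng", psm=4,
                            legacy_flag=False):
    """Build the argv list for a Tesseract run with an explicit PSM.

    legacy_flag=True uses the Tesseract 3.x spelling "-psm";
    the default uses "--psm", as accepted by Tesseract 4 and later.
    """
    psm_flag = "-psm" if legacy_flag else "--psm"
    return ["tesseract", image_path, output_base, "-l", lang, psm_flag, str(psm)]


def run_ocr(image_path, output_base, **kwargs):
    """Run Tesseract and return the recognized text (hypothetical helper)."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    command = build_tesseract_command(image_path, output_base, **kwargs)
    subprocess.run(command, check=True)
    # Tesseract appends ".txt" to the output base name for plain-text output.
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()


if __name__ == "__main__":
    # Same invocation as the command above, using the 3.x flag spelling.
    print(build_tesseract_command("example.jpg", "out", legacy_flag=True))
```

Calling run_ocr("example.jpg", "out") would then run the OCR with mode 4 and return the text. If mode 4 still splits the fields on your image, trying another mode by passing a different psm value is a cheap experiment.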