Read Text with OCR Issues

JoshBryan · ‎16-12-21

A client was having some small issues with their "Read Text with OCR". The OCR worked most of the time but sometimes produced small errors. Mainly it had trouble "reading" Vs and Ws, and some 7s were being read as 1s. We were able to figure out a way to get the Read Text with OCR to work correctly. I hope this works for you and your environment.

The OCR engine that Blue Prism uses is Tesseract. When we updated the Tesseract traineddata file to the newest version, it was better able to recognize the text we were working with.

We downloaded the English version of the Tesseract traineddata from here. The files are in alphabetical order, so if English is desired select the eng.traineddata option.

On the machine running Blue Prism, you will need to navigate to the Tesseract folder inside the Blue Prism installation. The default location is "C:\Program Files\Blue Prism Limited\Blue Prism Automate\Tesseract\tessdata". Once there you will see a file called eng.traineddata. At this point you have 2 options:
1- Keep the existing file (rename it to something like OLD.traineddata if you like) and move the newly downloaded eng.traineddata into the folder under a different name. In our case we left the original as eng.traineddata and renamed the new file BEST.traineddata (that will come into play later).
2- Delete the existing eng.traineddata and replace it with the newly downloaded eng.traineddata.
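
If you would rather script the file swap than do it by hand, here is a rough Python sketch of both options. The function name install_traineddata and its parameters are my own invention for illustration, not anything Blue Prism provides. It assumes you have already downloaded the file and have write access to the tessdata folder (under Program Files that usually means running elevated). Note one difference from option 2 as written: instead of deleting the file being replaced, the sketch keeps it as OLD.traineddata so you can roll back.

```python
import shutil
from pathlib import Path


def install_traineddata(downloaded, tessdata_dir, language_name="BEST", backup_name="OLD"):
    """Copy a downloaded .traineddata file into the tessdata folder.

    language_name="BEST" adds BEST.traineddata beside the original (option 1).
    language_name="eng" replaces eng.traineddata (option 2), but keeps the
    previous file as OLD.traineddata instead of deleting it (my assumption).
    """
    tessdata_dir = Path(tessdata_dir)
    target = tessdata_dir / f"{language_name}.traineddata"

    if target.exists():
        # Preserve whatever is about to be replaced under the backup name.
        target.replace(tessdata_dir / f"{backup_name}.traineddata")

    shutil.copy2(downloaded, target)
    return target


# Example call (paths are placeholders, adjust to your machine):
# install_traineddata(
#     r"C:\Users\me\Downloads\eng.traineddata",
#     r"C:\Program Files\Blue Prism Limited\Blue Prism Automate\Tesseract\tessdata",
#     language_name="BEST",
# )
```

With language_name="BEST" the original eng.traineddata is left untouched, which matches the side-by-side setup described in this post.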

My original is named eng.traineddata and the new one is BEST.traineddata, so my tessdata folder now contains both files.

With the new traineddata in place, go back to your Read stage that was having difficulties. Continue using the same object. If you gave the traineddata a new name, you will need to call out the new name in the Language section. In my case, that was "BEST". If you replaced the traineddata you do not need to make any changes.

Having both versions of the traineddata on the machine allowed us to call either version of the OCR engine. If I wanted to use the original version I would set the Language section to "eng" (everything before the period in the original traineddata file name). This has the original Tesseract data work on the OCR. If I wanted the new version I would set the language to "BEST".
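
Since the Language value is just the file name without the .traineddata extension, a quick way to see which values are available on a machine is to list the folder. This is only a small sketch; available_languages is a made-up helper name, and it simply reads file names rather than asking Blue Prism or Tesseract anything.

```python
from pathlib import Path


def available_languages(tessdata_dir):
    """Return the Language values implied by the files in a tessdata folder.

    Each <name>.traineddata file corresponds to the Language value <name>.
    """
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))


# Example call (default path from this post, adjust if yours differs):
# print(available_languages(
#     r"C:\Program Files\Blue Prism Limited\Blue Prism Automate\Tesseract\tessdata"
# ))
```

For the setup in this post, the list would include both "BEST" and "eng".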

I did not have to restart the service or application for the changes to work. With this update, we found the OCR is more accurate. I hope that can help you to get the OCR to "read" better.

Thanks,

------------------------------
Josh Bryan
------------------------------

devneetmohanty07 · ‎16-12-21

Thanks a lot for sharing it with us @Josh Bryan

Working with Tesseract OCR has definitely been a tricky, hit-or-miss kind of thing even for me in my prior engagements. This surely is a great tip to try and explore :)

------------------------------
----------------------------------
Hope it helps you, and if it resolves your query please mark it as the best answer so that others having the same problem can track the answer easily

Regards,
Devneet Mohanty
Intelligent Automation Consultant
Blueprism 6x Certified Professional
Website: https://devneet.github.io/
Email: [email protected]

----------------------------------
------------------------------
