Issue

Amruthasimplify · ‎29-03-23

I am using Blue Prism's Read with OCR function to extract the NRIC number from a PDF form that is system generated. Although it performs as expected in most situations, it sometimes misinterprets certain characters, such as reading "S" as "$", "I" as ")", and "O" as "0".

Despite experimenting with different combinations of page segmentation, character whitelisting, and scaling, the characters are still being misread. I would greatly appreciate it if someone could suggest alternative approaches to tackle this problem.

System Generated PDF Section.

Configuration as below.

------------------------------
Amrutha Sivarajan
------------------------------

michaeloneil · ‎01-04-23

Hi @Amrutha Sivarajan

There are a number of alternatives on the DX that can read from PDF files. Alternatively, you could open the file in Word and extract the information directly from there, which might be simpler. Or, if you need to open it in Adobe, you could use the select all and copy to clipboard actions to get the data and drop it into a Word document or Excel file, where the information is easy to get at.

------------------------------
Michael ONeil
Technical Lead developer
NTTData
Europe/London
------------------------------

John__Carter · ‎04-04-23

If the PDF contains selectable text then the 'copy all to clipboard and then parse' technique Michael mentions is an option. The advantage is that the data will be 100% accurate, because it's not being interpreted by OCR. The disadvantage is that it may be hard to devise parsing logic that works on all your files; it depends on how the pages have been arranged.
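As a rough illustration of the 'parse' half, here is a minimal Python sketch that pulls NRIC-shaped tokens out of text copied from the PDF. It assumes the usual NRIC/FIN layout of one prefix letter (S, T, F, G or M), seven digits and one suffix letter, and it does not check the checksum letter. The function name and the sample text are made up for the example; in Blue Prism the same idea could sit in a code stage or a regex utility action.

```python
import re
from typing import List

# Assumed layout (not stated in the thread): prefix letter, 7 digits, suffix letter.
NRIC_PATTERN = re.compile(r"\b[STFGM]\d{7}[A-Z]\b")


def find_nric_candidates(clipboard_text: str) -> List[str]:
    """Return every NRIC-shaped token in the text, in order of appearance."""
    # Upper-case first so a lower-case prefix or suffix is still matched.
    return NRIC_PATTERN.findall(clipboard_text.upper())


# Hypothetical clipboard contents after "select all" and "copy".
sample = "Name: TAN AH KOW\nNRIC No.: S1234567D\nDate: 29-03-23"
print(find_nric_candidates(sample))
```

If the list comes back empty, or with more than one entry, that is a sign the page layout differs from what the parsing logic expects and the case is best routed to an exception.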

There are third-party libraries that can extract text from PDFs (PDFPig for example), but these too can struggle to supply text in a consistent format/structure.

If the PDF text is not selectable, i.e. the file is a scan/image of text, then OCR is the only option. And because OCR is an interpretation that depends on the image quality, there is always the chance of a mistake, and the ones you mention, such as zero and capital "O", are typical. Note that MS Word also uses OCR to open such files and is subject to similar issues.

You could apply 'post-reading' clean-up rules, e.g. if you know a value cannot possibly contain "$" then you could replace it with "S". Similarly, if a numeric field is read as capital "O" then you could assume it should be a zero. But such rules can only take you so far and you have to be very confident in your assumed substitutions. The Character Whitelist input can be used in a similar fashion: if you know you are reading a currency field then the whitelist might be "$01234567.89".
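A minimal Python sketch of such clean-up rules for the NRIC case might look like the following. It assumes the value is one prefix letter (S, T, F, G or M), seven digits and one suffix letter, which is not stated in the thread, and it only uses the confusions reported above ("$" for "S", ")" for "I", "0" for "O"). The names `clean_nric`, `TO_LETTER` and `TO_DIGIT` are made up for the example.

```python
import re
from typing import Optional

# Assumed layout: one prefix letter, seven digits, one suffix letter.
NRIC_PATTERN = re.compile(r"[STFGM]\d{7}[A-Z]")

# Substitutions for positions that must hold a letter.
TO_LETTER = {"$": "S", ")": "I", "0": "O"}
# Substitutions for positions that must hold a digit.
TO_DIGIT = {"O": "0"}


def clean_nric(raw: str) -> Optional[str]:
    """Apply position-aware substitutions; return None if the result is still invalid."""
    text = "".join(raw.split()).upper()  # drop stray whitespace from the OCR output
    if len(text) != 9:
        return None
    fixed = (
        TO_LETTER.get(text[0], text[0])
        + "".join(TO_DIGIT.get(ch, ch) for ch in text[1:8])
        + TO_LETTER.get(text[8], text[8])
    )
    # Only accept the value if it now has the expected shape.
    return fixed if NRIC_PATTERN.fullmatch(fixed) else None


print(clean_nric("$12O4567D"))
print(clean_nric("S12X4567D"))
```

Because the substitution depends on the character's position, a "0" in the digit block is left alone while a "0" in the final letter position becomes "O". Anything that still fails the pattern returns None, so it can be sent for manual review rather than guessed at.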

The Scale parameter can really help, but it's not the case that 'bigger is better'. Sometimes you have to experiment by incrementing the scale until you find the optimum value, and you often find that quality decreases after a certain point.
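The scale experiment can also be automated as a retry loop: read at one scale, check whether the result has the expected shape, and move to the next scale if it does not. The Python sketch below only shows that logic. `read_at_scale` is a hypothetical stand-in for whatever performs the OCR read (in Blue Prism this would be a loop around the Read stage with a different Scale value each time), and the NRIC pattern is the same assumed layout of prefix letter, seven digits and suffix letter.

```python
import re
from typing import Callable, Iterable, Optional, Tuple

# Assumed layout: one prefix letter, seven digits, one suffix letter.
NRIC_PATTERN = re.compile(r"[STFGM]\d{7}[A-Z]")


def first_valid_read(
    read_at_scale: Callable[[float], str],
    scales: Iterable[float],
) -> Optional[Tuple[float, str]]:
    """Try each scale in order; return the first (scale, text) that looks like an NRIC."""
    for scale in scales:
        text = "".join(read_at_scale(scale).split()).upper()
        if NRIC_PATTERN.fullmatch(text):
            return scale, text
    return None  # no scale produced a plausible value


# Stand-in OCR results for the example; a real run would call the OCR engine.
fake_results = {1.0: "$1234567D", 1.5: "S12345670", 2.0: "S1234567D"}
print(first_valid_read(lambda s: fake_results[s], [1.0, 1.5, 2.0]))
```

A value that passes the pattern check is only plausible, not proven correct, so this narrows down a good scale rather than guaranteeing the read.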

------------------------------
John Carter
Blue Prism
------------------------------

SS&C Blue Prism Community
