Decipher recognizes extra char in the target field

dmma · ‎15-03-23

Hello,

I am trying to train Decipher to extract data from invoices. On some invoices, Decipher extracts the ID number incorrectly: it adds an extra 'O' character, or sometimes reads the digit zero ('0') as the letter 'O'.

Has anyone had a similar issue, and are there any suggestions on how to resolve it?

Thanks.

------------------------------
Kind regards,

Dmitrij Mamajev
Senior RPA Developer
Substorm AB
Gothenburg - Sweden
------------------------------


marius-erbert · ‎16-03-23

Hello,

This is a known issue with the Tesseract engine. It's not only 0/O, but also other "similar" characters like 4/A, 8/B, 5/S etc.

According to GitHub (e.g. https://github.com/tesseract-ocr/tesseract/issues/2738), the issue should be fixed in Tesseract version 6.

Unfortunately, there is no solution to this. As a workaround, I have added an extra validation field so that such cases are detected. In my case the field is an IBAN, which fortunately has a precisely defined number of characters.
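To illustrate the kind of check I mean, here is a minimal Python sketch that could run on the extracted value after Decipher, not inside it. The function name `iban_is_valid` and the small length table are my own assumptions for the example. It combines the fixed length per country with the standard IBAN mod-97 check, so both an extra character and a 0/O swap should be flagged.

```python
import string

# Illustrative subset of IBAN lengths per country (assumption: extend as needed).
IBAN_LENGTHS = {"DE": 22, "GB": 22, "SE": 24}
ALLOWED = set(string.ascii_uppercase + string.digits)


def iban_is_valid(raw: str) -> bool:
    """Return True if the value passes the length and mod-97 checks."""
    iban = raw.replace(" ", "").upper()
    if len(iban) < 5 or not set(iban) <= ALLOWED:
        return False
    # An extra OCR character changes the length, so check that first.
    expected = IBAN_LENGTHS.get(iban[:2])
    if expected is not None and len(iban) != expected:
        return False
    # Move the first four characters to the end, map A..Z to 10..35,
    # and check that the resulting number leaves remainder 1 modulo 97.
    rearranged = iban[4:] + iban[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1
```

A value that fails this check can then be sent to manual verification instead of being passed on.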

You can also try to improve the quality of your documents (300 dpi minimum). This should reduce the number of misread and duplicated characters.

BR

Marius

------------------------------
Marius Erbert
------------------------------

dmma · ‎16-03-23

Thanks @MariusErbert

I was thinking maybe somebody had come up with a formula solution in Decipher to tackle this issue 🙂

I was trying to write a formula to replace certain combinations of chars, but I am still not sure how Decipher formulas work. There is not enough information available.

I thought about the document resolution, but the documents are sent by ~100 different vendors, so it would take a lot of effort to ask each and every vendor to send their invoices in a higher resolution 🙂

Thanks anyway!


Ben.Lyons1 · ‎16-03-23

Hi Dmitrij,

Depending on how consistent the field format is, you could use a Format Expression. This is used not just to validate but also to match data, potentially correcting misrecognised characters. E.g. [A-Z]{2}[0-9]{9}[A-Z]{1}[0-9]{2} or similar, depending on the actual format.

Formulas have two separate functions, validation and calculation, and generally these should not be mixed. A validation formula is used on an assigned field that appears in the document, whereas a calculated field should not be assigned to a field in the document.
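To show the idea of using a known format to correct look-alike characters, here is a rough Python sketch. It is not how Decipher implements Format Expressions; it is only an illustration of a post-processing step outside Decipher. The format is the example pattern above, and `TEMPLATE` and `correct_id` are hypothetical names made up for this example. The look-alike pairs are the ones mentioned earlier in this thread (0/O, 4/A, 8/B, 5/S).

```python
import re

# Example format from this post; replace with the real ID format.
FORMAT = re.compile(r"[A-Z]{2}[0-9]{9}[A-Z]{1}[0-9]{2}")
# Per-position template: L = letter expected, D = digit expected.
TEMPLATE = "LL" + "D" * 9 + "L" + "DD"

# Look-alike pairs mentioned in this thread.
TO_DIGIT = {"O": "0", "A": "4", "B": "8", "S": "5"}
TO_LETTER = {digit: letter for letter, digit in TO_DIGIT.items()}


def correct_id(value: str):
    """Swap look-alike characters by position; return None if still invalid."""
    value = value.strip().upper()
    # An extra or missing character cannot be repaired by position.
    if len(value) != len(TEMPLATE):
        return None
    fixed = "".join(
        TO_DIGIT.get(ch, ch) if kind == "D" else TO_LETTER.get(ch, ch)
        for ch, kind in zip(value, TEMPLATE)
    )
    return fixed if FORMAT.fullmatch(fixed) else None


print(correct_id("SE12345O789X01"))   # letter 'O' sits in a digit position
print(correct_id("SE123450O789X01"))  # extra character, so no safe fix
```

Note that this only handles substitutions. A value with an extra character is rejected rather than guessed at, so it would still need manual review.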

Have you watched the video on formulas in the online help?

Thanks

------------------------------
Ben Lyons
Senior Product Specialist - Decipher
Blue Prism
UK based
------------------------------

dmma · ‎17-03-23

Hello Ben,

I had applied a regex similar to the one you proposed, but it was still picking up the data incorrectly... I decided to remove the format expression, and Decipher started to recognize the value without any doubts or extra chars.

But now it's unclear how it will perform in PROD. Potentially it might pick up a completely random value for that field 🙂

Thanks for the reply!
