discovered strange behavior

zdenek.kabatek · 29-06-22

Hi,

A client is using Decipher and has discovered the following behavior.
Assumptions:
a) training is finished
b) the same documents are being sent again

1. A document is pushed to Decipher.
2. The document is OCRed quite quickly and is ready for verification.
3. We load the batch in the Verification station: only certain fields are extracted, and some tables on some pages are not recognized at all.
4. We go to the admin panel and resend the batch, not from the Capture step but from the OCR step. This time it takes much longer than the first run (3-5 times longer) to reach "Capture performed, ready for Verify".
5. We load the batch in the Verification station again, and all of a sudden the tables are recognized on all pages.

When we resend the batch from the Capture step only, the result is different.

Can you explain why this is happening?

Has anyone else experienced this as well?

Thanks.

Regards

Zdenek

------------------------------
Zdeněk Kabátek
Head of Professional Services
NEOOPS
http://www.neoops.com/
Europe/Prague
------------------------------

Ben.Lyons1 · 29-06-22

Hi Zdenek,

This is likely a case where the document has vector data for most of the text, but not all of it. PDFs often contain text in an extractable form called "vector data", which lets applications read it without OCR. You can test this by checking whether you can highlight the text with your cursor in Adobe Reader (or similar).
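The highlight test can also be scripted across many pages. Below is a minimal Python sketch, assuming the third-party `pypdf` package is available; the function names and the `min_chars` threshold are illustrative assumptions, not anything from Decipher.

```python
from typing import List


def extract_page_texts(pdf_path: str) -> List[str]:
    """Return the embedded (vector) text of each page, or "" if there is none.

    Assumption: the third-party pypdf package is installed. It is imported
    lazily so the rest of this sketch runs without it.
    """
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [(page.extract_text() or "") for page in reader.pages]


def pages_without_text_layer(page_texts: List[str], min_chars: int = 20) -> List[int]:
    """Return 1-based page numbers whose embedded text is shorter than min_chars.

    min_chars is an arbitrary illustrative threshold, not a Decipher setting.
    """
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]


if __name__ == "__main__":
    # Hypothetical per-page text, as extract_page_texts() might return it.
    sample = ["Invoice 1234\nTotal due: 500.00 EUR", "", "   \n"]
    print(pages_without_text_layer(sample))  # prints [2, 3]
```

Note that this only works at page level: a page where most text is vector data but one table is an embedded image would still count as having a text layer, so a visual check of the problem pages is still worthwhile.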

If Decipher extracts this and considers it to be the most important data, it will skip the OCR stage (to maximise processing speed). So when you restarted the batch at the OCR stage, it 'dropped' the vector data and went with a full OCR read, which is also why that run took longer.

If you have single-page documents, you can convert them to JPG (or another image format); they will then no longer contain the vector data.

That being said, there's an update in 2.2 that manages this in a way that delivers the best of both: Decipher will extract the vector data and will always read the non-vector areas with OCR, without re-reading the vector areas.
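To make the three runs easier to compare, here is a toy model in Python. It is only a sketch of the behaviour as described in this post, not Decipher code: `Region`, `plan_reads` and the mode names are hypothetical, and treating "any vector text in the document" as the trigger for skipping OCR is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Region:
    """A hypothetical area of a document, e.g. a header or a table."""
    name: str
    has_vector_text: bool


def plan_reads(regions: List[Region], mode: str) -> Dict[str, str]:
    """Map each region name to how it would be read in this toy model.

    mode (all names are illustrative):
      "first_pass"       - vector text found, so OCR is skipped entirely
      "restart_from_ocr" - vector data is dropped, everything is OCRed
      "hybrid"           - the 2.2 behaviour as described: vector text is
                           kept and only non-vector regions are OCRed
    """
    if mode not in ("first_pass", "restart_from_ocr", "hybrid"):
        raise ValueError(f"unknown mode: {mode}")

    any_vector = any(r.has_vector_text for r in regions)
    if mode == "restart_from_ocr" or not any_vector:
        return {r.name: "ocr" for r in regions}

    # Regions without vector text are missed on the first pass, but are
    # picked up by OCR in the hybrid mode.
    fallback = "ocr" if mode == "hybrid" else "unread"
    return {
        r.name: ("vector" if r.has_vector_text else fallback) for r in regions
    }


if __name__ == "__main__":
    doc = [Region("header", True), Region("table_page_1", False)]
    print(plan_reads(doc, "first_pass"))
    # {'header': 'vector', 'table_page_1': 'unread'}
    print(plan_reads(doc, "restart_from_ocr"))
    # {'header': 'ocr', 'table_page_1': 'ocr'}
    print(plan_reads(doc, "hybrid"))
    # {'header': 'vector', 'table_page_1': 'ocr'}
```

In this model the unrecognized table corresponds to a region left "unread" on the first pass, which would match a table that is embedded as an image in an otherwise digital PDF.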

Let me know if that doesn't make complete sense.

Thanks

------------------------------
Ben Lyons
Senior Product Specialist - Decipher
Blue Prism
UK based
------------------------------



zdenek.kabatek · 29-06-22

Hi Ben,

That makes perfect sense, and thank you for the explanation.

What I do not understand, however, is why the capture results are so different. Decipher did not recognize the table on the first page when using vector data (I would have assumed this to be the much more successful approach), but it works well when using OCR. Can you shed some light on this?

When we convert a digital PDF (vector data) into JPG, wouldn't we risk losing the high reliability of the OCR results? If we have multi-page PDF documents with vector data (what I call digital PDFs), we can rasterize them into "image" PDFs. Again, what is the risk of getting worse results from OCRing an image?

Thanks.

Regards

Zdenek

------------------------------
Zdeněk Kabátek
Head of Professional Services
NEOOPS
http://www.neoops.com/
Europe/Prague
------------------------------

Ben.Lyons1 · 29-06-22

Hi Zdenek,

I'm sorry, I can't really offer much without seeing the document, I'm afraid. You could raise a support request via Expert Connect to have it investigated.

When converting a PDF, you should be able to maintain a good OCR result provided the document quality is still high. So a maximum-quality rasterized PDF should give results comparable with OCR on the digital PDF. However, vector data is potentially higher quality as it's embedded in the file, though its quality may be affected by the platform that encoded the file.
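As a rough guide to what "document quality" means when rasterizing, the output resolution (DPI) determines how many pixels each page and each character gets. The Python sketch below is plain arithmetic based on the PDF unit of 72 points per inch; the A4 page size, the 10 pt font and the DPI values are illustrative assumptions, not Decipher recommendations.

```python
from typing import Tuple

POINTS_PER_INCH = 72  # PDF user-space units per inch


def raster_size_px(width_pt: float, height_pt: float, dpi: int) -> Tuple[int, int]:
    """Pixel dimensions of a page of the given size in points, rendered at dpi."""
    return (
        round(width_pt / POINTS_PER_INCH * dpi),
        round(height_pt / POINTS_PER_INCH * dpi),
    )


def glyph_height_px(font_size_pt: float, dpi: int) -> float:
    """Approximate height in pixels of text set at font_size_pt, rendered at dpi."""
    return font_size_pt / POINTS_PER_INCH * dpi


if __name__ == "__main__":
    a4_width_pt, a4_height_pt = 595.276, 841.89  # A4 (210 mm x 297 mm) in points
    print(raster_size_px(a4_width_pt, a4_height_pt, 300))  # prints (2480, 3508)
    print(raster_size_px(a4_width_pt, a4_height_pt, 150))  # prints (1240, 1754)
    print(round(glyph_height_px(10, 300), 1))  # prints 41.7
```

Halving the DPI halves the pixels available per character, so small print in tables is usually the first thing to suffer when a digital PDF is rasterized at low resolution or with heavy JPG compression.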

Thanks

------------------------------
Ben Lyons
Senior Product Specialist - Decipher
Blue Prism
UK based
------------------------------

