VBO/Assets to extract data from scanned PDF invoice

Tejaskumar_Darji · ‎10-09-24

Hello team,

What are some of the best VBO/Assets you have used to extract data from scanned PDF invoices?

I'm not considering a full-scale IDP engine implementation here. Instead, I'm looking for a quick solution using any DX asset or open-source library that works reliably on both digital and scanned documents.

Brigianakopec · ‎10-09-24

Hi,

I found a few assets on the DX portal that can be used:

We also have some KB articles related to PDFs that might be useful:

How can I work with Adobe Acrobat PDF documents when using Blue Prism Enterprise?
How can I extract data from a PDF document which is contained in a browser window?

Refer to the 'Interfacing with PDF Documents' training course in the Blue Prism University for additional information on interacting with PDF data.

Brigiana Kopec Senior Product Support Engineer (Bilingual) – Americas

jktalgo · ‎11-09-24

We are using the built-in OCR reader in our invoice process.

We open the invoice PDF in an MS Edge window. We spy the MS Edge window in Region mode, and then we can use a Read stage with "Read Text with OCR".

Denis__Dennehy · ‎12-09-24

In my experience, the solution depends very much on the PDF. Are we talking about:
1. well-structured PDF forms with good accessibility functionality,
2. PDF documents that always have the same structure when copied to the clipboard or exported to a text or XML format, or
3. scanned documents?
For 1, you might be surprised how well the UIA interface within Blue Prism works with the document if it is made for accessibility. For 2, you might get away with an export plus an XML or text parsing solution. For 3, OCR technologies are the way to go, and if the layouts vary widely, LLMs might be a useful addition.
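
To illustrate case 2, here is a minimal Python sketch of parsing text that has already been exported from a consistently structured PDF. The field labels ("Invoice No", "Date", "Total Due"), the regular expressions, and the function name `parse_invoice_text` are all assumptions made for the example; real invoices will need patterns tuned to their own wording.

```python
import re
from decimal import Decimal

# Hypothetical patterns: adjust the labels to match your own invoices.
PATTERNS = {
    "invoice_number": re.compile(
        r"\bInvoice\s*(?:No\.?|Number|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE
    ),
    "invoice_date": re.compile(
        r"\bDate\s*[:\-]?\s*(\d{1,2}[./-]\d{1,2}[./-]\d{2,4})", re.IGNORECASE
    ),
    # \b keeps "Subtotal" from being mistaken for "Total".
    "total": re.compile(
        r"\bTotal\s*(?:Due|Amount)?\s*[:\-]?\s*[$€£]?\s*([\d,]+\.\d{2})", re.IGNORECASE
    ),
}


def parse_invoice_text(text):
    """Return a dict of fields found in exported invoice text (None if missing)."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    # Convert the amount to Decimal so it can be compared or summed safely.
    if fields["total"] is not None:
        fields["total"] = Decimal(fields["total"].replace(",", ""))
    return fields


sample = (
    "ACME Ltd\n"
    "Invoice No: INV-2024-0917\n"
    "Date: 12/09/2024\n"
    "Subtotal: 1,000.00\n"
    "Total Due: $1,234.50\n"
)
result = parse_invoice_text(sample)
```

This only works while the exported text keeps the same labels from one invoice to the next; as soon as layouts or wording vary, the patterns need maintaining per supplier.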

Tejaskumar_Darji · ‎12-09-24

How do you handle the zoom level? Also, do your invoices all share the same format, or do the layouts vary?

Tejaskumar_Darji · ‎12-09-24

Most of these are third-party paid services or do not work with scanned PDFs.

Neel1 · ‎12-09-24

Hello @Tejaskumar_Darji, we have used Python libraries to get content from scanned PDFs.
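
The post does not name the libraries, so the following is only a sketch of one common open-source combination, chosen here as an assumption: `pdf2image` (which needs Poppler installed) to render each page to an image, and `pytesseract` (which needs the Tesseract engine installed) to OCR it. The function names `default_rasterize`, `default_ocr`, and `ocr_scanned_pdf` are hypothetical.

```python
from typing import Callable, List


def default_rasterize(pdf_path: str, dpi: int = 300) -> list:
    """Render each PDF page to a PIL image (assumes pdf2image + Poppler)."""
    from pdf2image import convert_from_path  # imported lazily; optional dependency
    return convert_from_path(pdf_path, dpi=dpi)


def default_ocr(image) -> str:
    """Run Tesseract on one page image (assumes pytesseract + Tesseract)."""
    import pytesseract  # imported lazily; optional dependency
    return pytesseract.image_to_string(image, lang="eng")


def ocr_scanned_pdf(
    pdf_path: str,
    rasterize: Callable[[str], list] = default_rasterize,
    ocr: Callable[[object], str] = default_ocr,
) -> List[str]:
    """Return the OCR text of each page, in page order."""
    pages = rasterize(pdf_path)
    return [ocr(page).strip() for page in pages]


# Usage, assuming both libraries and their native tools are installed:
# page_texts = ocr_scanned_pdf("invoice.pdf")
```

Because the pages are rendered at a fixed DPI rather than read from a viewer window, the result should not depend on a zoom level. The returned text could then be passed to whatever field parsing the process already uses, or written to a file for a Blue Prism process to pick up.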

SS&C Blue Prism Community
