VBO/Assets to extract data from scanned PDF invoice

Hello team,

What are some of the best VBO/Assets you have used to extract data from scanned PDF invoices?

I am not considering a full-scale IDP engine implementation here; instead, I am looking for a quick solution using any DX assets or an open library that works reliably on both digital and scanned documents.

6 REPLIES

Hi,

I found a few assets on the DX portal that can be used:
We also have some KB articles related to PDFs that might be useful:

Refer to the 'Interfacing with PDF Documents' training course in the Blue Prism University for additional information on interacting with PDF data.

Brigiana Kopec Senior Product Support Engineer (Bilingual) – Americas

jktalgo
Level 3

We are using the built-in OCR reader in our invoice process.

We open the invoice PDF in an MS Edge window. The MS Edge window is spied in Region mode, and then we use a Read stage with "Read Text with OCR".

Denis__Dennehy
Level 15

In my experience, the solution depends very much on the PDF. Are we talking about 1. well-structured PDF forms with good accessibility functionality, 2. PDF documents that always have the same structure when copied to the clipboard or exported to text or XML, or 3. scanned documents?
For 1, you might be surprised how well the UIA interface within Blue Prism works with the document if it is made for accessibility. For 2, you might get away with an export followed by XML or text parsing. For 3, OCR technologies are the way to go, and if there is a large variance between documents, LLMs might be a useful addition.
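
To illustrate the text-parsing route for case 2, here is a minimal Python sketch. It assumes the PDF has already been exported to plain text, and the field labels ("Invoice No", "Date", "Total") are hypothetical examples rather than anything from a real invoice; the patterns would need adjusting to the actual layout.

```python
import re
from decimal import Decimal

# Hypothetical label patterns (assumptions for illustration only);
# real invoices will need patterns tuned to their own wording.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:Number|No\.?|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE
    ),
    "invoice_date": re.compile(
        r"(?:Invoice\s*)?Date\s*[:\-]?\s*(\d{1,2}[./-]\d{1,2}[./-]\d{2,4})",
        re.IGNORECASE,
    ),
    # \b keeps "Subtotal" from being mistaken for "Total".
    "total": re.compile(
        r"\bTotal\s*(?:Due|Amount)?\s*[:\-]?\s*[$€£]?\s*([\d,]+\.\d{2})",
        re.IGNORECASE,
    ),
}


def parse_invoice_text(text):
    """Return a dict of the fields found in exported invoice text.

    Fields that are not found are set to None. The date is left as a
    string because the day/month order cannot be known from the text alone.
    """
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    if fields["total"] is not None:
        # Strip thousands separators before converting to Decimal.
        fields["total"] = Decimal(fields["total"].replace(",", ""))
    return fields
```

This only holds up when the exported text keeps the same structure from one document to the next, which is the condition described for case 2.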

How do you handle the zoom level? Also, do your invoices all share the same format, or do the layouts vary?

Most of these are either third-party paid services or do not work with scanned PDFs.

Hello @Tejaskumar_Darji, we have used Python libraries to get content from scanned PDFs.
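
The post does not say which libraries were used, so the following is only a sketch of one possible combination: `pypdf` to read an existing text layer, with `pdf2image` and `pytesseract` as an OCR fallback for scanned pages. The library choice, the function names, and the 50-character threshold are all assumptions for illustration. Note that `pdf2image` requires Poppler and `pytesseract` requires the Tesseract binary to be installed separately.

```python
import re

# Assumed threshold: a page with fewer non-whitespace characters than this
# is treated as having no usable text layer (i.e. probably a scan).
MIN_CHARS_PER_PAGE = 50


def has_usable_text_layer(page_texts, min_chars=MIN_CHARS_PER_PAGE):
    """Return True only if every page has at least min_chars of real text."""
    if not page_texts:
        return False
    return all(
        len(re.sub(r"\s+", "", text or "")) >= min_chars for text in page_texts
    )


def extract_text_layer(pdf_path):
    """Read the embedded text of each page (empty for pure scans)."""
    from pypdf import PdfReader  # third-party; imported lazily

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


def ocr_pages(pdf_path, dpi=300):
    """Render each page to an image and run Tesseract OCR on it."""
    from pdf2image import convert_from_path  # third-party; needs Poppler
    import pytesseract  # third-party; needs the Tesseract binary

    images = convert_from_path(pdf_path, dpi=dpi)
    return [pytesseract.image_to_string(image) for image in images]


def read_invoice_text(pdf_path):
    """Use the text layer when it looks usable, otherwise OCR the whole file."""
    page_texts = extract_text_layer(pdf_path)
    if not has_usable_text_layer(page_texts):
        page_texts = ocr_pages(pdf_path)
    return "\n".join(page_texts)
```

Calling `read_invoice_text("invoice.pdf")` (a hypothetical file name) would return the text of either kind of document, which can then be parsed for the required fields. Rendering at a fixed DPI also sidesteps the zoom-level question that comes up with screen-based OCR.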