
PDF with images

HPB-RPA
Level 3

Hi,

I am a long-time user of the Blue Prism platform, currently using version 6.10.6, Blue Prism Enterprise license.

I have a problem with a PDF file that contains images. PDF file: PDF

Can you please tell me what the best option/solution is for getting data out of this PDF file?

Kind regards

4 REPLIES 4

Extracting data from a PDF is quite a broad question.
You can grab the text data in the document by opening the PDF in Adobe, selecting all, and copying to the clipboard, or by using Word for the same effect. However, if the PDF is basically an image or scan, you may have to use OCR. ABBYY FlexiCapture is one option; however, I can no longer find the associated course among the University courses.

@HPB-RPA Hi,

You can download this asset:

https://digitalexchange.blueprism.com/cardDetails?id=137364

Please install the DLL in the Blue Prism folder.

asilarow
MVP

I would use the iTextSharp library for this.

There are plenty of tutorials available online - see below:

https://psycodedeveloper.wordpress.com/2013/01/10/how-to-extract-images-from-pdf-files-using-c-and-itextsharp/

Andrzej Silarow

I see you've gotten some good answers already, so I almost didn't reply. But I clicked on your PDF link. It's in another language so I cannot tell if it is fake data or not. But be sure that it doesn't contain real, private information about a person or business. If it does, it'd be a good idea to remove it or redact private information in the screenshot. If it's fake data, then disregard this paragraph.

As Greg said, it is important to know whether the PDF is machine-readable or made up of scanned documents/images. It sounds like you're saying it is the latter. If for some reason it's not scanned, then try the asset Mohamad linked to. If it is scanned, then try something like Blue Prism Decipher, ABBYY, Azure's OCR (whatever it's called now since they keep changing the name), AI Builder, etc. You'll want to decide whether you're just extracting all the text, without the tool understanding what each part of the document is, or whether you want the tool to extract based on known fields that are expected to be in the document. This is not a simple thing to do, no matter what route you go. I'd love to say "oh yeah just do this and boom it's done", but it's most likely going to take paying for something. Any free solution you put in place is not going to be good enough.
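If you want a quick first guess at which case you're in before picking a tool, a rough sketch like the one below might help. It is plain Python (not a Blue Prism asset) and only scans the raw file bytes for font and image markers. The function name `classify_pdf_bytes` and the three labels are made up for illustration. It is a crude heuristic under loud assumptions: PDFs that pack their objects into compressed object streams can hide these markers, and a scan that already has an OCR text layer will look like a text PDF.

```python
import re

# Hypothetical helper for illustration only; not part of Blue Prism or any PDF library.
def classify_pdf_bytes(data: bytes) -> str:
    """Guess whether a PDF is machine-readable or scanned from its raw bytes.

    Assumption: font resources imply real text, and image XObjects
    without any fonts imply a scan. Compressed object streams can hide
    both markers, in which case the result is "unknown".
    """
    # "/Font" as a complete name (does not match "/FontDescriptor" etc.)
    has_font = re.search(rb"/Font\b", data) is not None
    # Image XObjects are declared with "/Subtype /Image"
    has_image = re.search(rb"/Subtype\s*/Image\b", data) is not None

    if has_font:
        return "likely-text"
    if has_image:
        return "likely-scanned"
    return "unknown"


def classify_pdf_file(path: str) -> str:
    """Read a file from disk and classify it with the heuristic above."""
    with open(path, "rb") as handle:
        return classify_pdf_bytes(handle.read())
```

You would call `classify_pdf_file` with the path to your own document. Treat "likely-scanned" or "unknown" as a hint to go the OCR route, and confirm by trying select-all and copy in a PDF viewer.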


Dave Morris, 3Ci at Southern Company