Tuesday
Hi,
I am a long-time user of the Blue Prism platform, currently using version 6.10.6, Blue Prism Enterprise license.
I have a problem with a PDF file that contains images. PDF file: PDF
Can you please tell me the best option for getting data out of this PDF file?
Kind regards
Tuesday
Extracting data from a PDF is quite a broad question.
You can grab the text in the document by opening the PDF in Adobe, selecting all, and copying to the clipboard, or by opening it in Word to the same effect. However, if the PDF is basically an image or a scan, you may have to use OCR. There is also the option of using ABBYY FlexiCapture, but I can no longer find the associated course among the University courses.
Wednesday
@HPB-RPA hi
You can download this asset
https://digitalexchange.blueprism.com/cardDetails?id=137364
Please install the DLL in the Blue Prism folder.
Wednesday
I would use the iTextSharp library for this.
There are plenty of tutorials available online - see below:
Wednesday
I see you've gotten some good answers already, so I almost didn't reply. But I clicked on your PDF link. It's in another language so I cannot tell if it is fake data or not. But be sure that it doesn't contain real, private information about a person or business. If it does, it'd be a good idea to remove it or redact private information in the screenshot. If it's fake data, then disregard this paragraph.
As Greg said, it is important to know whether the PDF is machine-readable or made up of scanned documents/images. It sounds like you're saying it is the latter (scanned documents/images). If for some reason it's not scanned, then try the asset Mohamad linked to. If it is scanned, then try something like Blue Prism Decipher, ABBYY, Azure's OCR (whatever it's called now since they keep changing the name), AI Builder, etc. You'll want to decide whether you're just extracting all the text without the tool understanding what each part of the document is, or whether you want the tool to extract based on known fields that are expected to be in the document. This is not a simple thing to do, no matter what route you go. I'd love to say "oh yeah just do this and boom it's done", but it's gonna take paying for something most likely. Any free solution you put in place is not going to be good enough.
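If it helps with that first decision (machine-readable vs. scanned), here is a rough Python sketch of how you could check it outside Blue Prism. It is only a heuristic. The function names and the 20-character threshold are my own assumptions, and the optional `extract_page_texts` helper assumes the third-party pypdf package is installed. The idea is that a page with almost no extractable text is probably an image and needs OCR.

```python
from typing import List


def has_text_layer(page_texts: List[str], min_chars: int = 20) -> List[bool]:
    """For each page, report whether the extracted text looks substantial.

    min_chars is an arbitrary, assumed threshold: pages with fewer
    non-whitespace characters are treated as having no usable text layer.
    """
    return [len("".join(text.split())) >= min_chars for text in page_texts]


def classify_pdf(page_texts: List[str], min_chars: int = 20) -> str:
    """Classify a document from the text extracted from each of its pages."""
    flags = has_text_layer(page_texts, min_chars)
    if not flags:
        return "empty"
    if all(flags):
        return "machine-readable"   # plain text extraction should be enough
    if not any(flags):
        return "scanned"            # every page will need OCR
    return "mixed"                  # some pages need OCR, some do not


def extract_page_texts(path: str) -> List[str]:
    """Hypothetical helper: read per-page text with pypdf (assumed installed)."""
    from pypdf import PdfReader  # imported here so the rest runs without pypdf

    reader = PdfReader(path)
    # extract_text() can return an empty result for image-only pages
    return [(page.extract_text() or "") for page in reader.pages]
```

You would call something like `classify_pdf(extract_page_texts("your_file.pdf"))` and only send the document to an OCR tool when the result is "scanned" or "mixed". It won't catch every case (for example, scans that already carry a poor hidden OCR layer), so treat the result as a hint.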