topic RE: pdf data extraction in Digital Exchange

pdf data extraction

aseelodeh — Sun, 04 Jul 2021 12:38:00 GMT

Hello, what is the best free way to extract data from a PDF?

------------------------------
aseel odeh
------------------------------

RE: pdf data extraction

Sai_Devendra_Ku — Mon, 05 Jul 2021 03:22:00 GMT

You can look into Decipher IDP, a Blue Prism product that helps extract data from PDFs.

https://portal.blueprism.com/product/related-products/blue-prism-decipher-idp-11

------------------------------
Sai Devendra Kumar Komma
------------------------------

RE: pdf data extraction

EmersonF — Mon, 05 Jul 2021 13:37:00 GMT

@aseelodeh, the best option is Decipher, but if it's something simple, like a string, you can copy the content to a data item and run a regex for the desired value. If you just need to check whether a word exists, use InStr().
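
As a rough illustration of the regex and word-check idea, here is a minimal Python sketch rather than an actual Blue Prism stage. The sample text, the field labels, the patterns, and the helper names are all made up for illustration, and ignoring case in the word check is a choice made here, not a statement about how InStr() behaves.

```python
import re

# Hypothetical text as it might look after copying from a PDF into a data item.
SAMPLE_TEXT = "Invoice No: INV-20391\nDate: 2021-07-04\nTotal due: 1,250.00 USD"


def extract_value(text, pattern):
    """Return the first capture group of pattern in text, or None if no match."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


def contains_word(text, word):
    """Substring check that ignores case (an assumption of this sketch)."""
    return word.lower() in text.lower()


# Pull individual values out of the unformatted text.
invoice_no = extract_value(SAMPLE_TEXT, r"Invoice No:\s*(\S+)")
total = extract_value(SAMPLE_TEXT, r"Total due:\s*([\d,]+\.\d{2})")
```

Each PDF layout would likely need its own patterns, so this approach suits cases where only a few known values are needed.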

------------------------------
Emerson Ferreira
Sr Business Analyst
Avanade Brasil
+55 (081) 98886-9544
If my answer helped you? Mark as useful!
------------------------------

RE: pdf data extraction

aseelodeh — Tue, 06 Jul 2021 10:03:00 GMT

Yes. As you know, copying data from a PDF extracts the text without formatting. Do you have a way other than regex to process the data in an Excel file? I need it for multiple different files.

------------------------------
aseel odeh
------------------------------

RE: pdf data extraction

ewilson — Tue, 06 Jul 2021 13:26:00 GMT

There are various ways to extract data from PDFs. The "best" way depends on your specific use case and the makeup of the PDFs that you'll be dealing with. Some examples have been mentioned above. Additional examples for extracting data include:

Use the PDF Toolkit from the DX to convert the PDF to a Word doc and then use the MS Word VBO to work with the contents.
Use the open source Xpdf Tools to convert a PDF to text and then use the Strings utility VBO to work with the text.
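
The Xpdf route could look roughly like the Python sketch below, outside of Blue Prism. It assumes the pdftotext executable from Xpdf Tools is on the PATH. The file paths, the helper names, and the "Label: value" line format are hypothetical, and the CSV step only stands in for whatever processing the Strings utility VBO would do.

```python
import csv
import io
import subprocess


def build_pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the argument list for Xpdf's pdftotext command-line tool."""
    command = ["pdftotext"]
    if keep_layout:
        command.append("-layout")  # ask pdftotext to keep the physical layout
    command += [pdf_path, txt_path]
    return command


def convert_pdf_to_text(pdf_path, txt_path):
    """Run pdftotext; raises CalledProcessError if the tool reports failure."""
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)


def key_value_lines_to_csv(text):
    """Turn 'Label: value' lines into CSV rows that Excel can open."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["field", "value"])
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if sep and value.strip():  # skip lines without a label or a value
            writer.writerow([label.strip(), value.strip()])
    return buffer.getvalue()
```

Writing CSV keeps the sketch dependency-free. Documents with a different layout would need their own parsing step in place of key_value_lines_to_csv.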

Cheers,

------------------------------
Eric Wilson
Director, Partner Integrations for Digital Exchange
Blue Prism
------------------------------

RE: pdf data extraction

aseelodeh — Tue, 06 Jul 2021 14:08:00 GMT

OK.
If the PDF is editable and its text can be copied, do you have a method for getting the data into Excel and processing it there?
Note that I have several different PDF formats.

------------------------------
aseel odeh
------------------------------

RE: pdf data extraction

ewilson — Tue, 06 Jul 2021 17:28:00 GMT

@aseelodeh,

The PDF Toolkit, mentioned above, uses Adobe's Document Cloud platform. There's an action in the VBO called ExportPDFToDocx. You could copy that action into a new action and then change one line in its code stage; I believe it would then export the input PDF as an XLSX file.

In the code stage, change the line that creates the export operation (the highlighted line from the original post is not reproduced here) to this:

ExportPDFOperation exportPdfOperation = ExportPDFOperation.CreateNew(ExportPDFTargetFormat.XLSX);

Cheers,

------------------------------
Eric Wilson
Director, Partner Integrations for Digital Exchange
Blue Prism
------------------------------

RE: pdf data extraction

aseelodeh — Tue, 06 Jul 2021 18:02:00 GMT

Does leaving "CredentialsFilePath" empty cause an error? If so, what does this field mean? I have no credentials for the PDF reader.

------------------------------
aseel odeh
------------------------------

RE: pdf data extraction

ewilson — Wed, 07 Jul 2021 12:38:00 GMT

@aseelodeh,

The PDF Toolkit requires an account with Adobe Document Cloud. You can sign up for a free developer account with them for testing.

Cheers,

------------------------------
Eric Wilson
Director, Partner Integrations for Digital Exchange
Blue Prism
------------------------------