The best way to read Values from a PDF

StephanSaar
Level 2
Hello, I want to use Blue Prism to read some values (invoice number, date, total amount) from a PDF invoice. We then want to use these values in several actions. What is the best and easiest way to solve this problem? I would be glad to hear your suggestions. Best regards, Stephan
7 REPLIES

BastiaanBezemer
Level 5
Hi Stephan, The first question is: what kind of PDF are you dealing with? A text PDF (from which you can select the text) or an image PDF (from which you cannot select the text)? For an image PDF, you need an OCR solution. For a text PDF, you can either copy and paste the text, or use a command-line tool such as pdftotext to get the data from the PDF and then parse it. You can also automate the Adobe Acrobat Reader interface, execute the 'save as text' option, and then read the details from that text file. Also refer to the Blue Prism manuals on dealing with PDFs. Bottom line: there are more solutions than challenges. Good luck!
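As a rough illustration of the command-line route, here is a minimal Python sketch that calls pdftotext and returns the text for later parsing. It assumes the Poppler/Xpdf `pdftotext` binary is installed and on the PATH; the function names and the file name `invoice.pdf` are made up for this example and are not part of Blue Prism.

```python
import subprocess
from pathlib import Path
from typing import List


def build_pdftotext_command(pdf_path: str) -> List[str]:
    """Build the pdftotext call: '-layout' keeps the page layout, '-' writes to stdout."""
    return ["pdftotext", "-layout", str(Path(pdf_path)), "-"]


def pdf_to_text(pdf_path: str) -> str:
    """Run pdftotext and return the extracted text (assumes pdftotext is on PATH)."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise if pdftotext reports an error
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical file name, for illustration only.
    print(pdf_to_text("invoice.pdf"))
```

From Blue Prism the same command could be started from a process and the resulting text loaded into a data item; the Python wrapper is only there to show the shape of the call.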

StephanSaar
Level 2
Hello Bastiaan, thank you for your help and the description of dealing with PDFs. It is a text-based, standardized PDF, and with Ctrl+A I get all the information from the PDF. But I need specific information from the PDF file, such as the invoice number, amount and so on. How can I get this information? What is the best and easiest way to do this in Blue Prism (which methods, data items)? Best regards, Stephan

BastiaanBezemer
Level 5
Hi Stephan, After reading out the PDF you need to get the data from the text. How to do that depends on how the text is structured. If it looks something like this: Invoice Number: 3423423423 then you can look for the text "Invoice Number: " with InStr in a calculation stage to get its position, and read out everything that follows. If it is structured differently, you need a different approach. RegEx is always a great thing to use when extracting data. If you search the forum, you'll come across some nice examples. Feel free to post an (anonymized) version of the text of your PDF, as it appears in your data item, if you need further hints and tricks.
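The "find the label, read what follows" idea can be sketched outside Blue Prism as well. The Python snippet below is only an illustration: the helper name `value_after_label` and the sample text are assumptions, and `str.find` is 0-based whereas InStr in Blue Prism is 1-based.

```python
def value_after_label(text: str, label: str) -> str:
    """Return the rest of the line that follows `label`, or '' if the label is absent."""
    start = text.find(label)           # 0-based position, -1 if not found
    if start == -1:
        return ""
    start += len(label)                # jump to the first character after the label
    end = text.find("\n", start)       # the value ends at the end of the line
    if end == -1:
        end = len(text)
    return text[start:end].strip()


# Made-up sample text in the "Label: value" shape described above.
sample = "Customer: ACME\nInvoice Number: 3423423423\nTotal: 100.00"
print(value_after_label(sample, "Invoice Number: "))
```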

Sumitkumar2
Level 2
Hi all, Just adding to what Bastiaan said. Take a small example: Invoice- 1234ATU  Name- Sumit. Use Ctrl+A to select all the text in the PDF and Ctrl+C to copy it. Then use the GetClipboard() function in a Calculation stage to read the clipboard contents into a data item. Use InStr() with "Invoice-" to check whether it is present in the text and to get its starting position, say 1 for the I of Invoice-. Then do the same for "Name-" with InStr() to get the index of the N of Name-. Now take the length of "Invoice-", which is 8, and add it to the starting position: 1 + 8 = 9, so you are on the blank space between Invoice- and 1234ATU. You are now at position 9 and you also have the index of 'N', so take everything from position 9 up to the index of N with string manipulation and trim it to get the value. Hope this helps.
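The same "value between two labels" logic, written as a small Python sketch for illustration. The helper name `value_between` is hypothetical, and Python positions are 0-based, so the numbers differ by one from the InStr walk-through.

```python
def value_between(text: str, start_label: str, end_label: str) -> str:
    """Return the trimmed text between two labels, or '' if either label is missing."""
    start = text.find(start_label)
    if start == -1:
        return ""
    start += len(start_label)             # first character after "Invoice-"
    end = text.find(end_label, start)     # position of the N in "Name-"
    if end == -1:
        return ""
    return text[start:end].strip()        # trim the surrounding blanks


print(value_between("Invoice- 1234ATU  Name- Sumit", "Invoice-", "Name-"))
```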

JimmyMcCrillis
Level 2
One thing about doing the Ctrl+A, Ctrl+C in a .pdf is that you have to make sure the entire .pdf has loaded. If your .pdf is large, you may need to add a brief wait before doing the copy to make sure that the whole document has loaded.
1. Launch the .pdf in Acrobat Reader
2. Small wait
3. Copy to clipboard
4. Get clipboard to data item
5. Split Lines to a collection
6. Loop or filter the collection to find the lines you want
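The last two steps (split into lines, then filter) could look roughly like this in Python. It is a sketch only: `find_lines` is a made-up helper, the sample text is invented, and in Blue Prism the equivalent would be a split-lines action followed by a loop or filter on the collection.

```python
from typing import List


def find_lines(clipboard_text: str, keyword: str) -> List[str]:
    """Split the copied text into lines and keep those containing `keyword` (case-insensitive)."""
    lines = [line.strip() for line in clipboard_text.splitlines()]
    return [line for line in lines if keyword.lower() in line.lower()]


# Invented clipboard content with Windows line endings.
copied = "ACME Ltd\r\nInvoice Number: 42\r\n\r\nTotal amount: 99.50"
print(find_lines(copied, "total"))
```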

MarcoSchulze1
Level 3
You can also use Apache PDFBox from the command line. This is fast and you don't need Adobe. https://pdfbox.apache.org/2.0/commandline.html

MayurGangrade
Level 4
Hello Stephan, The posts above explain how to get the data from a digital PDF into a Blue Prism data item. To get specific values out of that text, it is usually best to use RegEx. Regular expressions are pattern based and independent of position or word count, so they keep returning the right values even when the words move around, as long as the pattern still matches. Use the action "Extract Regex Values" available in the "Utility - Strings" object.
Example (Extract Regex Values): we want to extract Invoice Number, Name and Date from the text below:
"Kindly find the invoice detail below: Invoice No : INV32123 Name : Addy G Date : 10/04/2018 Hope this helps."
Create a collection with two text fields, "Name" and "Value". Add three rows in the Initial Value tab. The Name column values should be "Invoice", "Name" and "Date"; the Value column should be empty. The regex pattern looks like this:
Invoice No\s:\s(?<Invoice>\w*\s*)\s*Name\s:\s(?<Name>\w*\s\w*\s*)\s*Date\s:\s(?<Date>\d\d\/\d\d\/\d\d\d\d)
The group names in the regex (Invoice, Name, Date) must be the same as the values in the Name column of the collection. Use the same name/value collection for the output. Google "Regex cheat sheet" to get a better understanding of regular expressions. There are also a few websites where you can test, debug and create regular expressions online. Hope this helps. :)
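To try the named-group idea outside Blue Prism, here is a small Python sketch run against the example text from this post. Note two assumptions: Python writes named groups as `(?P<name>...)` rather than the .NET style `(?<name>...)`, and the pattern is slightly tightened (`\w+`, `\d{2}`) for the illustration, so it is not a copy of the pattern passed to "Extract Regex Values".

```python
import re
from typing import Dict

TEXT = (
    "Kindly find the invoice detail below: Invoice No : INV32123 "
    "Name : Addy G Date : 10/04/2018 Hope this helps."
)

# One named group per value; the group names play the role of the
# "Name" column in the name/value collection.
PATTERN = re.compile(
    r"Invoice No\s:\s(?P<Invoice>\w+)\s+"
    r"Name\s:\s(?P<Name>\w+\s\w+)\s+"
    r"Date\s:\s(?P<Date>\d{2}/\d{2}/\d{4})"
)


def extract_fields(text: str) -> Dict[str, str]:
    """Return a name -> value mapping, or an empty dict if the pattern does not match."""
    match = PATTERN.search(text)
    return match.groupdict() if match else {}


print(extract_fields(TEXT))
```

An empty result means the text did not follow the expected layout, which is worth checking for before the values are used in later actions.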