The best way to read Values from a PDF
07-06-18 03:44 PM
Hello,
I want to use Blue Prism to read some values (invoice number, date, total amount) from a PDF invoice. We then want to use these values in several actions.
What is the best and easiest way to solve this problem?
I would appreciate your suggestions.
Best regards
Stephan
7 REPLIES
07-06-18 07:14 PM
Hi Stephan,
The first question is: what kind of PDF are you dealing with?
A text PDF (from which you can select the text) or an image PDF (from which you cannot select the text)?
For an image PDF, you need an OCR solution.
For a text PDF, you can either copy and paste the text, or use a command-line tool such as pdftotext to extract the text from the PDF and then parse it.
You can also automate the Adobe Acrobat Reader interface, execute the 'save as text' option, and then read the details from that text file.
Also refer to the Blue Prism manuals on dealing with PDFs.
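For the command-line route, here is a minimal Python sketch of the idea. It assumes the pdftotext tool (from Poppler or Xpdf) is installed and on the PATH. The function names and file names are only illustrative, and in Blue Prism the equivalent would be starting the same command line from a process or object.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path):
    """Build the pdftotext call: keep the page layout, write UTF-8 text."""
    return ["pdftotext", "-layout", "-enc", "UTF-8", str(pdf_path), str(txt_path)]


def pdf_to_text(pdf_path, txt_path):
    """Run pdftotext and return the extracted text (needs pdftotext installed)."""
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
    return Path(txt_path).read_text(encoding="utf-8", errors="replace")


# Hypothetical usage, with made-up file names:
# text = pdf_to_text("invoice.pdf", "invoice.txt")
```

The -layout option tries to keep labels and values on the same line, which usually makes the later parsing step easier.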
Bottom line: there are more solutions than challenges.
Good luck!
11-06-18 02:27 PM
Hello Bastian,
thank you for your help and the description of dealing with PDFs.
It is a text-based, standardized PDF, and with Ctrl+A I get all the information from it.
But I need specific information from the PDF file, such as the invoice number, amount and so on.
How can I get this information? What is the best and easiest way to do this in Blue Prism (methods, data items)?
Best regards
Stephan
12-06-18 02:58 AM
Hi Stephan,
After reading out the PDF, you need to get the data from the text.
How to do that depends on how the text is structured.
If it is something like this:
Invoice Number: 3423423423
then you can look for the text "Invoice Number: " with InStr in a calculation stage to get its position, and read out everything that follows.
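To illustrate the label lookup, here is a small Python sketch of the same logic. It is not Blue Prism syntax: str.find plays the role of InStr (it is 0-based and returns -1 when the label is missing), and the function name and sample text are made up for the example.

```python
from typing import Optional


def value_after_label(text: str, label: str) -> Optional[str]:
    """Return the rest of the line that follows `label`, or None if it is absent."""
    start = text.find(label)          # like InStr, but 0-based and -1 if not found
    if start == -1:
        return None
    rest = text[start + len(label):]  # everything after the label
    return rest.split("\n", 1)[0].strip()  # stop at the end of that line


# Hypothetical text as it might arrive from the PDF:
sample = "Some header\nInvoice Number: 3423423423\nTotal: 10.00"
print(value_after_label(sample, "Invoice Number: "))  # prints 3423423423
```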
If it is structured differently, you need a different approach.
RegEx is always a great thing to use when extracting data. If you search the forum, you'll come across some nice examples.
Feel free to post an (anonymized) version of the text of your PDF, as it appears in your data item, if you need further hints & tricks.
26-06-18 04:01 PM
Hi all,
Just adding to what Bastiann said.
Take a small example: Invoice- 1234ATU Name- Sumit
Use Ctrl+A to select everything in the PDF and Ctrl+C to copy it. Then call the GetClipboard() function in a Calculation stage to store the clipboard contents in a data item.
Use the InStr() function to check whether "Invoice-" is present in the text and to get its starting position, say 1 for the I of Invoice. Then do the same for "Name-" with InStr() to get the index of the N of Name. Now take the length of "Invoice-", which is 8, and add it to the start: 1 + 8 = 9, which puts you on the blank space between "Invoice-" and "1234ATU".
You are now at position 9 and you also have the index of the N, so take everything from index 9 up to the index of the N and trim it to get the value.
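As a rough sketch of that arithmetic in Python (not Blue Prism expressions; Python indexes from 0 where InStr counts from 1, and the function name is only illustrative):

```python
def text_between(text, start_label, end_label):
    """Return the trimmed text between two labels, or None if either is missing."""
    start = text.find(start_label)
    if start == -1:
        return None
    value_start = start + len(start_label)    # first character after "Invoice-"
    end = text.find(end_label, value_start)   # position of the N of "Name-"
    if end == -1:
        return None
    return text[value_start:end].strip()      # trim the surrounding blanks


print(text_between("Invoice- 1234ATU Name- Sumit", "Invoice-", "Name-"))  # prints 1234ATU
```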
Hope this helps.
27-06-18 04:54 AM
One thing about doing Ctrl+A, Ctrl+C in a .pdf is that you have to make sure the entire .pdf has loaded. If your .pdf is large, you may need to add a brief wait before the copy so that the whole document is there.
Launch .pdf in Acrobat Reader
Small wait
Copy to clipboard
Get clipboard to data item
Split Lines to a collection
Loop or filter the collection to find the lines you want
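The last two steps (split into lines, then filter) could look roughly like this in Python. This is only a sketch: the keyword match is a simple case-insensitive "contains", and the sample text is invented.

```python
def find_lines(text, keyword):
    """Split the clipboard text into lines and keep those containing `keyword`."""
    lines = [line.strip() for line in text.splitlines()]  # handles \n and \r\n
    return [line for line in lines if keyword.lower() in line.lower()]


# Hypothetical clipboard contents:
clipboard_text = "ACME Ltd\r\nInvoice Number: 3423423423\r\nTotal amount: 99.50\r\n"
print(find_lines(clipboard_text, "total"))  # prints ['Total amount: 99.50']
```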
27-06-18 01:59 PM
You can also use Apache PDFBox from the command line. This is fast and you don't need Adobe.
https://pdfbox.apache.org/2.0/commandline.html
29-06-18 12:48 PM
Hello Stephan,
The posts above explain how to get the text of a digital PDF into a Blue Prism data item. To get specific values out of that text, RegEx is a good choice. Regular expressions are pattern-based and independent of position or word count, so they keep working when the position of the values in the text changes.
Use the action "Extract Regex Values", available in the "Utility - Strings" object.
Example (Extract Regex Values):
We want to extract the invoice number, name and date from the text below:
"Kindly find the invoice detail below:
Invoice No : INV32123 Name : Addy G Date : 10/04/2018
Hope this helps."
Create a collection with two fields, "Name" and "Value", of type text. Add three rows in the Initial Value tab. The Name column values should be "Invoice", "Name" and "Date"; the Value column should be empty.
The regex pattern looks like this:
Invoice No\s:\s(?<Invoice>\w*\s*)\s*Name\s:\s(?<Name>\w*\s\w*\s*)\s*Date\s:\s(?<Date>\d\d\/\d\d\/\d\d\d\d)
The group names in the regex (Invoice, Name and Date) must match the values in the Name column of the collection.
Use the same name/value collection for the output.
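If you want to try the pattern outside Blue Prism first, here is a small Python sketch. It assumes the three groups are named Invoice, Name and Date, as in the collection above. Python spells named groups (?P<name>...) rather than the .NET-style (?<name>...), and the function name is only illustrative.

```python
import re

# Same pattern as in the post, written with Python's named-group syntax.
PATTERN = re.compile(
    r"Invoice No\s:\s(?P<Invoice>\w*\s*)\s*"
    r"Name\s:\s(?P<Name>\w*\s\w*\s*)\s*"
    r"Date\s:\s(?P<Date>\d\d/\d\d/\d\d\d\d)"
)


def extract_invoice_fields(text):
    """Return a dict of the named groups (trimmed), or None if nothing matches."""
    match = PATTERN.search(text)
    if match is None:
        return None
    # The groups capture trailing blanks, so trim each value.
    return {name: value.strip() for name, value in match.groupdict().items()}


SAMPLE = (
    "Kindly find the invoice detail below:\n"
    "Invoice No : INV32123 Name : Addy G Date : 10/04/2018\n"
    "Hope this helps."
)
print(extract_invoice_fields(SAMPLE))
```

Note that the Name group expects exactly two words, so a pattern like this needs adjusting for names of a different shape.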
Search for "Regex cheat sheet" to get a better understanding of regular expressions. There are also a few websites where you can test, debug and create regular expressions online.
Hope this helps. 🙂
