PDF Data extraction

DavidTaliga
Level 2
Hi everyone,

I have recently been working on a project where I had to read data from different types of PDF documents. Is there a plan to create an object in BP that deals with PDF manipulation, or an update that enables better handling of PDF documents?

For now we have only two options for reading data from a PDF:
1. Copy the data using Global Send Keys.
2. Use Surface Automation to read certain regions of the PDF.

I think that is not enough, for these reasons:
1. Copy and paste (Global Send Keys): the data is pasted in a different structure, not in the top-to-bottom order of the PDF. If the document contains a large number of words, tables, etc., it is almost impossible to locate (calculate) all the needed data. Extracting the correct data takes too much effort without hard coding in calculation stages, if it is possible at all.
2. Surface Automation:
- Surface Automation is still not a 100% reliable approach. Customers usually try to avoid it, and it can crash the process very easily.
- Imagine we have many differently structured PDFs (different PDF templates containing the data). To process them, regions have to be captured for each PDF template separately. With 2-5 templates this can be done quite easily, but with 100 different PDFs it is better to do the work manually.

Thank you,
David
13 REPLIES

Denis__Dennehy
Level 15
Hello, I think the guide to interfacing with PDF documents mentions another option: using the Adobe API to export the PDF to other formats (Word or XML) where the structure might be easier to extract your data from. The API requires a paid version of Adobe to work.

Blue Prism is currently looking into incorporating an intelligent character recognition solution into the platform; this would be for extracting data from multi-format documents. The timescales and implementation of that solution are as yet unknown, but I would expect an announcement sometime in 2018.

Den

DavidTaliga
Level 2
Thank you 

FredrikAdland
Level 4
I strongly recommend using XPDF for PDFs with markable (selectable) text, it's amazing! In my opinion it's superior to iTextSharp and Adobe functionality (and far, far superior to select all & copy). In addition, XPDF is completely free (iTextSharp is not free for commercial use).

1. Download the Xpdf tools from https://www.xpdfreader.com/download.html (Download the Xpdf tools -> Windows 32/64-bit). Download them to a location, preferably a file server all developers have access to.

2. Use BO Utility - Environment -> 'Start Process' with these inputs:
Application input parameter: "C:\Windows\System32\cmd.exe"
Arguments input parameter: "/C start "&[Quotation mark]&[Quotation mark]&" "&[Quotation mark]&[XPDF file path]&"\pdftotext.exe"&[Quotation mark]&" "&[Method]&" "&[Quotation mark]&[PDF file path]&[Quotation mark]

3. Set up the data items:
[Quotation mark] = """ (a single double-quote character)
[Method] = "-layout" or "-table" (I recommend sending this as a parameter to the business object).
[XPDF file path] = Path to the XPDF bin64 folder.

4. Run the 'Start Process' action. A txt file with the PDF content should have been created at the same location as the PDF.

5. Use Utility - File Management -> 'Read All Text from File', and voila! You've got a great way to read PDF documents.

Bonus: If your PDF has foreign characters, change the line in the code stage within 'Read All Text from File' from 'Dim sr As New StreamReader(File_Name)' to 'Dim sr As New StreamReader(File_Name, Encoding.Default, True)'.
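For anyone who wants to try the same flow outside Blue Prism first, here is a minimal Python sketch of the steps above. It is not part of the original answer: the function names and the "cp1252" fallback are my assumptions (the fallback only approximates what 'Encoding.Default' means on a typical Western Windows machine), and only the UTF-8 and UTF-16 byte order marks are checked.

```python
import codecs
import subprocess
from pathlib import Path, PureWindowsPath


def build_pdftotext_command(xpdf_dir, method, pdf_path):
    """Build the argument list for one pdftotext call.

    method is the flag string from the post, e.g. "-layout" or "-table".
    Passing a list to subprocess means no manual quoting is needed.
    """
    exe = str(PureWindowsPath(xpdf_dir) / "pdftotext.exe")
    return [exe, *method.split(), str(pdf_path)]


def expected_text_path(pdf_path):
    """With no output file given, pdftotext writes <name>.txt next to the PDF."""
    return PureWindowsPath(pdf_path).with_suffix(".txt")


def read_text_detecting_bom(path, fallback="cp1252"):
    """Decode by byte order mark if present, else use the fallback encoding.

    The fallback value is an assumption standing in for Encoding.Default.
    """
    data = Path(path).read_bytes()
    boms = (
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    )
    for bom, encoding in boms:
        if data.startswith(bom):
            return data.decode(encoding)
    return data.decode(fallback)


def extract_text(xpdf_dir, method, pdf_path):
    """Run pdftotext (Windows only) and return the generated text."""
    subprocess.run(build_pdftotext_command(xpdf_dir, method, pdf_path), check=True)
    return read_text_detecting_bom(str(expected_text_path(pdf_path)))
```

A call would look like extract_text(r"C:\Tools\xpdf\bin64", "-layout", r"C:\Docs\invoice.pdf"), where both paths are made-up examples. Because the arguments are passed as a list, the [Quotation mark] workaround is not needed in this version.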

FredrikAdland
Level 4
Ehm... I kinda needed newlines for that post to be structured. When I pressed 'Preview' I got the error: 'The website encountered an unexpected error. Please try again later.', so I just went for it. Oh well

DavidTaliga
Level 2
Fredrik, thank you. Your detailed explanation worked 100% correctly.

SnehasishMondal
Level 2
Hi Fredrik,

Thanks for the solution. My code worked correctly once, but then it stopped working. Is there any particular reason for that?

Best Regards,
Snehasish Mondal

Anonymous
Not applicable
@Fredrik: Thanks, it is working fine. However, it saves the text file in ANSI format. Is it possible to change that to UTF-8?

Anonymous
Not applicable
Set [Method] = "-enc UTF-8 -table"; the output file is then saved in Unicode (UTF-8) format.
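To make the resulting command line concrete, here is a small Python sketch that mirrors the concatenation in the Arguments expression from Fredrik's post and shows what the string looks like with this [Method] value. The function name and both paths are made-up examples, not values from the thread.

```python
QUOTE = '"'  # plays the role of the [Quotation mark] data item


def start_process_arguments(xpdf_dir, method, pdf_path):
    """Concatenate the pieces in the same order as the Blue Prism expression."""
    return (
        "/C start " + QUOTE + QUOTE + " "
        + QUOTE + xpdf_dir + "\\pdftotext.exe" + QUOTE
        + " " + method + " "
        + QUOTE + pdf_path + QUOTE
    )


# Hypothetical paths, for illustration only.
arguments = start_process_arguments(
    r"C:\Tools\xpdf\bin64", "-enc UTF-8 -table", r"C:\Docs\invoice.pdf"
)
print(arguments)
# /C start "" "C:\Tools\xpdf\bin64\pdftotext.exe" -enc UTF-8 -table "C:\Docs\invoice.pdf"
```

The empty pair of quotes after start is the window title argument, which is why the expression begins with two [Quotation mark] items in a row. A file written with -enc UTF-8 should then also be read back as UTF-8.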

Can you provide an example of your arguments input parameter, just so I can understand it better?
Thanks in advance

------------------------------
Sumit Singh
Application Development Analyst
Accenture
Asia/Kolkata
------------------------------