
PDF extraction with checkbox field

Amruthasimplify
Level 5
Hi All,

Can you assist me in finding a solution to extract data from a PDF without relying on external applications, as my organization requires the use of only Blue Prism approved objects and native tools?
I have tried Global Send Keys, but it does not capture data reliably from the PDF, which contains text boxes, multi-line fields, and checkboxes. The fields may also be rearranged in the future, so extracting data by field position is not suitable. The PDF can have more than 3 pages.
Is there any alternative method that allows for capturing data, including checkbox values, in a more efficient manner?

A sample of the fields is shown below.

[Image: 20323.png]

Thanks in advance.

------------------------------
Amrutha Sivarajan
------------------------------

PvD_SE
Level 12
Hi Amrutha,

Two suggestions that perhaps can help you:
  • I think there's an object for PDFs in DX
  • Last week someone with a similar challenge was advised to try and open the pdf in Word


------------------------------
Happy coding!
---------------
Paul
Sweden
------------------------------
(By all means, do not mark this as the best answer!)

Hi Paul

Can you please provide the DX link of the object?

Many thanks in advance



------------------------------
Manish Rawat
Project Manager
Mercer
New Delhi
------------------------------

Hi Manish,

I wrote '...I think...' implying I am not sure as we do not use any DX objects in our shop.

That said, My '...I know...' is based on earlier posts on this subject in this community, so some googling on your side will probably unearth clues as to where to find any such DX object.



------------------------------
Happy coding!
---------------
Paul
Sweden
------------------------------

SahilChankotra
Level 4

Hi Amrutha,

Last year I had a process where we had to extract some data from PDF files. I used an alternative approach: I converted the PDF files to Excel files and then read the cell values with the Excel utility.

You can also try this method.



------------------------------
Sahil Chankotra
------------------------------

Hi Amrutha,

Last year, I worked on an automation where I had to update and extract data from PDF forms. I used C# code and the iTextSharp DLL for this use case.

Please find below the details - 

Inputs - filePath(Text)

Outputs - outputText(Text), Success(Flag), Message(Text)

Code -

// Blue Prism code stage (C#).
// Input: filePath. Outputs: outputText, Success, Message.
Success = true;
Message = "";
outputText = "";
StringBuilder text = new StringBuilder();
PdfReader pdfReader = null;

try
{
    pdfReader = new PdfReader(filePath);
    var form = pdfReader.AcroFields;

    // Append every form field as "name----value;".
    foreach (var key in form.Fields.Keys)
    {
        var value = form.GetField(key);
        text.Append(key + "----" + value + ";");
    }
    outputText = text.ToString();
}
catch (Exception ex)
{
    Success = false;
    Message = ex.Message;
}
finally
{
    // Close the reader even if reading failed.
    if (pdfReader != null)
    {
        pdfReader.Close();
    }
}

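The details are returned in the outputText data item as key----value pairs separated by semicolons, as built by text.Append(key+"----"+value+";") in the code. Split the text on the ; character to get one entry per field, then split each entry on ---- to separate the field name from its value.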
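If you would rather do the split in code than with a split-text action, here is a minimal sketch in plain C#. The class name FieldParser, the helper IsChecked, and the sample field names are made up for illustration and are not part of iTextSharp or Blue Prism. It assumes that no field name or value contains ";" or "----", and that an unchecked checkbox comes back as an empty value or "Off"; please verify both against your own form.

```csharp
using System;
using System.Collections.Generic;

public static class FieldParser
{
    // Turns "key----value;key----value;" into a name -> value dictionary.
    // Assumption: neither names nor values contain ';' or "----".
    public static Dictionary<string, string> Parse(string outputText)
    {
        var result = new Dictionary<string, string>();
        if (string.IsNullOrEmpty(outputText))
        {
            return result;
        }

        var pairs = outputText.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (var pair in pairs)
        {
            int idx = pair.IndexOf("----", StringComparison.Ordinal);
            if (idx < 0)
            {
                continue; // skip entries that do not have the separator
            }
            result[pair.Substring(0, idx)] = pair.Substring(idx + 4);
        }
        return result;
    }

    // Assumption: an unchecked checkbox is reported as "" or "Off";
    // any other value is treated as checked.
    public static bool IsChecked(string value)
    {
        return !string.IsNullOrEmpty(value)
            && !value.Equals("Off", StringComparison.OrdinalIgnoreCase);
    }

    public static void Main()
    {
        // Hypothetical sample in the same format the code stage produces.
        var fields = Parse("Name----Jane Doe;Agree----Yes;Newsletter----Off;");
        foreach (var kv in fields)
        {
            Console.WriteLine(kv.Key + " = " + kv.Value);
        }
        Console.WriteLine("Agree checked: " + IsChecked(fields["Agree"]));
    }
}
```

Inside a Blue Prism code stage you would keep only the body of Parse and write the result to a collection output instead of a dictionary. Because the lookup is by field name, it keeps working if the fields are moved around on the page.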

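Also, you need to add the following DLLs as external references in the object's Code Options (with the language set to C#), and add System.Text and iTextSharp.text.pdf to the namespace imports -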

  1. C:\Program Files\Blue Prism Limited\Blue Prism Automate\itextsharp.dll
  2. C:\Program Files\Blue Prism Limited\Blue Prism Automate\BouncyCastle.Crypto.dll

Please let me know if you need any additional information.



------------------------------
KirtiMaan Talwar
Consultant
Deloitte
------------------------------

Thank you Sahil.

I tried your approach, but unfortunately Excel reads some fields as images and does not return structured data. The PDF to Excel conversion gives me a mix of image and text values.



------------------------------
Amrutha Sivarajan
------------------------------

Thanks a lot for your detailed explanation.  I truly appreciate your effort.

I would like to try the method you have suggested. If you don't mind, could you share the official URLs for downloading the DLLs?

I had tried BP objects from the Digital Exchange and also worked on some Python code to read the PDF. Because the PDF is an editable form, these could read only the field labels and not the field values.



------------------------------
Amrutha Sivarajan
------------------------------

Thanks Paul for your suggestion.

I tried a few objects from DX and also tried converting the PDF into Word and Excel. None of them could extract the data: because the PDF is an editable form, the information is read either as an image or as blank values.



------------------------------
Amrutha Sivarajan
------------------------------

Hi @Amrutha Sivarajan ,

Did you try opening the PDF file in Chrome or another browser? Opening the file in a browser sometimes makes it possible to spy the relevant elements, so you can then try reading the checkbox values.



------------------------------
Manpreet Kaur
Manager
Deloitte
*If you find this post helpful mark it as Best Answer
------------------------------