
Extract text from PDF and update the same in Word

rokkam_saiteja
Level 4

Hi 

We have a scenario with multiple PDFs containing structured data. The bot needs to open each PDF, extract the text, and update it in the corresponding Word document (there are multiple documents).

Is there an alternative way to achieve this?

Note: We have tried the copy and set clipboard logic, but it is not stable. We also don't have access to import the PDF management utility from the Digital Exchange.

Thanks 

5 REPLIES

Hi @rokkam_saiteja ,

You can try opening the PDF file using the Word VBO, if you have it, and then see whether one of its actions can extract your text. Many times this works as a workaround for PDF files when you don't want clipboard actions or external DLL/IDP-based solutions.

---------------------------------------------------------------------------------------------------------------------------------------
Hope this helps you out and if so, please mark the current thread as the 'Answer', so others can refer to the same for reference in future.
Regards,
Devneet Mohanty,
SS&C Blueprism Community MVP 2024,
Automation Architect,
Wonderbotz India Pvt. Ltd.

Thanks.

While converting the PDF to Word using the MS Word VBO, some of the data got greyed out, so we are not able to extract the whole data. Please find the flow below.

rokkam_saiteja_0-1720076587952.png

The highlighted text is not being read.

rokkam_saiteja_2-1720076870984.png
rokkam_saiteja_3-1720077012647.png

Is there any way to extract the text from the PDF?

Thanks in advance.

That is the tricky part, since this approach relies entirely on how the PDF file was rendered in the first place. Often, the way the PDF was generated is not compatible with how Word automatically decodes it.

Do you have Adobe Acrobat Pro? If yes, it has an export feature that can be used to generate the Word file. If not, my suggestion would be to go for the Adobe Document Services API, which unfortunately requires a paid license, but our client currently uses it because of its reliability.

Otherwise, I would suggest exploring the Python route. But before I suggest anything there, I want to know whether using Python is even possible in your environment.
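
If Python does turn out to be available, the extraction step could look roughly like the sketch below. This is only an illustration under assumptions not stated in the thread: that the third-party pypdf package can be installed on the machine, and that the PDFs contain a real text layer (scanned images would need OCR instead). The function names extract_pdf_text and join_pages are hypothetical.

```python
from pathlib import Path


def join_pages(page_texts):
    """Join per-page text into one string, skipping empty or missing pages."""
    cleaned = (t.strip() for t in page_texts if t and t.strip())
    return "\n\n".join(cleaned)


def extract_pdf_text(pdf_path):
    """Take a PDF file path as input and return its text as output.

    Assumes the third-party 'pypdf' package is installed (pip install pypdf).
    The import is kept inside the function so the helper above can be used
    and tested even where pypdf is not available.
    """
    from pypdf import PdfReader

    reader = PdfReader(str(Path(pdf_path)))
    # extract_text() returns an empty string for pages without a text layer.
    return join_pages(page.extract_text() for page in reader.pages)


# Example call (the path is a placeholder, not a real file):
# text = extract_pdf_text(r"C:\Input\sample.pdf")
```

A script like this could be called from the automation with the file path as an argument, which avoids both the clipboard and the PDF-to-Word conversion step.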

Regards,
Devneet Mohanty,
SS&C Blueprism Community MVP 2024,
Automation Architect,
Wonderbotz India Pvt. Ltd.

Thanks so much for your help, @devneetmohanty07.

Sorry, we don't have Adobe Acrobat Pro. Could we build our own object that opens a PDF file and extracts the text using PDFSharp.dll? On my VM there is one PDF utility, but it only has actions for merging and splitting files. We don't have permission to update it to a new version, but we can create actions.

Please find the attachments below.

rokkam_saiteja_0-1720160688381.png

rokkam_saiteja_1-1720160706918.png

If yes, please suggest the code to extract the text, taking the file path as input and returning the text as output.

That way we can extract the text without converting the PDF to Word.

Note: I have a small doubt: would we need to update PDFSharp.dll regularly (for example, a weekly check)?

Thanks.
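
For the batch scenario described at the top of the thread (several PDFs, each with a corresponding Word document), the pairing and update step might be sketched as follows. This is Python rather than PDFSharp code, and it rests on assumptions that are not in the thread: that each Word document has the same file name as its PDF apart from the extension, that the extracted text should simply be appended as new paragraphs, and that the third-party python-docx package is installed. The names pair_pdfs_with_docs and append_text_to_docx are hypothetical.

```python
from pathlib import Path


def pair_pdfs_with_docs(pdf_dir, doc_dir):
    """Return (pdf_path, docx_path) pairs whose file names match.

    Matching ignores the extension and letter case of the name;
    PDFs without a matching Word document are skipped.
    """
    docs = {p.stem.lower(): p for p in Path(doc_dir).glob("*.docx")}
    pairs = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        doc = docs.get(pdf.stem.lower())
        if doc is not None:
            pairs.append((pdf, doc))
    return pairs


def append_text_to_docx(docx_path, text):
    """Append each blank-line-separated block of text as a new paragraph.

    Assumes the third-party 'python-docx' package (pip install python-docx).
    """
    from docx import Document

    document = Document(str(docx_path))
    for block in text.split("\n\n"):
        if block.strip():
            document.add_paragraph(block.strip())
    document.save(str(docx_path))


# Example loop; 'extract_text' stands for any function that takes a PDF path
# and returns its text, and the folder paths are placeholders:
# for pdf, doc in pair_pdfs_with_docs(r"C:\Input\PDF", r"C:\Input\Word"):
#     append_text_to_docx(doc, extract_text(pdf))
```

If the text has to go into specific fields or tables of the Word document rather than the end, the append step would need to be replaced with logic that targets those locations.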


I would recommend using iTextSharp instead - it is much more stable and easy to implement.

Andrzej Silarow