Extract text from PDF and update the same in Word

rokkam_saiteja · ‎03-07-24

Hi

We have a scenario with multiple PDFs (structured data). The bot needs to open each PDF, extract the text, and update it in the corresponding Word document (there are multiple documents).

Is there any alternative solution to achieve this?

Note: We have tried copy and set clipboard logic, but it is not stable. We don't have access to import the PDF management utility from the Digital Exchange.

Thanks

devneetmohanty07 · ‎03-07-24

Hi @rokkam_saiteja ,

You can try opening the PDF file using the Word VBO, if you have it, and then see whether any action there can extract your text. Many times this works as a workaround for PDF files when you don't want clipboard actions or external DLL/IDP-based solutions.

---------------------------------------------------------------------------------------------------------------------------------------
Hope this helps you out and if so, please mark the current thread as the 'Answer', so others can refer to the same for reference in future.
Regards,
Devneet Mohanty,
SS&C Blueprism Community MVP 2024,
Automation Architect,
Wonderbotz India Pvt. Ltd.

rokkam_saiteja · ‎04-07-24

Thanks.

While converting the PDF to Word using the MS Word VBO, some of the data got greyed out, so we are not able to extract all of the data. Please find the flow.

The highlighted text does not get read.

Is there any way to extract the text from the PDF?

Thanks in advance

devneetmohanty07 · ‎04-07-24

That is the tricky part, since this approach relies entirely on how the PDF file was rendered in the first place. Many times, the way the PDF was generated is not compatible with how Word automatically decodes it.

Do you have Adobe Acrobat Pro? If yes, it has an export function that can be used to generate the Word file. If not, my suggestion would be to go for the Adobe Document Services API, which unfortunately requires a paid license, but our client uses it because of its reliability as of now.

Otherwise, I would suggest exploring the Python route. But before I suggest anything there, I want to know whether Python is even an option in your environment.
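
If Python does turn out to be an option, a minimal sketch could look like the one below. It is only an illustration under some assumptions: the third-party packages pypdf and python-docx are installed, each PDF maps to a Word document with the same file name, and "updating" the document means appending the extracted lines as paragraphs. The function names are hypothetical and not part of any Blue Prism asset.

```python
from pathlib import Path


def target_docx_for(pdf_path, word_dir):
    """Map a PDF to its Word document by file name (an assumed convention)."""
    return Path(word_dir) / (Path(pdf_path).stem + ".docx")


def extract_pdf_text(pdf_path):
    """Return the text of every page, joined by newlines."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(str(pdf_path))
    # extract_text() may return an empty string for scanned or image-only pages
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def append_text_to_docx(docx_path, text):
    """Append each extracted line as a paragraph (assumed meaning of 'update')."""
    from docx import Document  # third-party: pip install python-docx

    docx_path = Path(docx_path)
    # Open the existing document if present, otherwise start a new one
    doc = Document(str(docx_path)) if docx_path.exists() else Document()
    for line in text.splitlines():
        doc.add_paragraph(line)
    doc.save(str(docx_path))


def process_folder(pdf_dir, word_dir):
    """Extract text from every PDF in pdf_dir into its matching Word document."""
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        text = extract_pdf_text(pdf_path)
        append_text_to_docx(target_docx_for(pdf_path, word_dir), text)
```

You would call something like process_folder("C:/input/pdfs", "C:/output/word") from a script that the bot triggers (both paths are placeholders). Note that this only works for PDFs that carry a real text layer; scanned PDFs would need OCR instead.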


rokkam_saiteja · ‎05-07-24

Thanks so much for your help @devneetmohanty07

Sorry, we don't have Adobe Acrobat Pro. Could we build our own object that opens a PDF file and extracts the text using PDFSharp.dll? In my VM there is a PDF utility that has only merge and split file actions. We don't have permission to update it to a new version, but we can create actions.

Please find the attachment

If yes, please suggest the code to extract the text, taking the file path as input and returning the text as output.

That way we can extract the text without converting the PDF to Word.

Note: I have a small doubt: do we need to keep updating PDFSharp.dll regularly (for example, a weekly check)?

Thanks.

asilarow · ‎05-07-24

I would recommend using iTextSharp instead - it is much more stable and easy to implement.

Andrzej Silarow

faheemsd · ‎07-07-24

@rokkam_saiteja

I have created a custom object to read data from PDFs using Blue Prism.

Please download the object from the link and import it into Blue Prism along with the required itextsharp.dll.
Please try this object to read the PDF data.

Digital Exchange

Syed Faheem
RPA Tech Lead

devneetmohanty07 · ‎07-07-24

@rokkam_saiteja - Splitting and merging are quite easy to do via PDFSharp. However, extracting text is not something this library does easily, at least in the older versions that supported .NET Framework 2.0. The latest releases have a dependency on netstandard 7, which I have frequently seen cause compiler issues with Blue Prism code stages, even for other libraries like OpenXML. So the only way to do it would be to create a DLL in Visual Studio and call its functions from Blue Prism Object Studio, which is doable but would need testing.

I can see you have already been recommended iTextSharp here, which I agree is a much better and more trusted option. However, one thing I want to bring to your notice is that it is currently under the AGPL (Affero General Public License). In practice this means that if you use it as part of a closed-source commercial product or project, you need to purchase a commercial license; otherwise the AGPL requires you to release your own source code under the same terms. That said, I have seen many organizations using it for free and some who have paid for it. I can't say it is strictly enforced, but as per the licensing terms you should discuss this with your stakeholders and proceed accordingly, since internal policies may be affected, and at many of my clients it was restricted as well.

Also, if your organization is fine with paying for licenses, my suggestion would be to opt for Adobe Document Services, as it provides many utilities with much better accuracy.

