Wednesday
Hi
We have a scenario with multiple PDFs (structured data). The bot needs to open each PDF, extract the text, and update it in the corresponding Word document (there are multiple documents).
Is there any alternative solution to achieve this?
Note: We have tried copy and set-clipboard logic, but it is not stable. We also don't have access to import the PDF management utility from the Digital Exchange.
Thanks
Wednesday
Hi @rokkam_saiteja ,
You can try opening the PDF file using the Word VBO, if you have it, and then see whether any of its actions can extract your text. Many times this works as a workaround for PDF files when you don't want clipboard actions or external DLL/IDP-based solutions.
Thursday
Thanks.
While converting the PDF to Word using the MS Word VBO, some of the data got greyed out, so we are not able to extract all of the data. Please find the flow.
The highlighted text is not being read.
Is there any way to extract the text from the PDF?
Thanks in advance.
Thursday - last edited Thursday
That is the tricky part, since this approach relies entirely on how the PDF file was rendered in the first place. Many times, the way the PDF was generated is not compatible with how Word automatically decodes it.
Do you have Adobe Acrobat Pro? If yes, it has an export function that can be used to generate the Word file. If not, my suggestion would be the Adobe Document Services API, which unfortunately requires a paid license, but our client uses it because of its reliability as of now.
Otherwise, I would suggest exploring the Python route. But before I suggest anything there, I want to know whether that is even possible in your environment.
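If Python does turn out to be an option, here is a minimal sketch of the idea using only the standard library. It assumes simple PDFs whose page content streams are either uncompressed or Flate-compressed and whose text is stored as plain literal strings. The function names (`extract_text`, `extract_text_from_bytes`) are hypothetical, and this is not a full PDF parser: files with custom font encodings, hex strings, or object streams would need a dedicated library such as pypdf.

```python
import re
import zlib

# Body of each PDF stream object: everything between "stream" and "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# Text-showing operators: "(text) Tj" or "[(te) -20 (xt)] TJ".
TEXT_OP_RE = re.compile(
    rb"\((?:\\.|[^\\()])*\)\s*Tj|\[(?:\\.|[^\]\\])*\]\s*TJ", re.DOTALL
)
# A single literal string "( ... )", allowing backslash escapes inside.
STRING_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)", re.DOTALL)

# Only a few common escapes are handled (assumption: no octal escapes).
ESCAPES = {b"n": b"\n", b"r": b"\r", b"t": b"\t"}


def _unescape(raw: bytes) -> bytes:
    """Resolve backslash escapes such as \\( and \\) in a PDF literal string."""
    return re.sub(
        rb"\\(.)",
        lambda m: ESCAPES.get(m.group(1), m.group(1)),
        raw,
        flags=re.DOTALL,
    )


def extract_text_from_bytes(data: bytes) -> str:
    """Return one line of text per Tj/TJ operator found in the PDF bytes."""
    lines = []
    for match in STREAM_RE.finditer(data):
        body = match.group(1)
        try:
            # decompressobj ignores trailing bytes after the compressed data.
            body = zlib.decompressobj().decompress(body)
        except zlib.error:
            pass  # not Flate-compressed: treat the stream as raw content
        for op in TEXT_OP_RE.finditer(body):
            parts = [_unescape(s) for s in STRING_RE.findall(op.group(0))]
            lines.append(b"".join(parts).decode("latin-1"))
    return "\n".join(lines)


def extract_text(path: str) -> str:
    """File path in, extracted text out."""
    with open(path, "rb") as fh:
        return extract_text_from_bytes(fh.read())
```

You would call `extract_text` with the full path of one PDF and get a plain string back, which the bot could then write into the Word document. Because the layout of your PDFs is structured, each field should come back in a predictable position in that string, but that is worth verifying on a few sample files first.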
yesterday
Thanks so much for your help, @devneetmohanty07.
Sorry, we don't have Adobe Acrobat Pro. Could we build our own object that opens a PDF file and extracts the text using PDFSharp.dll? In my VM there is a PDF utility that has only merge and split actions. We don't have permission to update it to a new version, but we can create actions.
Please find the attachment.
If yes, please suggest the code to extract text, taking the file path as input and returning the text as output.
That way we can extract the text without converting the PDF to Word.
Note: I have a small doubt: do we need to keep updating PDFSharp.dll regularly (for example, with a weekly check)?
Thanks.
yesterday
I would recommend using iTextSharp instead. It is much more stable and easy to implement.