How to split PDF based on a specific format of string?

HongJooChoi · ‎10-03-23

Hi, Community.

Can anyone help us find out which VBO(s) and/or utilities can be used for the following scenario?

1) Open PDF document.

2) Find strings with a specific format throughout the document.

3) Split the PDF into multiple files according to the position of those strings and save them.

I looked around DX and found the PDF Toolkit Utility, but it requires an additional license, which is not an option.

The PDF Management utility provides only limited functions, i.e., it splits based on page number only.

Any advice would be much appreciated.

Best regards

------------------------------
HongJoo Choi
------------------------------

Mukeshh_k · ‎10-03-23

Hello HongJoo Choi,

Please refer to the PDF Management VBO on DX: https://digitalexchange.blueprism.com/dx/entry/3439/solution/pdf-management-2. Once you have found the string, you can set the page index at which to split the PDF and split accordingly. I have used this VBO for merging PDFs; it requires you to add PdfSharp.dll to the Automate folder (C:\Program Files\Blue Prism Limited\Blue Prism Automate) along with the other DLLs.

------------------------------
Kindly up vote this as "Best Answer" if it adds value or resolves your query in anyway possible, happy to help.

Regards,

Mukesh Kumar - Senior Automation Developer

NHS England, United Kingdom, GB
------------------------------

Regards,

Mukesh Kumar
#MVP

HongJooChoi · ‎10-03-23

Hi, @Mukesh Kumar

Thank you for the response.

I thought that it could split the document based on page number only.

Could you elaborate a little bit as to how to set the index upon finding the string?

Best regards.

------------------------------
HongJoo Choi
------------------------------

GopalBhaire · ‎12-03-23

Hello,

If your string is midway through a page of the PDF and you want to split it from there, then the above solution doesn't work. Actually, I think it is not possible to split it that way even with the PDF Toolkit, as it requires moving text from one page to another.

You can try a few solutions:

1) Open it in Word and see whether the text is converted and whether you are able to split it into multiple files based on the format. (This is very unreliable.)
2) Create custom code using the PDF library iText (paid library), which will certainly be able to meet the requirement.
3) Create a custom Python script to handle this process (there are a few libraries that can help with it).
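
As a rough illustration of the Python option, here is a minimal sketch. It assumes the third-party pypdf library (my choice here, no specific library was named), that splits happen on page boundaries only, and that every page whose extracted text matches the pattern starts a new output file. The function names and the output naming scheme are hypothetical.

```python
import re
from typing import List, Tuple


def find_start_pages(page_texts: List[str], pattern: str) -> List[int]:
    """Return 0-based indexes of pages whose text matches the pattern."""
    regex = re.compile(pattern)
    return [i for i, text in enumerate(page_texts) if regex.search(text)]


def plan_ranges(start_pages: List[int], page_count: int) -> List[Tuple[int, int]]:
    """Turn start pages into inclusive (first, last) page ranges."""
    if page_count == 0:
        return []
    starts = sorted(set(start_pages))
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any pages that precede the first match
    ends = [s - 1 for s in starts[1:]] + [page_count - 1]
    return list(zip(starts, ends))


def split_pdf(src_path: str, pattern: str, out_prefix: str) -> List[str]:
    """Write one PDF per planned range and return the output paths."""
    # Imported lazily so the planning helpers work without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    texts = [(page.extract_text() or "") for page in reader.pages]
    ranges = plan_ranges(find_start_pages(texts, pattern), len(texts))

    written = []
    for n, (first, last) in enumerate(ranges, start=1):
        writer = PdfWriter()
        for i in range(first, last + 1):
            writer.add_page(reader.pages[i])
        out_path = f"{out_prefix}_{n}.pdf"
        with open(out_path, "wb") as fh:
            writer.write(fh)
        written.append(out_path)
    return written
```

A call such as `split_pdf("input.pdf", r"INV-\d{4}", "part")` (made-up file name and pattern) would write `part_1.pdf`, `part_2.pdf` and so on. Scanned PDFs without a text layer will not produce any matches, and a string in the middle of a page still cannot be split at that exact position this way.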

Thanks

------------------------------
Gopal Bhaire
------------------------------

Mukeshh_k · ‎20-03-23

Hello HongJoo Choi, apologies, I couldn't keep track of your follow-up questions. The above DX object (https://digitalexchange.blueprism.com/dx/entry/3439/solution/pdf-management-2) can split a PDF based on page index. So we come back to the first part of the solution I suggested, which was to search the PDF content for a particular string and locate the page number and page index (PageNumber - 1). The DX asset uses PdfSharp.dll, which also has the capabilities and functions to do a string search; you would just need to be aware of the functions and how to use them. However, I have already built something to cater for this same requirement with iTextSharp.dll.

Recently we had a similar requirement and had to build a similar object that returns the page number as output for a specific string we were looking for in the PDF. The page index can then be calculated as Page Index = PageNumber - 1. I will attach the object along with the DLL (iTextSharp), which will help you find the page number and page index for a given search string in the PDF. After that, you can simply use the above-mentioned DX PDF Management object to split the PDF based on the page index.

Follow the approach below: Code - C#, Library - iTextSharp.dll, Attachments: Object Release and iTextSharp.dll

Code Options:

Let me know if you find any challenges implementing this, happy to help.

HongJooChoi · ‎21-03-23

Hi, @Mukesh Kumar

Thank you for the attachment.

When testing, I set <file Path> and <searchText> and left <extractionStrategy> at the default = 1.

I left <pageNumberDT> empty, as I have no idea what it means.

As a result, it seems to output only the first page, although there are three lines matching the search text.

Should I have set <pageNumberDT> in a certain way, or did I miss something?

Best regards

------------------------------
HongJoo Choi
------------------------------

Mukeshh_k · ‎21-03-23

Hi @HongJooChoi - <pageNumberDT> is a validation checker for the code and is meant to be left blank. The code returns the page numbers on which the string appears, not the number of occurrences of the string on a page. I wonder whether, in your case, the search string appears on 3 lines that are all on the same page, i.e., page number 2.
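
To make the difference between "pages containing the string" and "occurrences of the string" concrete, here is a minimal Python sketch. It is not the attached C# object; it assumes the text of each page has already been extracted into a list of strings, and the function names and sample page contents are made up for illustration.

```python
from typing import Dict, List


def pages_containing(page_texts: List[str], search_text: str) -> List[int]:
    """Return 1-based page numbers that contain search_text at least once."""
    return [n for n, text in enumerate(page_texts, start=1) if search_text in text]


def occurrences_per_page(page_texts: List[str], search_text: str) -> Dict[int, int]:
    """Map 1-based page number to the count of non-overlapping occurrences."""
    counts: Dict[int, int] = {}
    for n, text in enumerate(page_texts, start=1):
        hits = text.count(search_text)
        if hits:
            counts[n] = hits
    return counts


# Hypothetical three-page document: the string appears on three lines of page 2.
pages = ["intro", "Novelis.io\nNovelis.io\nNovelis.io", "appendix"]

page_numbers = pages_containing(pages, "Novelis.io")
page_indexes = [n - 1 for n in page_numbers]  # Page Index = PageNumber - 1
```

With this sample, three matching lines on one page still give a single page number, which is the behaviour described above; `occurrences_per_page` is what you would need if the count per page matters.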

I checked with a PDF (the attachment is only a sample PDF, not a guide), with search string = "Novelis.io" and extractionStrategy left at the default = 1.

Here's the output for that (it returns the searched text and the page numbers on which it occurs; these can be converted to page indexes (Page Number - 1) and passed as input to split the PDF):

#BPTechTips

HongJooChoi · ‎22-03-23

Hi, @Mukesh Kumar

Thank you for the clarification and your efforts.

I appreciate your code; it works in locating the pages that contain the string in question within the document and outputs the list of corresponding page indexes.

One thing is that the string may also appear in the middle of a page, in which case the split should be applied at the position of the string. The tricky part is that there can also be more than one instance of the string on the same page.
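
To show what I mean at the text level, here is a minimal Python sketch that cuts the extracted text of a page wherever the pattern starts, including several times on one page. It only splits text; producing real PDF files split in the middle of a page would additionally need cropping or re-rendering of the page, which this does not attempt. The function name and the example pattern are hypothetical.

```python
import re
from typing import List


def split_text_at_matches(text: str, pattern: str) -> List[str]:
    """Cut text so that every match of pattern starts a new segment."""
    starts = [m.start() for m in re.finditer(pattern, text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep whatever precedes the first match
    bounds = starts + [len(text)]
    # Pair consecutive boundaries and drop empty segments.
    return [text[a:b] for a, b in zip(bounds, bounds[1:]) if a != b]


# Hypothetical page text with two instances of the string on the same page.
page_text = "header\nDOC-001 alpha\nDOC-002 beta\n"
segments = split_text_at_matches(page_text, r"DOC-\d{3}")
```

Here the page would yield three segments: the header, the part starting at the first instance, and the part starting at the second instance.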

Best regards

------------------------------
HongJoo Choi
------------------------------

SS&C Blue Prism Community
