
Split PDF

Hello Bot Builders,

I want to split a PDF that contains multiple invoices. For example, a single 25-page PDF might hold 10 invoices. Some invoices have 1 page, while others have 2 or 3 pages. I want to split the PDF invoice by invoice.

Maybe we can split whenever the invoice number changes, or something like that, but my goal is to split the invoices correctly no matter how many pages each one has (not fixed logic that always cuts after 1 or 2 pages).

Looking for guidance or insights.

I'm open to suggestions for any other tool, utility, or VBO as well.

Thanks, much appreciated.


------------------------------
Thanks & Regards,
Tejaskumar Darji
------------------------------

david.l.morris
Level 15
Let me start by mentioning that what I describe below is with the assumption that you're trying to build this logic on your own. If you have the resources for it, I'd say using a commercial service for this purpose would be easiest, something that already knows how to extract invoices from a large PDF. I don't have any suggestions on that front for which product to use.

If these are scanned PDFs, then you'll have to use some kind of OCR to read the text. Are you already using an OCR solution? If not, I'll assume you're using AWS Textract or a tool that provides similar results. The way I handle this is very dependent on the content of the PDFs. So, unless you use a service that will split it up for you, you'll need to come up with some logic to identify each of the pages and group them together. For example, something I do is to determine if there are one or more sets of phrases in the first or last page of the group. If these invoices can come from any company and could contain any text, then this might be kind of difficult. Let us know if there is some kind of pattern to them, such as set phrases on the first page to look for.

The other thing you can do is to try extracting the page numbers. A lot of times, things like that (invoices etc.) will say 'Page 1 of 6' or just '1 of 6'. You could use regular expressions to extract all the instances of the word 'of' when it is surrounded by two numbers. Do a little string manipulation to determine if the current page is the last one such as '6 of 6' and then start a new group.
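As a rough illustration of the page-counter idea, here is a minimal Python sketch. It assumes the text of each page has already been extracted (by OCR or otherwise) into a list of strings, one per page. The function name `group_by_page_counter` and the exact pattern are my own choices for illustration, not part of any Blue Prism VBO.

```python
import re

# Matches "1 of 6" as well as "Page 1 of 6" (the word "Page" is not required).
PAGE_OF = re.compile(r"\b(\d+)\s+of\s+(\d+)\b", re.IGNORECASE)


def group_by_page_counter(page_texts):
    """Group 0-based page indexes into documents using 'X of Y' counters.

    A group is closed when a page reports X == Y (for example '6 of 6').
    Pages with no counter simply stay in the group that is currently open.
    """
    groups = []
    current = []
    for index, text in enumerate(page_texts):
        current.append(index)
        match = PAGE_OF.search(text)
        if match and int(match.group(1)) == int(match.group(2)):
            groups.append(current)
            current = []
    if current:
        # Leftover pages whose final counter was missing or unreadable.
        groups.append(current)
    return groups


pages = [
    "Invoice A ... Page 1 of 2",
    "... Page 2 of 2",
    "Invoice B ... 1 of 1",
    "Invoice C ... Page 1 of 3",
    "... Page 2 of 3",
    "... Page 3 of 3",
]
print(group_by_page_counter(pages))
```

Only the first match on each page is used, so a phrase such as '2 of 5 items' in the body text could trigger a false split. Restricting the search to the header or footer area of the page would probably make this more reliable.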

Another approach to consider is to read entity names or similar details off the pages (this might need some kind of NLP/NLU) and then include that in your page grouping logic.

------------------------------
Dave Morris
Cano Ai
Atlanta, GA
------------------------------

Dave Morris, 3Ci at Southern Company

Like the other reply said, if the PDF is a scanned PDF, you have to use an OCR tool to extract the text.

In my experience, I've worked with readable PDFs (ones where you can press CTRL+A to select the text). You can use any tool to convert such a PDF to text, and you will find page-break characters in the output.
After that, you have to find a word or something else that identifies the start page of an invoice. Then you will know the page numbers at which to split the PDF.
Then you can use a PDF splitter tool to split with a specific start/finish page.
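The steps above could look something like the following Python sketch. It assumes the PDF has already been converted to a single text string in which pages are separated by form feed characters (`\f`), which is what converters such as pdftotext typically emit. It also assumes every invoice's first page contains a known phrase. The function name `invoice_page_ranges` and the marker "Invoice Number" are placeholders for illustration only.

```python
def invoice_page_ranges(full_text, start_marker="Invoice Number"):
    """Return (start_page, end_page) tuples, 1-based and inclusive.

    Assumes pages are separated by form feed characters and that the first
    page of every invoice contains start_marker (case-insensitive).
    """
    pages = full_text.split("\f")
    # A trailing form feed leaves an empty last element; drop it.
    if pages and not pages[-1].strip():
        pages.pop()

    marker = start_marker.lower()
    starts = [
        number
        for number, text in enumerate(pages, start=1)
        if marker in text.lower()
    ]

    ranges = []
    for position, start in enumerate(starts):
        if position + 1 < len(starts):
            # The invoice ends on the page before the next invoice starts.
            end = starts[position + 1] - 1
        else:
            end = len(pages)
        ranges.append((start, end))
    return ranges


sample = (
    "Invoice Number 1001\nitems...\f"
    "continued items...\f"
    "Invoice Number 1002\f"
    "Invoice Number 1003\f"
    "more items\f"
    "totals\f"
)
print(invoice_page_ranges(sample))
```

Each tuple can then be passed as the start/finish page to whichever PDF splitter tool you use. Pages that appear before the first marker are not assigned to any invoice, so it is worth checking that the first range starts at page 1.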

I suggest using something like a command-line tool that a BP VBO can call to execute. You can also look for open-source options.

------------------------------
Pete Y
------------------------------

Hi,
As the others have mentioned, if it's a scanned doc you might have difficulty with it and would need to use OCR. If the doc is just an invoice PDF that someone has created in Adobe, then I would suggest having BP open your PDF in Word if you want to split documents. That way the document will be much more manageable and readable.

Hope this helps 🙂


------------------------------
Michael ONeil
Technical Lead developer
Everis Consultancy
Europe/London
------------------------------

ArunGandhi
Level 2
Hey, we have given detailed steps for solving invoice processing using Blue Prism here: https://nanonets.com/blog/invoice-processing-using-blueprism-rpa/


------------------------------
Arun Gandhi
------------------------------