Box type fields in PDF

BernardoCris012 · ‎04-06-24

Hi Community,

Does anyone have an approach for working with a PDF where the text field is a box-type field? Blue Prism is able to read the text; however, if you trim the value, all the spaces are removed and it becomes one word.

Parthiban_Viatris24 · ‎06-06-24

Try creating logic that splits the text on spaces and uses a loop to concatenate the characters. In parallel, check whether each character is upper or lower case to identify the starting character of the last name.
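For anyone who wants to prototype this idea outside Blue Prism, here is a minimal Python sketch. It assumes the extracted text looks like "J o h n D o e" (a made-up sample, one character per box separated by spaces), and the function name `split_boxed_name` is only illustrative.

```python
def split_boxed_name(raw: str) -> list[str]:
    """Rebuild words from box-field text such as 'J o h n D o e' (assumed format)."""
    words: list[str] = []
    for ch in raw.split():  # split on whitespace: one token per box
        if ch[:1].isupper() or not words:
            words.append(ch)   # an uppercase character starts a new word
        else:
            words[-1] += ch    # otherwise extend the current word
    return words


print(split_boxed_name("J o h n D o e"))  # ['John', 'Doe']
```

Note that this relies entirely on capitalisation: text that is all uppercase would be split into single letters, and text that is all lowercase would come back as one word.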

Parthiban A

devneetmohanty07 · ‎06-06-24

Hi @BernardoCris012 ,

Extracting text from a PDF file, especially from a box-type field, varies heavily with the extraction method you use: PDF extraction DLLs such as iTextSharp, OCR solutions such as Adobe Document Services, simply copying and pasting the value via the clipboard, or an IDP solution such as ABBYY.

Also, the algorithm used to create the PDF file will significantly alter the structure in which you can get this text out if you use normal extraction methods.

However, the data you are getting is straightforward enough that we can use some logic to get the First Name and Last Name, if that is your requirement. I can see a pattern: the First Name starts with a capital letter and continues until we encounter another capital letter, which marks the start of the Last Name. Hence, we can use a regular expression for this.

I built a sample workflow for getting these two data points out.

First, I use the Extract Regex Values action from the Utility - Strings business object. Here you pass the Input Text that you are getting out of the PDF, along with a Regex Pattern, which should be as follows:

^(?<first_name>[A-Z][a-z ]*)(?<last_name>[A-Z][a-z ]*)$
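To try the pattern above outside Blue Prism, a rough Python equivalent is sketched below. Two assumptions: the sample input "J o h n D o e" is made up, and the named groups are written as `(?P<name>...)` because Python's `re` module does not accept the .NET-style `(?<name>...)` syntax used above. The helper name `extract_named_values` is only illustrative.

```python
import re

# Same pattern as above, rewritten with Python's (?P<name>...) group syntax.
NAME_PATTERN = re.compile(
    r"^(?P<first_name>[A-Z][a-z ]*)(?P<last_name>[A-Z][a-z ]*)$"
)


def extract_named_values(input_text: str) -> dict[str, str]:
    """Return the named groups as a dict, or an empty dict if nothing matches."""
    match = NAME_PATTERN.match(input_text)
    return match.groupdict() if match else {}


sample = "J o h n D o e"  # hypothetical text read from the box-type field
print(extract_named_values(sample))
# {'first_name': 'J o h n ', 'last_name': 'D o e'}
```

The captured values still contain the spaces between the characters, which is why the later steps strip them.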

We also need to create a collection called Named Values with two Text fields, Name and Value. Initially we add two rows so that the 'Name' field holds the values first_name and last_name.

Next, we pass all these parameters to our action and store the output collection in the same Named Values collection.

When we execute this action, we get the extracted values for First Name and Last Name in the Named Values collection.

Then we use a Filter Collection action, which I have named Get First Name. Here I pass the Named Values collection along with the filter query "[Name] = 'first_name'" and store the result in the Filtered Collection collection.

When we execute this action, Filtered Collection should contain only the row with the 'first_name' value.

Now we can use a Calculation stage to replace the space characters with an empty string and store the result in the First Name data item.

Similarly, we use the Filter Collection action and a Calculation stage for the Last Name, which gives us the Last Name data item value.
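The Filter Collection and Calculation steps above can be sketched in Python as follows. This is only an illustration, not how Blue Prism implements it: the collection is modelled as a list of dictionaries, the values are made-up samples, and `get_clean_value` is a hypothetical helper name.

```python
# Named Values collection as it might look after the regex action (sample data).
named_values = [
    {"Name": "first_name", "Value": "J o h n "},
    {"Name": "last_name", "Value": "D o e"},
]


def get_clean_value(collection: list[dict], name: str) -> str:
    """Filter rows by Name, then remove the spaces from the first row's Value."""
    filtered = [row for row in collection if row["Name"] == name]
    return filtered[0]["Value"].replace(" ", "") if filtered else ""


first_name = get_clean_value(named_values, "first_name")
last_name = get_clean_value(named_values, "last_name")
print(first_name, last_name)  # John Doe
```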

---------------------------------------------------------------------------------------------------------------------------------------
Hope this helps you out and if so, please mark the current thread as the 'Answer', so others can refer to the same for reference in future.
Regards,
Devneet Mohanty,
SS&C Blueprism Community MVP 2024,
Automation Architect,
Wonderbotz India Pvt. Ltd.

CrissyRPA_Bernardo · ‎17-06-24

Hi, the problem is that whether the company name is all uppercase or all lowercase depends on the client. Since this is the field for their company name in the PDF file, there is no rule that every word starts with an uppercase letter.

BernardoCris012 · ‎17-06-24

Hi @devneetmohanty07, I like the idea of how we can get the first name and last name in that case, and I know I can use it in the future. However, the current challenge is that this is a company name, so there are no rules on how data is entered in the field. Sometimes the attachments are in all uppercase or all lowercase, and sometimes only the first letter of each word is in uppercase. We have a VBO action that can read the PDF details, but an empty box is dropped from the data instead of being returned as whitespace.
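To illustrate the casing issue, here is a quick Python check of the suggested pattern against a few made-up company names (the pattern is rewritten with Python's `(?P<name>...)` group syntax, and the sample strings are assumptions about how the extracted text might look).

```python
import re

NAME_PATTERN = re.compile(
    r"^(?P<first_name>[A-Z][a-z ]*)(?P<last_name>[A-Z][a-z ]*)$"
)

# Hypothetical extracted values, one character per box.
samples = ["A c m e C o r p", "A C M E C O R P", "a c m e c o r p"]

# True only when the text has exactly two capitalised words.
results = {text: bool(NAME_PATTERN.match(text)) for text in samples}
for text, matched in results.items():
    print(f"{text!r}: {matched}")
```

Only the first sample matches; the all-uppercase and all-lowercase values produce no match, so the word boundary cannot be recovered from casing alone.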

devneetmohanty07 · ‎17-06-24

Hi @BernardoCris012 ,

Yes, the approach definitely depends on whether this pattern is followed. As I mentioned in my post, this is inherently more of a limitation of the VBO that you are using to read the PDF file. Could you let us know which VBO or DLL references you are using for this use case (for example, iTextSharp, PDFSharp, Doctif, etc.)? If you provide that, I can look into it further, but I cannot guarantee results.

Also, my suggestion would be to discuss other solutions with your clients if possible, such as Adobe Document Services or an IDP solution. Boxes are vector shapes that many digital reading libraries cannot handle or identify, which seems to be what is happening in your case. However, most of these are licensed tools and would add costs.


SS&C Blue Prism Community
