Data Extraction from Text

ChakkravarthiPR
Level 3

Hi Everyone,

I have a data item (data type: Text) whose value is extracted from a PDF. After extraction, the text looks like this:

Maverick Sample (May 12, 2022, 13:29 CDT) Maverick Sample

Maverick

Sample

04/30/2022

1 2 3 4 5 6 7 8

4

111 AAA st 1111111111

London YY 78979

sample@email.com

4

4

05/12/2022

I need to extract the individual fields and save them to a data item as follows:

Signature: Maverick Sample (May 12, 2022 13:29 CDT) Maverick Sample

First Name: Maverick

Last Name: Sample

Date: 04/30/2022

Employee ID: 1 2 3 4 5 6 7 8

4

Address: 111 AAA st

Phone No: 1111111111

City: London

State: YY

Zip Code: 78979

Email: sample@email.com

4

4

Separation Date: 05/12/2022


The stray 4s that appear between the lines come from the PDF when the file is read as text. Any help would be highly appreciated.


@devneetmohanty07 FYI

Thanks!



------------------------------
Chakkravarthi PR
------------------------------
4 REPLIES

RushabhDedhia
Level 4

Hey @ChakkravarthiPR,

You can write custom code in a "Code" stage for this (C# or VB.NET).

I have used regex to split the string I was getting into capture groups.

The only requirement for the regex approach is that the pattern of the input string always stays the same.

You can use Regex101.com to build and test the regex pattern.
(Note: you can also strip out the unwanted characters that appear in your input string.)
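
As a rough illustration of the clean-up step mentioned in the note, here is a minimal Python sketch (not Blue Prism code) that drops blank lines and lines consisting only of a stray value. The function name `drop_stray_lines` and the assumption that the stray value is always a lone "4" are hypothetical; a genuine field whose whole value is "4" would also be removed, so treat this as a starting point only.

```python
def drop_stray_lines(text, stray_values=("4",)):
    """Return text without blank lines and without lines that are only a stray value.

    Assumption: the PDF artefact always shows up as a line containing just "4".
    """
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        # Skip empty lines and lines that are nothing but the stray marker.
        if not stripped or stripped in stray_values:
            continue
        kept.append(stripped)
    return "\n".join(kept)


raw = "Sample\n\n04/30/2022\n\n1 2 3 4 5 6 7 8\n\n4\n\n111 AAA st 1111111111"
cleaned = drop_stray_lines(raw)
print(cleaned)
```

The Employee ID line survives because it is compared as a whole line, not digit by digit.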

Regards



------------------------------
Rushabh Dedhia
Senior Consultant - Team Lead
WonderBotz LLC
Ahmedabad
+91 9428860307
------------------------------
Rushabh Dedhia Founder, Biznessology (https://www.linkedin.com/company/biznessology/) +91 9428860307

On the Digital Exchange (DX) there is a VBO with regex functionality that will save you from having to create a custom VBO for regex handling. I had started my own VBO and ended up replacing it with the one below.

https://digitalexchange.blueprism.com/dx/entry/3593/solution/avoregex

------------------------------
Ramón Requena López
RPA Developer
Magenta Telekom
------------------------------

AtyantSrivastav
Level 4
Use this.

Please note that it is customised for the text you provided, so any change in the text pattern will throw an exception.

Add System.IO in the code options of the main page.

------------------------------
Atyant Srivastava
Team lead
Personal
Asia/Kolkata
------------------------------

@Chakkravarthi PR,

You could also use Blue Prism's Utility - Strings VBO, which has a "Regex Replace" action that can do the trick.

As @Rushabh Dedhia pointed out, you should be sure that the PDF will always produce the data you are looking for in the same layout. If the data comes from a PDF form with proper validation, this is usually not a problem. If a minority of cases do not follow that format, you can always do a quick check (the "Test Regex Match" action) to see whether the regex pattern matches, and throw an exception if it doesn't.
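
The check-then-throw idea can be sketched outside Blue Prism as follows. This is a minimal Python illustration, not the "Test Regex Match" action itself; `LayoutError`, `require_match`, and the simple `DATE_LINE` pattern are hypothetical names chosen for the example.

```python
import re


class LayoutError(Exception):
    """Raised when the extracted text does not have the expected shape."""


def require_match(pattern, text):
    """Return the match object, or raise LayoutError if the pattern is not found."""
    match = re.search(pattern, text)
    if match is None:
        raise LayoutError("Extracted text does not match the expected pattern")
    return match


# Illustrative pattern only: a line that contains nothing but an MM/DD/YYYY date.
DATE_LINE = r"(?m)^\d{2}/\d{2}/\d{4}$"

match = require_match(DATE_LINE, "Sample\n04/30/2022\n1 2 3")
print(match.group(0))
```

Failing early like this keeps a malformed PDF from silently producing wrong field values further down the process.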

[Screenshots omitted: Before, After]
Search Pattern:
(.*)[\r\n]+(.*)[\r\n]+(.*)[\r\n]+(\d{2}/\d{2}/\d{4})[\r\n]+([\d\s]*?)[\r\n]+.*?[\r\n]+(.+)\s(\S+)[\r\n]+(.*)\s(.*)\s(\S+)[\r\n]+(.*\@.*\..*)[\r\n]+.*?[\r\n]+.*?[\r\n]+(\d{2}/\d{2}/\d{4})
Replacement Pattern:
Signature: $1
First Name: $2
Last Name: $3
Date: $4
Employee ID: $5
Address: $6
Phone No: $7
City: $8
State: $9
Zip Code: $10
Email: $11
Separation Date: $12
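
For anyone who wants to try the search pattern outside Blue Prism first, here is a minimal Python sketch that applies it to the sample text from the question and pairs the twelve capture groups with the labels from the replacement pattern. It assumes one field per line, as in the post; `extract_fields`, `SAMPLE`, and `LABELS` are names made up for this example, and Python's `re` module stands in for the .NET regex engine, so behaviour on other inputs may differ.

```python
import re

# Sample text from the question, one field per line (blank lines removed).
SAMPLE = "\n".join([
    "Maverick Sample (May 12, 2022, 13:29 CDT) Maverick Sample",
    "Maverick",
    "Sample",
    "04/30/2022",
    "1 2 3 4 5 6 7 8",
    "4",
    "111 AAA st 1111111111",
    "London YY 78979",
    "sample@email.com",
    "4",
    "4",
    "05/12/2022",
])

# The search pattern from the reply, split across lines for readability only.
PATTERN = re.compile(
    r"(.*)[\r\n]+(.*)[\r\n]+(.*)[\r\n]+"
    r"(\d{2}/\d{2}/\d{4})[\r\n]+"
    r"([\d\s]*?)[\r\n]+.*?[\r\n]+"
    r"(.+)\s(\S+)[\r\n]+"
    r"(.*)\s(.*)\s(\S+)[\r\n]+"
    r"(.*\@.*\..*)[\r\n]+.*?[\r\n]+.*?[\r\n]+"
    r"(\d{2}/\d{2}/\d{4})"
)

# Labels in the same order as $1..$12 in the replacement pattern.
LABELS = [
    "Signature", "First Name", "Last Name", "Date", "Employee ID",
    "Address", "Phone No", "City", "State", "Zip Code",
    "Email", "Separation Date",
]


def extract_fields(text):
    """Map each label to its capture group; raise if the layout differs."""
    match = PATTERN.search(text)
    if match is None:
        raise ValueError("Text does not match the expected layout")
    return dict(zip(LABELS, match.groups()))


fields = extract_fields(SAMPLE)
for label, value in fields.items():
    print(f"{label}: {value}")
```

The stray "4" lines are consumed by the ungrouped `.*?` parts of the pattern, which is why they do not appear in any field.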


------------------------------
Micheal Charron
Senior Manager
RBC
America/Toronto
------------------------------