Data Extraction from Text

ChakkravarthiPR · ‎19-05-22

Hi Everyone,

I have a data item (data type: Text) that holds text extracted from a PDF. After extraction, the text looks like this:

Maverick Sample (May 12, 2022, 13:29 CDT) Maverick Sample

Maverick

Sample

04/30/2022

1 2 3 4 5 6 7 8

4

111 AAA st 1111111111

London YY 78979

sample@email.com

4

05/12/2022

I need to extract the individual fields and save them to a data item as follows:

Signature: Maverick Sample (May 12, 2022 13:29 CDT) Maverick Sample

First Name: Maverick

Last Name: Sample

Date: 04/30/2022

Employee ID: 1 2 3 4 5 6 7 8

4

Address: 111 AAA st

Phone No: 1111111111

City: London

State: YY

Zip Code: 78979

Email: sample@email.com

4

Separation Date: 05/12/2022

The stray 4s that appear between the lines come from the PDF when the file is read as text. Any help would be highly appreciated.

@devneetmohanty07 FYI

Thanks!

------------------------------
Chakkravarthi PR
------------------------------

RushabhDedhia · ‎19-05-22

Hey @ChakkravarthiPR,

You can write custom code in a "Code" stage for this (C# or VB.NET).

I have used regex groups to split up the string I was getting in a similar case.

The only requirement for building the regex is that the pattern of the input string always stays the same.

You can use the website Regex101.com to build and test the regex pattern.
(Note: you can also strip out the unwanted characters that appear in your input string.)
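
As a rough illustration of the group-based approach, here is a minimal Python sketch rather than actual Code stage code. It assumes the text always arrives in the layout quoted in the question, with blank lines between fields. The names `RAW`, `PATTERN` and `extract_fields` and the group names are made up for this example; the same idea should carry over to `System.Text.RegularExpressions` in a C# or VB.NET Code stage.

```python
import re

# Sample text from the question; assumes blank lines separate the fields.
RAW = "\n\n".join([
    "Maverick Sample (May 12, 2022, 13:29 CDT) Maverick Sample",
    "Maverick",
    "Sample",
    "04/30/2022",
    "1 2 3 4 5 6 7 8",
    "4",
    "111 AAA st 1111111111",
    "London YY 78979",
    "sample@email.com",
    "4",
    "05/12/2022",
])

# Hypothetical pattern: each piece mirrors one line of the cleaned input.
PATTERN = re.compile(
    r"(?P<signature>.+)\n"
    r"(?P<first_name>.+)\n"
    r"(?P<last_name>.+)\n"
    r"(?P<date>\d{2}/\d{2}/\d{4})\n"
    r"(?P<employee_id>\d(?: \d)*)\n"
    r"\d\n"  # stray digit left over from the PDF, matched but not captured
    r"(?P<address>.+) (?P<phone>\d{10})\n"
    r"(?P<city>.+) (?P<state>\S+) (?P<zip_code>\d{5})\n"
    r"(?P<email>\S+@\S+)\n"
    r"\d\n"  # second stray digit
    r"(?P<separation_date>\d{2}/\d{2}/\d{4})"
)


def extract_fields(text):
    """Drop blank lines, then match the whole text against PATTERN."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    match = PATTERN.fullmatch("\n".join(lines))
    if match is None:
        raise ValueError("Text does not follow the expected layout")
    return match.groupdict()


fields = extract_fields(RAW)
for name, value in fields.items():
    print(f"{name}: {value}")
```

Cleaning the blank lines first keeps the pattern itself simple, and named groups make it obvious which capture feeds which data item.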

Regards

------------------------------
Rushabh Dedhia
Senior Consultant - Team Lead
WonderBotz LLC
Ahmedabad
+91 9428860307
------------------------------


RamónRequena_L1 · ‎19-05-22

The Digital Exchange (DX) has a VBO with regex functionality that will save you from having to create a custom VBO for regex handling. I had started building my own VBO and ended up replacing it with the one below.

https://digitalexchange.blueprism.com/dx/entry/3593/solution/avoregex

------------------------------
Ramón Requena López
RPA Developer
Magenta Telekom
------------------------------

AtyantSrivastav · ‎19-05-22

Use this.

Please note that it is customised to the text you provided; any change in the text pattern will throw an exception.

Add System.IO in the Code Options of the main page.

------------------------------
Atyant Srivastava
Team lead
Personal
Asia/Kolkata
------------------------------

MichealCharron · ‎19-05-22

@Chakkravarthi PR,

You could also use BP's Utility - Strings VBO, which has a "Regex Replace" action that can do the trick.

As @Rushabh Dedhia pointed out, you should be sure that the PDF will always produce the data you are looking for in the same format. If the data in the PDF comes from a form with proper validation, this is usually not a problem. If the data is not in that format for a minority of cases, you can always do a quick check ("Test Regex Match" action) to see whether the regex pattern matches and throw an exception if it doesn't.
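
The "check first, throw if it doesn't match" idea can be sketched outside Blue Prism as follows. This is only a Python illustration of the control flow, not the VBO action itself; `LAYOUT_CHECK`, `LayoutError` and `require_layout` are hypothetical names, and the check pattern is deliberately simplified to "two MM/DD/YYYY dates somewhere in the text".

```python
import re

# Simplified, hypothetical check: two MM/DD/YYYY dates with anything in between.
LAYOUT_CHECK = re.compile(r"\d{2}/\d{2}/\d{4}.*\d{2}/\d{2}/\d{4}", re.DOTALL)


class LayoutError(Exception):
    """Raised when the extracted PDF text does not have the expected shape."""


def require_layout(text):
    """Return the text unchanged if it passes the check, otherwise raise."""
    if LAYOUT_CHECK.search(text) is None:
        raise LayoutError("PDF text does not match the expected pattern")
    return text


good = "Maverick\nSample\n04/30/2022\n1 2 3 4 5 6 7 8\n05/12/2022"
bad = "Maverick\nSample"

require_layout(good)  # passes silently
try:
    require_layout(bad)
except LayoutError as exc:
    print(f"Rejected: {exc}")
```

In a process, the raised exception would correspond to routing the item to exception handling instead of writing half-extracted values to data items.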

Search Pattern:

(.*)[\r\n]+(.*)[\r\n]+(.*)[\r\n]+(\d{2}/\d{2}/\d{4})[\r\n]+([\d\s]*?)[\r\n]+.*?[\r\n]+(.+)\s(\S+)[\r\n]+(.*)\s(.*)\s(\S+)[\r\n]+(.*\@.*\..*)[\r\n]+.*?[\r\n]+.*?[\r\n]+(\d{2}/\d{2}/\d{4})

Replacement Pattern:

Signature: $1
First Name: $2
Last Name: $3
Date: $4
Employee ID: $5
Address: $6
Phone No: $7
City: $8
State: $9
Zip Code: $10
Email: $11
Separation Date: $12
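
To try the search pattern outside Blue Prism, here is a minimal Python sketch. It assumes the fields are separated by blank lines, as in the text quoted in the question, and it builds the labelled output from the capture groups instead of using a replacement string (in .NET replacement syntax `$1` refers to the first group, which is `match.group(1)` in Python). `SAMPLE`, `SEARCH`, `LABELS` and `label_fields` are names invented for this example.

```python
import re

# Sample text from the question; assumes a blank line between fields.
SAMPLE = "\n\n".join([
    "Maverick Sample (May 12, 2022, 13:29 CDT) Maverick Sample",
    "Maverick",
    "Sample",
    "04/30/2022",
    "1 2 3 4 5 6 7 8",
    "4",
    "111 AAA st 1111111111",
    "London YY 78979",
    "sample@email.com",
    "4",
    "05/12/2022",
])

# The search pattern from the post, split across lines only for readability.
SEARCH = re.compile(
    r"(.*)[\r\n]+(.*)[\r\n]+(.*)[\r\n]+(\d{2}/\d{2}/\d{4})[\r\n]+"
    r"([\d\s]*?)[\r\n]+.*?[\r\n]+(.+)\s(\S+)[\r\n]+"
    r"(.*)\s(.*)\s(\S+)[\r\n]+(.*\@.*\..*)[\r\n]+"
    r".*?[\r\n]+.*?[\r\n]+(\d{2}/\d{2}/\d{4})"
)

# One label per capture group, in the order of the replacement pattern.
LABELS = [
    "Signature", "First Name", "Last Name", "Date", "Employee ID", "Address",
    "Phone No", "City", "State", "Zip Code", "Email", "Separation Date",
]


def label_fields(text):
    """Pair each capture group with its label, one 'Label: value' per line."""
    match = SEARCH.search(text)
    if match is None:
        raise ValueError("Text does not match the search pattern")
    return "\n".join(
        f"{label}: {value}" for label, value in zip(LABELS, match.groups())
    )


result = label_fields(SAMPLE)
print(result)
```

Note that the two stray "4" lines are consumed by the uncaptured `.*?` parts of the pattern, so they never reach the output.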

------------------------------
Micheal Charron
Senior Manager
RBC
America/Toronto
------------------------------


SS&C Blue Prism Community
