Data Extraction from Text
19-05-22 10:18 AM
Hi Everyone,
I have a data item (data type Text) whose value is extracted from a PDF. After extraction, the text looks like this:
Maverick Sample (May 12, 2022, 13:29 CDT) Maverick Sample
Maverick
Sample
04/30/2022
1 2 3 4 5 6 7 8
4
111 AAA st 1111111111
London YY 78979
sample@email.com
4
4
05/12/2022
I need to extract the values and save them to a data item as follows:
Signature: Maverick Sample (May 12, 2022 13:29 CDT) Maverick Sample
First Name: Maverick
Last Name: Sample
Date: 04/30/2022
Employee ID: 1 2 3 4 5 6 7 8
4
Address: 111 AAA st
Phone No: 1111111111
City: London
State: YY
Zip Code: 78979
Email: sample@email.com
4
4
Separation Date: 05/12/2022
The three 4s that appear between the lines are produced when the PDF is read as text. Any help would be highly appreciated.
@devneetmohanty07 FYI
Thanks!
------------------------------
Chakkravarthi PR
------------------------------
19-05-22 12:31 PM
Hey @ChakkravarthiPR,
You can write custom code in a "Code" stage for this (C# or VB.NET).
I have used regex to capture groups from a string like the one you are getting.
The only requirement for building the regex is that the pattern of the input string always stays the same.
You can use the website Regex101.com to build the regex pattern.
(Note: you can also strip out the unwanted characters that appear in your input string.)
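As a rough illustration of the grouping idea, here is a minimal sketch in Python rather than in a Blue Prism Code stage. It assumes the stray PDF artifacts are always a lone digit on their own line and that exactly nine data lines remain once they are dropped. The helper names drop_noise_lines and parse_record are hypothetical and do not come from any Blue Prism VBO.

```python
import re

# Sample text copied from the question above.
RAW_TEXT = "\n".join([
    "Maverick Sample (May 12, 2022, 13:29 CDT) Maverick Sample",
    "Maverick",
    "Sample",
    "04/30/2022",
    "1 2 3 4 5 6 7 8",
    "4",
    "111 AAA st 1111111111",
    "London YY 78979",
    "sample@email.com",
    "4",
    "4",
    "05/12/2022",
])


def drop_noise_lines(text):
    """Return stripped, non-empty lines, skipping lines that are a single digit.

    Assumption: the PDF artifacts are always lone digits on their own line.
    """
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not re.fullmatch(r"\d", line)]


def parse_record(text):
    """Map the cleaned lines to labelled fields (hypothetical helper)."""
    lines = drop_noise_lines(text)
    if len(lines) != 9:
        raise ValueError(f"expected 9 data lines, got {len(lines)}")

    # Named groups split the two lines that hold more than one field.
    address = re.fullmatch(r"(?P<address>.+)\s(?P<phone>\d+)", lines[5])
    place = re.fullmatch(r"(?P<city>.+)\s(?P<state>\S+)\s(?P<zip>\d+)", lines[6])
    if address is None or place is None:
        raise ValueError("address or city line does not match the expected layout")

    return {
        "Signature": lines[0],
        "First Name": lines[1],
        "Last Name": lines[2],
        "Date": lines[3],
        "Employee ID": lines[4],
        "Address": address.group("address"),
        "Phone No": address.group("phone"),
        "City": place.group("city"),
        "State": place.group("state"),
        "Zip Code": place.group("zip"),
        "Email": lines[7],
        "Separation Date": lines[8],
    }


record = parse_record(RAW_TEXT)
for label, value in record.items():
    print(f"{label}: {value}")
```

Removing the noise lines first keeps each regex short, and the length check fails early if the PDF ever produces a different layout.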
Regards
------------------------------
Rushabh Dedhia
Senior Consultant - Team Lead
WonderBotz LLC
Ahmedabad
+91 9428860307
------------------------------
19-05-22 01:27 PM
https://digitalexchange.blueprism.com/dx/entry/3593/solution/avoregex
------------------------------
Ramón Requena López
RPA Developer
Magenta Telekom
------------------------------
19-05-22 01:46 PM
Please note that it is customised for the text you provided; any change in the text pattern will throw an exception.
Add System.IO under Code Options on the main page.
------------------------------
Atyant Srivastava
Team lead
Personal
Asia/Kolkata
------------------------------
19-05-22 02:01 PM
@Chakkravarthi PR,
You could also use BP's Utility - Strings VBO, which has a "Regex Replace" action that can do the trick.
As @Rushabh Dedhia pointed out, you should be sure that the PDF will always produce the data in the format you are expecting. If the data comes from a PDF form with proper validation, this is usually not a problem. If the data is not in that format in a minority of cases, you can always do a quick check ("Test Regex Match" action) to see whether the regex pattern matches and throw an exception if it doesn't.
Search Pattern:
(.*)[\r\n]+(.*)[\r\n]+(.*)[\r\n]+(\d{2}/\d{2}/\d{4})[\r\n]+([\d\s]*?)[\r\n]+.*?[\r\n]+(.+)\s(\S+)[\r\n]+(.*)\s(.*)\s(\S+)[\r\n]+(.*\@.*\..*)[\r\n]+.*?[\r\n]+.*?[\r\n]+(\d{2}/\d{2}/\d{4})
Replacement Pattern:
Signature: $1
First Name: $2
Last Name: $3
Date: $4
Employee ID: $5
Address: $6
Phone No: $7
City: $8
State: $9
Zip Code: $10
Email: $11
Separation Date: $12
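To try the search and replacement patterns outside Blue Prism, the following is a minimal Python sketch. It is a stand-in for the "Test Regex Match" and "Regex Replace" actions, not their implementation. Python writes group references as \g<n> instead of $n, and the function name label_fields is hypothetical.

```python
import re

# Search pattern from the post above, split across lines only for readability.
SEARCH_PATTERN = (
    r"(.*)[\r\n]+(.*)[\r\n]+(.*)[\r\n]+(\d{2}/\d{2}/\d{4})[\r\n]+([\d\s]*?)[\r\n]+"
    r".*?[\r\n]+(.+)\s(\S+)[\r\n]+(.*)\s(.*)\s(\S+)[\r\n]+(.*\@.*\..*)[\r\n]+"
    r".*?[\r\n]+.*?[\r\n]+(\d{2}/\d{2}/\d{4})"
)

LABELS = [
    "Signature", "First Name", "Last Name", "Date", "Employee ID", "Address",
    "Phone No", "City", "State", "Zip Code", "Email", "Separation Date",
]

# Equivalent of the "$1 ... $12" replacement pattern in Python's \g<n> syntax.
REPLACEMENT = "\n".join(
    f"{label}: \\g<{index}>" for index, label in enumerate(LABELS, start=1)
)

# Sample text copied from the question at the top of the thread.
SAMPLE = "\n".join([
    "Maverick Sample (May 12, 2022, 13:29 CDT) Maverick Sample",
    "Maverick",
    "Sample",
    "04/30/2022",
    "1 2 3 4 5 6 7 8",
    "4",
    "111 AAA st 1111111111",
    "London YY 78979",
    "sample@email.com",
    "4",
    "4",
    "05/12/2022",
])


def label_fields(text):
    """Check that the pattern matches first, then apply the replacement."""
    if re.search(SEARCH_PATTERN, text) is None:
        # Mirrors the advice to throw an exception when the layout differs.
        raise ValueError("input does not match the expected layout")
    return re.sub(SEARCH_PATTERN, REPLACEMENT, text, count=1)


result = label_fields(SAMPLE)
print(result)
```

One caveat: if the extracted text uses Windows (CRLF) line endings, "." can also consume the "\r" character, so the captured values may need trimming before they are stored.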
------------------------------
Micheal Charron
Senior Manager
RBC
America/Toronto
------------------------------
