19-05-22 10:18 AM
Hi Everyone,
I have a data item (data type Text) that holds text extracted from a PDF. After extraction, the text looks like this:
Maverick Sample (May 12, 2022, 13:29 CDT) Maverick Sample
Maverick
Sample
04/30/2022
1 2 3 4 5 6 7 8
4
111 AAA st 1111111111
London YY 78979
sample@email.com
4
4
05/12/2022
I need to extract and save it to a data item as follows:
Signature: Maverick Sample (May 12, 2022 13:29 CDT) Maverick Sample
First Name: Maverick
Last Name: Sample
Date: 04/30/2022
Employee ID: 1 2 3 4 5 6 7 8
4
Address: 111 AAA st
Phone No: 1111111111
City: London
State: YY
Zip Code: 78979
Email: sample@email.com
4
4
Separation Date: 05/12/2022
The three 4s that appear between the lines come from the PDF when the file is read as text. Any help would be highly appreciated.
@devneetmohanty07 FYI
Thanks!
19-05-22 12:31 PM
Hey @ChakkravarthiPR,
You can write custom code in a "Code" stage for this (C# or VB.NET).
I have used regex to split the string I was getting into groups.
The only requirement for the regex is that the pattern of the input string always remains the same.
You can use the website Regex101.com to build and test the regex pattern.
(Note: you can also strip out the unwanted characters that appear in your input string.)
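To illustrate the group idea outside Blue Prism, here is a minimal Python sketch (not Code stage code). It assumes the input always has the layout shown in the question; the group names and the `extract_fields` helper are made up for this example. The stray "4" lines are matched but not captured, which is one way of dropping the unwanted characters.

```python
import re

# Sample text as posted in the question.
SAMPLE = """Maverick Sample (May 12, 2022, 13:29 CDT) Maverick Sample
Maverick
Sample
04/30/2022
1 2 3 4 5 6 7 8
4
111 AAA st 1111111111
London YY 78979
sample@email.com
4
4
05/12/2022"""

# Hypothetical pattern: one named group per field, one line of input per row.
FIELDS = re.compile(
    r"""(?P<signature>.*)\n
        (?P<first_name>.*)\n
        (?P<last_name>.*)\n
        (?P<date>\d{2}/\d{2}/\d{4})\n
        (?P<employee_id>[\d ]+)\n
        .*\n                                    # stray line from the PDF
        (?P<address>.+)[ ](?P<phone>\d+)\n
        (?P<city>.+)[ ](?P<state>\S+)[ ](?P<zip_code>\d+)\n
        (?P<email>\S+@\S+\.\S+)\n
        .*\n.*\n                                # two more stray lines
        (?P<separation_date>\d{2}/\d{2}/\d{4})""",
    re.VERBOSE,
)


def extract_fields(text: str) -> dict:
    """Return the captured fields, or raise if the layout is not as expected."""
    # Normalise Windows line endings so no group picks up a trailing "\r".
    text = text.replace("\r\n", "\n")
    match = FIELDS.search(text)
    if match is None:
        raise ValueError("input does not follow the expected layout")
    return match.groupdict()


fields = extract_fields(SAMPLE)
print(fields["first_name"], fields["phone"], fields["zip_code"])
```

The same approach should carry over to a C# or VB.NET Code stage, since .NET regex also supports named groups, but the syntax details would need checking there.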
Regards
19-05-22 02:01 PM
@Chakkravarthi PR,
You could also use BP's Utility - Strings VBO, which has a "Regex Replace" action that can do the trick.
As @Rushabh Dedhia pointed out, you should be sure that the PDF will always produce the data in the format you are expecting. If the data is produced in a PDF from a form with proper validation, this usually is not a problem. If the data is not in that format in a minority of cases, you can always do a quick check ("Test Regex Match" action) to see whether the regex pattern matches, and throw an exception if it doesn't.
Regex Pattern:
(.*)[\r\n]+(.*)[\r\n]+(.*)[\r\n]+(\d{2}/\d{2}/\d{4})[\r\n]+([\d\s]*?)[\r\n]+.*?[\r\n]+(.+)\s(\S+)[\r\n]+(.*)\s(.*)\s(\S+)[\r\n]+(.*\@.*\..*)[\r\n]+.*?[\r\n]+.*?[\r\n]+(\d{2}/\d{2}/\d{4})
Replacement Pattern:
Signature: $1
First Name: $2
Last Name: $3
Date: $4
Employee ID: $5
Address: $6
Phone No: $7
City: $8
State: $9
Zip Code: $10
Email: $11
Separation Date: $12
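
If you want to try the pattern outside Blue Prism first, a rough Python sketch like the one below may help. It uses the regex exactly as posted above and fails loudly when the text does not match, similar in spirit to the "Test Regex Match" check. Python does not use the `$1` replacement syntax, so the sketch builds the output from the match groups instead; `LABELS` and `format_fields` are names invented for this example.

```python
import re

# Sample text as posted in the question.
SAMPLE = """Maverick Sample (May 12, 2022, 13:29 CDT) Maverick Sample
Maverick
Sample
04/30/2022
1 2 3 4 5 6 7 8
4
111 AAA st 1111111111
London YY 78979
sample@email.com
4
4
05/12/2022"""

# The regex from the post, split across lines only for readability.
PATTERN = re.compile(
    r"(.*)[\r\n]+(.*)[\r\n]+(.*)[\r\n]+(\d{2}/\d{2}/\d{4})[\r\n]+([\d\s]*?)[\r\n]+"
    r".*?[\r\n]+(.+)\s(\S+)[\r\n]+(.*)\s(.*)\s(\S+)[\r\n]+(.*\@.*\..*)[\r\n]+"
    r".*?[\r\n]+.*?[\r\n]+(\d{2}/\d{2}/\d{4})"
)

# One label per capture group, in the order of the replacement pattern.
LABELS = [
    "Signature", "First Name", "Last Name", "Date", "Employee ID", "Address",
    "Phone No", "City", "State", "Zip Code", "Email", "Separation Date",
]


def format_fields(text: str) -> str:
    """Return 'Label: value' lines, or raise if the pattern does not match."""
    # Assumption: normalising "\r\n" first keeps a stray "\r" out of the groups.
    text = text.replace("\r\n", "\n")
    match = PATTERN.search(text)
    if match is None:
        raise ValueError("text does not match the expected pattern")
    return "\n".join(
        f"{label}: {value}" for label, value in zip(LABELS, match.groups())
    )


result = format_fields(SAMPLE)
print(result)
```

Note that the replacement pattern drops the stray "4" lines rather than keeping them in the output.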