Split text based on certain characters

HARSHVERMA · ‎17-01-22

Hi all,
I have a text that contains several file names along with their extensions (pdf, docx, doc). I want to separate them into individual file names. How can I do that? For example:
Text: L1234ty.pdfL1244re.docxL1221ytr.doc
I need
L1234ty.pdf
L1244re.docx
L1221ytr.doc

The pattern of the file names can vary every time.

------------------------------
HARSH VERMA
------------------------------

HARSHVERMA · ‎18-01-22

Hi Devneet, Eric, Kishore, Many thanks to all of you for the help. Special thanks to Devneet for providing a real demonstration. I am testing all the steps.
The shortcoming in Kishore's solution can be rectified if I add another Calculation stage that replaces doc;x with docx again. Please let me know if you can think of any further shortcomings here.

Please refer below pic

------------------------------
HARSH VERMA
------------------------------


ewilson · ‎17-01-22

@HARSHVERMA,

First, you need to determine if there's a repeatable pattern that you can use. In your examples each of your file names starts with a capital "L". Will that be the case in production? Will pdf, doc, and docx be the only file extensions you'll be dealing with? Are you sure every file name in the original string will have an extension?

Cheers,

------------------------------
Eric Wilson
Director, Integrations and Enablement
Blue Prism Digital Exchange
------------------------------

devneetmohanty07 · ‎17-01-22

Hi Harsh,

For your use case, the best way to go about this is to use regular expressions. To extract the pdf, doc, and docx file names, I would suggest using separate regular expressions instead of a single one. You can use the regular expressions below:

- For PDF : (?>docx|doc|\b)(?<PDF>\w+\.pdf)
- For DOC : (?>pdf|docx|\b)(?<DOC>\w+\.doc)(?!x)
- For DOCX : (?>pdf|doc|\b)(?<DOCX>\w+\.docx)

To use these regular expressions, I would suggest the 'Extract Regex Values' action from the 'Utility - Strings' VBO. The action needs a regular expression, which we have prepared above.

Next, we require a target string, which is the text the file names need to be extracted from. In our case it is: L1234ty.pdfL1244re.docxL1221ytr.doc

Finally, we require a collection with two columns, 'Name' and 'Value', both of type Text. You can name this collection anything you want; I have named it 'RegexReturn'. When you create the collection, set the name of the regex group as an initial value in the 'Name' column. Since we are working on the regular expression for PDF first, we use the group name defined in that expression, (?>docx|doc|\b)(?<PDF>\w+\.pdf), which is 'PDF'.

The screenshot below shows how to create this collection:

Once the collection is ready, assign the same collection as the output of the action, 'Extract Regex Values' as well. The action will look something like this:

Upon successfully executing the action, you can see that we get the Value field populated against the Group name that we set:

You need to repeat the same steps in order to create two more similar actions for the keywords DOC and DOCX where the only change would be in the regular expression and the group name. Also, for these two actions you need to create two separate collections where the respective group name value will be stored.
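For anyone who wants to try the three patterns outside Blue Prism, here is a minimal Python sketch of the same idea. It is an adaptation, not the VBO itself: Python's `re` module writes named groups as `(?P<name>...)`, and the atomic group `(?>...)` is replaced by a plain non-capturing group `(?:...)` because older Python versions do not support atomic groups. The names `PATTERNS`, `extract_first`, and `regex_return` are made up for illustration.

```python
import re

TEXT = "L1234ty.pdfL1244re.docxL1221ytr.doc"

# Python ports of the three patterns from the post (assumption: a
# non-capturing group stands in for the original atomic group).
PATTERNS = {
    "PDF": r"(?:docx|doc|\b)(?P<PDF>\w+\.pdf)",
    "DOC": r"(?:pdf|docx|\b)(?P<DOC>\w+\.doc)(?!x)",
    "DOCX": r"(?:pdf|doc|\b)(?P<DOCX>\w+\.docx)",
}


def extract_first(text, group):
    """Return the first value captured by the named group, or "" if none."""
    match = re.search(PATTERNS[group], text)
    return match.group(group) if match else ""


# Mirrors the Name/Value collection: one entry per group name.
regex_return = {group: extract_first(TEXT, group) for group in PATTERNS}
print(regex_return)
```

Like the action described above, `extract_first` only returns the first occurrence for each extension.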

NOTE: This solution has a shortcoming in the way the action behaves by default: if more than one file name with the same extension occurs, only the first occurrence is extracted. For example, suppose the test string were L1234ty.pdfL1244re.docxL1221ytr.docL123.pdf. There are two file names ending in .pdf, so with the above approach you will only be able to extract L1234ty.pdf, which occurs first.

Hence, I would suggest using this action directly only if each extension occurs once, as it does for PDF in the original test string you provided.

Enhanced Solution:

If you want multiple occurrences to be picked up, you can remove the extracted text from the original text and then run this action again. The action then runs in an iterative loop: first extract the text, then check whether the extracted text is blank. If it is not blank, a match was found in the current iteration, so store the extracted value in a separate result collection and replace the extracted text in the original text with a blank value. If the extracted text is blank in any iteration, no more occurrences are available and you can exit the loop.
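The loop described above can be sketched in Python as follows. This is only an illustration under assumptions: `extract_all` and `PDF_PATTERN` are hypothetical names, the pattern is a Python port of the PDF expression (non-capturing group instead of the atomic group), and the matched text is removed by position rather than by a Replace action.

```python
import re

# Python port of the PDF pattern from the post (assumption: a
# non-capturing group replaces the original atomic group).
PDF_PATTERN = r"(?:docx|doc|\b)(?P<PDF>\w+\.pdf)"


def extract_all(text, pattern, group):
    """Extract the named group repeatedly, removing each hit from the text."""
    results = []
    while True:
        match = re.search(pattern, text)
        if match is None:
            # Nothing extracted in this iteration: no more occurrences.
            break
        results.append(match.group(group))
        # Blank out the extracted file name in the working text.
        text = text[:match.start(group)] + text[match.end(group):]
    return results, text


found, remaining = extract_all(
    "L1234ty.pdfL1244re.docxL1221ytr.docL123.pdf", PDF_PATTERN, "PDF"
)
print(found)      # the extracted .pdf names, in order of appearance
print(remaining)  # what is left for the DOC and DOCX passes
```

The leftover text returned by one pass can be fed into the next pass for DOC or DOCX.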

The same steps will be followed separately for each keyword value. The overall workflow for the enhanced solution will look something like this:

------------------------------
Hope it helps you. If it resolves your query, please mark it as the best answer so that others with the same problem can find the answer easily.

Regards,
Devneet Mohanty
Intelligent Process Automation Consultant
Blue Prism 7x Certified Professional
Website: https://devneet.github.io/
Email: devneetmohanty07@gmail.com

------------------------------

---------------------------------------------------------------------------------------------------------------------------------------
Hope this helps you out and if so, please mark the current thread as the 'Answer', so others can refer to the same for reference in future.
Regards,
Devneet Mohanty,
SS&C Blueprism Community MVP 2024,
Automation Architect,
Wonderbotz India Pvt. Ltd.

Kishore_KumarDe · ‎18-01-22

Hi Harsh,

The easiest solution that I can think of is to put all the file patterns (pdf, docx, doc) into a collection and then perform the steps below:
1. Loop through this collection of file patterns.
2. Replace each file pattern with the pattern plus an additional delimiter (;). So you replace "pdf" with "pdf;", then "docx" with "docx;", and so on.
3. Use another replace to convert doc;x to docx.
4. Use the string utility to split the string on your delimiter (;).

Note: docx should always be replaced before doc. Verify such scenarios for all the file patterns.
Limitation: you must know all the possible file patterns in advance.
Suggestion: you can put all the possible file patterns in an environment variable so that the list is easier to update without code changes if you find a new pattern.
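As a rough illustration of these steps outside Blue Prism, here is a small Python sketch. The names `EXTENSIONS`, `DELIMITER`, and `split_file_names` are hypothetical, and the sketch assumes that the delimiter never appears in a file name and that no file name contains "pdf" or "doc" anywhere other than in its extension; a name such as mydoc1.pdf would be split in the wrong place.

```python
# Hypothetical pattern list; in Blue Prism this could come from a
# collection or an environment variable. docx is listed before doc.
EXTENSIONS = ["pdf", "docx", "doc"]
DELIMITER = ";"


def split_file_names(text, extensions=EXTENSIONS, delimiter=DELIMITER):
    """Split concatenated file names by marking each extension with a delimiter."""
    # Steps 1-2: append the delimiter after every known extension.
    for ext in extensions:
        text = text.replace(ext, ext + delimiter)
    # Step 3: replacing "doc" also hits "docx", producing "doc;x"; repair it.
    text = text.replace("doc" + delimiter + "x", "docx")
    # Step 4: split on the delimiter and drop the empty trailing piece.
    return [name for name in text.split(delimiter) if name]


print(split_file_names("L1234ty.pdfL1244re.docxL1221ytr.doc"))
```

One more ambiguity to keep in mind: a ".doc" file followed directly by a name starting with "x" is indistinguishable from a ".docx" file, and this sketch always reads it as ".docx".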

Hope that helps. Enjoy!!

------------------------------
Kishore Deka
Lead Software Engineer
EPAM systems
------------------------------

If my answer provided any assistance, please vote as "Best Answer". Kishore Deka Lead Software Engineer EPAM systems Connect on LinkedIn https://www.linkedin.com/in/kishoredeka1410/

HARSHVERMA · ‎18-01-22

The file names do not always start with L. All files will have one of the extensions pdf, doc, or docx.

Yes, these are the only three extensions, and yes, every file name must have an extension.

------------------------------
HARSH VERMA
------------------------------

ewilson · ‎18-01-22

Hi @HARSHVERMA,

Ok. I think both solutions provided below will get you most of the way, but there are shortcomings in both solutions. Devneet has called out the issues with his RegEx design. As for Kishore's solution, you'll run into an issue when you try to match on .doc after having matched on .docx. In other words, if we take the example you provided and we think through how the replacement will work the final string will look something like this:

L1234ty.pdf;L1244re.doc;x;L1221ytr.doc;

Notice the semicolon inserted between the "c" and "x" on "docx". So that doesn't quite work. You can do this with existing Calculation and Decision stages in Blue Prism though. I've attached a simple example process that I believe addresses your requirements as we know them now.

The general idea is (working from left to right):

1. Locate the position of the first period.
2. Extract all characters up to, and including, the period. This will be the root of the file name.
3. Remove those characters from the work data.
4. Determine whether the work data now starts with "pdf", "docx", or "doc", append the appropriate extension to the root file name, and write the full file name to the File Names collection.
5. Remove the extension information from the front of the work data.
6. Repeat steps 1-5 until no further periods are found.
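The attached process is not reproduced here, but the steps above can be sketched in Python roughly like this. The function name `split_file_names` and the decision to raise an error on an unknown extension are assumptions made for the sketch, not part of the attached process. Extensions are checked longest first so that "docx" is not mistaken for "doc".

```python
def split_file_names(text, extensions=("pdf", "docx", "doc")):
    """Walk the text left to right, peeling off one file name per period."""
    file_names = []
    work = text
    # Longest extensions first, so "docx" is tested before "doc".
    ordered = sorted(extensions, key=len, reverse=True)
    while "." in work:
        dot = work.index(".")      # position of the first period
        root = work[:dot + 1]      # root of the file name, period included
        work = work[dot + 1:]      # remove the root from the work data
        for ext in ordered:
            if work.startswith(ext):
                file_names.append(root + ext)
                work = work[len(ext):]  # remove the extension as well
                break
        else:
            # Assumption: an unknown extension is treated as an error.
            raise ValueError("unknown extension after " + repr(root))
    return file_names


print(split_file_names("L1234ty.pdfL1244re.docxL1221ytr.doc"))
```

Because this approach keys on the period rather than on the extension text, a root that happens to contain "pdf" or "doc" does not confuse it.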

Cheers,

------------------------------
Eric Wilson
Director, Integrations and Enablement
Blue Prism Digital Exchange
------------------------------

Kishore_KumarDe · ‎18-01-22

Yes Harsh and Eric.

I missed that doc;x point, but you have an easy workaround for it anyway. And if those are the only formats you will have, then that will probably be the easiest solution, and it should always work.

------------------------------
Kishore Deka
Lead Software Engineer
EPAM systems
------------------------------

