Extracting first sentence from a paragraph
11-09-17 11:47 AM
Hi,
Is there a way to extract the first sentence from a paragraph? Can regex be used here, and if yes, how?
For example, the paragraph below has two sentences, and I need the first one:
The Japanese loan will be available at 0.1% interest rate on Oct. 25 and India will be able to repay this in 50 years. Repayment will begin 15 years after the loan is received.
My desired output: The Japanese loan will be available at 0.1% interest rate on Oct. 25 and India will be able to repay this in 50 years.
How can I do that?
Regards
Karan
2 REPLIES
11-09-17 02:50 PM
The easiest option would be:
InStr([text], ". ") - This outputs the [character number] of the period that ends the first sentence;
Left([text], [character number]) - This extracts the first sentence (preferably into a [new sentence data item]);
Replace([text], [new sentence data item], "") - This removes the first sentence from [text] so you can move on to the next one, if required.
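For readers who want to try the three steps outside Blue Prism, here is a rough Python sketch of the same idea. The helpers instr and left are hypothetical stand-ins written for this example, assuming a 1-based InStr that returns 0 when nothing is found; they are not Blue Prism functions, and the sample text is made up.

```python
def instr(text: str, needle: str) -> int:
    """Hypothetical stand-in for InStr: 1-based position, 0 if not found."""
    return text.find(needle) + 1


def left(text: str, count: int) -> str:
    """Hypothetical stand-in for Left: the first `count` characters."""
    return text[:count]


text = "First sentence here. Second sentence here."

# Step 1: position of the period that is followed by a space.
pos = instr(text, ". ")

# Step 2: everything up to and including that period.
first = left(text, pos)

# Step 3: remove the first sentence so the next one can be processed.
# Only the first occurrence is replaced here, which is a choice made
# for this sketch.
rest = text.replace(first, "", 1).strip()

print(first)  # First sentence here.
print(rest)   # Second sentence here.
```

This only works when the first ". " really is the end of the first sentence, so abbreviations such as "Oct. " will cut the sentence short.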
19-09-17 08:37 PM
Karan,
Please understand that there is no simple solution here that would give 100% accurate results. Defining what makes up a sentence as a set of rules that software can follow without any cognitive understanding is pretty hard. There are cognitive tools available that can provide some syntactic analysis of a document, but they might be overkill for what you are looking to achieve, especially if you want a pure Blue Prism solution.
As Ivan has demonstrated, you can get pretty close by looking for the first period, question mark, or exclamation mark followed by whitespace.
Regex: /^(.*?)[.?!]\s/
In Blue Prism you could keep it simple with the expression Left([text], InStr([text], ". ")) to take the first sentence, and Trim(Replace([text], Left([text], InStr([text], ". ")), "")) to get the text that remains. However, given the use case in your post, the first sentence would come out as:
"The Japanese loan will be available at 0.1% interest rate on Oct."
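The same truncation can be reproduced with the regex above. The snippet below is a small Python sketch using the standard re module and the paragraph from the original post; the variable names are made up for this example.

```python
import re

# The pattern from the post: lazily match up to the first '.', '?' or '!'
# that is followed by a whitespace character.
FIRST_SENTENCE = re.compile(r"^(.*?)[.?!]\s")

paragraph = (
    "The Japanese loan will be available at 0.1% interest rate on Oct. 25 "
    "and India will be able to repay this in 50 years. "
    "Repayment will begin 15 years after the loan is received."
)

match = FIRST_SENTENCE.search(paragraph)

# group(0) is the whole match including the terminator and the trailing
# whitespace; group(1) would exclude the terminator.
naive_first = match.group(0).strip() if match else paragraph

print(naive_first)
```

The "." in "0.1%" is skipped because it is not followed by whitespace, but the one in "Oct. " is, so the match stops there.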
You would also need to consider the other characters that can mark the end of a sentence.
That means comparing the integer outputs of InStr([text], "! "), InStr([text], ". ") and InStr([text], "? ") to find which expression returns the lowest value, and ignoring any that did not find a match, before performing the full expression to manipulate the text.
You also need to consider the case where [text] is only one sentence. You would need a decision stage to check that a sentence break was actually found (e.g. InStr([text], ". ") > 0, assuming InStr returns 0 when nothing is found) before taking the Left. You then also need to consider the case where there are no periods in the data item at all.
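Putting the last two points together, one possible shape for the logic is sketched below in Python. The function first_sentence is a hypothetical helper for illustration only, and falling back to the whole text when no break is found is an assumption about what you would want, not a rule.

```python
def first_sentence(text: str) -> str:
    """Return text up to the earliest '. ', '! ' or '? ' (terminator included).

    Falls back to the whole (trimmed) text when no sentence break is found,
    which covers a single sentence and text with no punctuation at all.
    """
    text = text.strip()

    # 0-based position of each terminator followed by a space, -1 if absent.
    positions = [text.find(terminator + " ") for terminator in ".!?"]
    found = [p for p in positions if p != -1]

    if not found:
        return text

    # Earliest break wins; +1 keeps the terminator itself.
    return text[: min(found) + 1]


print(first_sentence("Really? Yes. Done!"))   # Really?
print(first_sentence("Only one sentence."))   # Only one sentence.
print(first_sentence("no punctuation"))       # no punctuation
```

In Blue Prism the equivalent would be a calculation per terminator followed by a decision stage, and the abbreviation problem described above still applies.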
Even checking whether the first character after ". " is lowercase (meaning the text is still part of the first sentence) won't be accurate. In your use case that character is a digit, which doesn't indicate whether the text continues the first sentence or starts a new one.
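To illustrate why the lowercase check is inconclusive, here is a short Python sketch. The function continues_sentence and its three-way return value are invented for this example.

```python
from typing import Optional


def continues_sentence(text: str) -> Optional[bool]:
    """Guess whether the text after the first '. ' continues the sentence.

    True  -> next character is lowercase (probably the same sentence)
    False -> next character is uppercase (probably a new sentence)
    None  -> no '. ' found, or the next character is neither (e.g. a digit)
    """
    pos = text.find(". ")
    if pos == -1:
        return None
    next_char = text[pos + 2 : pos + 3]
    if next_char.islower():
        return True
    if next_char.isupper():
        return False
    return None


print(continues_sentence("See fig. three for details. Done."))    # True
print(continues_sentence("It ended. Then rain."))                 # False
print(continues_sentence("interest rate on Oct. 25 and India"))   # None
```

The last call is the case from the original post: "25" follows "Oct. ", so the heuristic cannot decide either way.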
Tom
