Regex syntax to extract multiple rows

tetiana.laptieva · ‎22-12-21

How to catch text which starts and ends with predefined words and has multiple rows in between?

The position of this text on the page is dynamic. The text can appear multiple times in a document (including on different pages). The unique text is between the Start and End words.

Example on 1 page:

Start of text I need to take ke[pkg;gn’;fgn unique text 1

continue unique text 1

Kmbkfdmb End.

Start of text I need to take phtrhrttn,n flm;flng;lfgn unique text 2

continue unique text 2

gfknjfogknmfg End.

Or can be:

1st page:

Start of text I need to take ke[pkg;gn’;fgn unique text 1

continue unique text 1

Kmbkfdmb End.

2nd page:

Start of text I need to take phtrhrttn,n flm;flng;lfgn unique text 2

continue unique text 2

gfknjfogknmfg End.

I need to extract all of them.

Because the position is dynamic, I used only the StrictRegex flag and a Format expression.

A regex like Start(.*?|\n)*?End.* or ^Start:(.*?|\n)*?End.*$ causes a CPU issue.

The same issue occurs even when I want to extract only 1 field (a unique regex in the DFD and unique text in the PDF).

With the StrictRegex flag disabled, the issue is the same.

A regex like Start[\s\S]*?End.* or ^Start[\s\S]*?End.*$ gives low confidence (the text is red).
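For reference, the behaviour of the lazy [\s\S]*? form can be checked outside Decipher. The sketch below uses Python's re module only to illustrate the pattern; it assumes nothing about Decipher's own regex engine, and the sample text, the form feed used as a page separator, and the name extract_blocks are all made up for the illustration.

```python
import re

# Hypothetical sample mimicking the post: two blocks, the second after a page break.
SAMPLE = (
    "header text\n"
    "Start of text I need to take aaa unique text 1\n"
    "continue unique text 1\n"
    "bbb End.\n"
    "\f"  # form feed, used here as an assumed page separator
    "Start of text I need to take ccc unique text 2\n"
    "continue unique text 2\n"
    "ddd End.\n"
    "footer text\n"
)

# A single lazy quantifier over "any character, newline included".
# There is no nested group, so each block is found by one forward expansion.
BLOCK = re.compile(r"Start[\s\S]*?End\.")


def extract_blocks(text: str) -> list[str]:
    """Return every Start...End. block in document order."""
    return [m.group(0) for m in BLOCK.finditer(text)]


if __name__ == "__main__":
    for i, block in enumerate(extract_blocks(SAMPLE), 1):
        print(f"--- block {i} ---")
        print(block)
```

Because the quantifier is lazy, each match stops at the first "End." after its "Start", so two blocks on the same page or on different pages come back as separate matches.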

It works only for some already trained docs (not all). I processed about 80 docs (5 original docs), each as a new batch submitted at the end, and the text is still red. It does not catch the text in every doc, even after submitting the batch 15 times. The text is combined into one correct region, but the field is not filled automatically. I select it manually, submit the batch, and send the doc again, and nothing changes.
Thanks

Ben.Lyons1 · ‎22-12-21

Hi Tetiana,

If the number of occurrences differs between documents, you may have difficulty extracting multiple instances of a set regex string. This would indicate the document is unstructured and not a suitable use for this kind of feature.

Decipher can handle unstructured documents with the NLP function, but you would need to use a different methodology for locating the respective text.

I'm not sure what you mean by CPU issue as the image is quite small. I would expect that when you're using Regex to locate a value, the capture clients will have to work hard against some potentially large strings. You will see multiple capture clients working, as this is the multi-threading functionality working on different pages.

Also, the StrictRegex flag is only needed if you want to ignore the location of the data in the document and any sample headers.

Let me know if this helps.

Thanks

Ben

Ben Lyons
Principal Product Specialist - Decipher
SS&C Blue Prism
UK based

tetiana.laptieva · ‎22-12-21

Hey Ben,
thanks for the help.

The regex ((.|\n)*) works similarly to [\s\S]*: it matches with low confidence (text in red). Does it make sense to continue training?
I'm asking because the best practice guide mentions low confidence as a reason to rebuild the DFD.
Is it possible to catch what I need in another way?

As for the CPU: with the regex (.*?|\n)*, all CPUs on the machine run at 100% for about 6 hours. Currently our only way to stop it is to restart the server.

Ben.Lyons1 · ‎22-12-21

Hi Tetiana,

I don't think continuing to train the document will remove the low confidence indicator in this instance. It may continue to trigger where additional text is within the region you're reading. We're looking into this as a future improvement.

You could use the CCL misc parameter to potentially prevent it from occurring, e.g. CCL=50 and if you're happy with your Regex, you should still get the results you're after.

Using that specific Regex will likely work the CPU hard. Decipher matches the expression against all the text in the document, so if you have a lot of text and a lot of pages, you will see a spike like this for quite a while. As I say, it sounds like it may be more suitable for the unstructured document processing model.
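As background on why that form is costly: in (.*?|\n)*, the inner .*? and the outer * can divide the same run of characters between iterations in many different ways, so a backtracking regex engine may try a very large number of combinations before giving up on a failed match. Whether Decipher's engine behaves this way is an assumption here. If the extracted page text is available to a script outside Decipher (also an assumption), the same Start...End spans can be collected with plain substring search, which never backtracks. The function name extract_between and its default markers are hypothetical.

```python
def extract_between(text: str, start: str = "Start", end: str = "End.") -> list[str]:
    """Collect every start...end span with str.find in one forward pass."""
    blocks = []
    pos = 0
    while True:
        begin = text.find(start, pos)
        if begin == -1:
            break  # no further start marker
        stop = text.find(end, begin + len(start))
        if stop == -1:
            break  # unterminated block: stop instead of rescanning
        stop += len(end)
        blocks.append(text[begin:stop])
        pos = stop  # continue after the block just collected
    return blocks


if __name__ == "__main__":
    sample = "x Start a\nb End. y Start c\n\nd End. z Start e"
    for block in extract_between(sample):
        print(repr(block))
```

The search position only ever moves forward, so a missing End marker ends the scan rather than triggering retries.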

Thanks

Ben

tetiana.laptieva · ‎22-12-21

As for the StrictRegex flag:
I use it because the text has a dynamic position. I don't use any other settings in the DFD.
When I need to find text in a single row, the StrictRegex flag plus a regex expression works perfectly. It catches the text correctly from scratch.

It also works well when I need to catch a single row which is not unique in the text. In the DFD I have a few fields with the StrictRegex flag and the same regex expression, like Start.*
It catches all matched values.
But once I add multiline, it doesn't work.
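One regex-level difference between the single-row and multi-row cases, independent of Decipher, is that "." does not match a newline by default. The sketch below shows this with Python's re module; the sample text is made up, and it is not a claim about how Decipher itself evaluates the field.

```python
import re

# Hypothetical three-line block.
TEXT = "Start line one\nline two\nlast line End.\n"

# '.' excludes '\n' by default, so the match ends with the first line.
single_line = re.findall(r"Start.*", TEXT)

# With plain '.', the pattern cannot cross lines to reach 'End.' at all.
no_flag = re.findall(r"Start.*End\.", TEXT)

# [\s\S] matches any character including '\n', so the match spans rows.
multi_line = re.findall(r"Start[\s\S]*?End\.", TEXT)

# The inline (?s) flag makes '.' match '\n' as well; same result as above.
dotall = re.findall(r"(?s)Start.*?End\.", TEXT)

if __name__ == "__main__":
    print(single_line)
    print(no_flag)
    print(multi_line)
    print(dotall)
```

So Start.* is expected to return one row per occurrence, while a multi-row capture needs either a character class such as [\s\S] or a "dot matches newline" option, if the engine in use offers one.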

Ben.Lyons1 · ‎22-12-21

Hi Tetiana,

The StrictRegex flag is designed more for a unique identifier that could appear anywhere in a document, like a policy number with a specific format that perhaps doesn't say "Policy Number" next to it.

If you'd like us to see how we might be able to better work with your document/scenario, please feel free to contact your BP account manager or raise a request for an Expert Connect session.

Thanks

Ben

Ben.Lyons1 · ‎23-12-21

Hi Tetiana,

I've had another thought regarding the high CPU utilization. Does your Document Type have ML enabled? If so, is it marked for training or set for periodic training? And what is the training size?

Thanks

Ben

SS&C Blue Prism Community
