
Bug using regex in format expression

OroelIpas
Level 2
Hi,
 
I want to report what looks like incorrect behavior in Decipher when working with regular expressions.

I am using Decipher to extract information from ID documents, so I have covered the sensitive information in all the screenshots I attach. Here is an example document:
9017.png

I want to extract the name of the person (all the words covered in white), so the header of the field is "NOMBRE". To prevent Decipher from extracting the alphanumeric code covered in blue, I wrote this regex:
([A-ZÁÉÍÓÚÜÑ]+[\n ]+[A-ZÁÉÍÓÚÜÑ]+)([\n ]+[A-ZÁÉÍÓÚÜÑ]+)*
The regex makes Decipher extract something that has two or more words (including all the Spanish characters, but not allowing numbers), separated by spaces or newlines.
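
As a rough illustration of what this pattern accepts, here is a minimal Python sketch. Python's `re` module is an assumption made for demonstration only: the thread does not say which regex engine Decipher uses or whether it requires the whole field to match. The sample values and the helper `is_valid_name` are hypothetical.

```python
import re

# Pattern from the post: two or more uppercase Spanish words,
# separated by spaces or newlines.
NAME_PATTERN = re.compile(
    r"([A-ZÁÉÍÓÚÜÑ]+[\n ]+[A-ZÁÉÍÓÚÜÑ]+)([\n ]+[A-ZÁÉÍÓÚÜÑ]+)*"
)


def is_valid_name(text: str) -> bool:
    """Return True when the whole text matches the pattern (assumed full match)."""
    return NAME_PATTERN.fullmatch(text) is not None


# Hypothetical field values, not taken from the screenshots.
samples = [
    "MARÍA JOSÉ",                 # two words on one line
    "MARÍA JOSÉ\nGARCÍA NÚÑEZ",   # four words over two lines
    "ABC123456",                  # alphanumeric code
    "MARÍA",                      # a single word
]

for sample in samples:
    print(repr(sample), is_valid_name(sample))
```

Under these assumptions the first two samples are accepted, while the alphanumeric code and the single word are rejected.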

As shown in the first screenshot, Decipher did not extract the second line of the name (it is a multiline field), so I manually reshaped the box of the field. After doing this, the validation of the field fails and the box turns red, even though the data should match the regex.

9018.png

The way I found to fix this is:
  1. Click inside the field
  2. Modify the data inside (e.g. remove a character)
  3. Click out of the field
  4. Now the field turns green showing the data format is valid
  5. Click inside the field and undo the modification I made (introduce the character I deleted)
After these steps the field contains the same data as before, but Decipher now recognizes its format as valid.
9019.png
Can someone explain this behavior? 
It is not a big deal when doing manual data verification, but it could become a big problem if the regex misbehaves when running Decipher in autonomous mode.

Thanks in advance for any help


------------------------------
Oroel Ipas
------------------------------
4 REPLIES

Ben.Lyons1
Staff
Hi Oroel,

I've just spent some time looking into this behaviour using a similar field. I recall previously having some difficulty with multi-line fields and regex, but it was possible to get them working.

It's due to an interaction between the multi-line flag and a regex that extracts over multiple lines. I deselected the multi-line flag, changed the "+" quantifier after the newline character class in the second group to "*", and added an optional newline character class after the second word, as zero occurrences needs to be an option for it to work.

([A-ZÁÉÍÓÚÜÑ]+[\n ]+[A-ZÁÉÍÓÚÜÑ]+[\n ]*)([\n ]*[A-ZÁÉÍÓÚÜÑ]+)*

Single line field

8992.png

Multiline field
8993.png
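
One observable difference between the two patterns can be sketched in Python. This is only an illustration under assumptions: Python's `re` stands in for whatever engine Decipher uses, a full match is assumed, and the sample text and the helper `matches_fully` are hypothetical. It shows that the relaxed pattern tolerates trailing whitespace or a trailing line break, which the original rejects; the thread does not confirm that this is what Decipher encounters internally.

```python
import re

# Original pattern from the question.
ORIGINAL = re.compile(
    r"([A-ZÁÉÍÓÚÜÑ]+[\n ]+[A-ZÁÉÍÓÚÜÑ]+)([\n ]+[A-ZÁÉÍÓÚÜÑ]+)*"
)
# Relaxed pattern from this reply ("*" allows zero separators).
RELAXED = re.compile(
    r"([A-ZÁÉÍÓÚÜÑ]+[\n ]+[A-ZÁÉÍÓÚÜÑ]+[\n ]*)([\n ]*[A-ZÁÉÍÓÚÜÑ]+)*"
)


def matches_fully(pattern: re.Pattern, text: str) -> bool:
    """Return True when the pattern matches the entire text."""
    return pattern.fullmatch(text) is not None


# Hypothetical value with a trailing line break after the last word.
with_trailing_break = "MARÍA JOSÉ\n"

print(matches_fully(ORIGINAL, with_trailing_break))  # original rejects it
print(matches_fully(RELAXED, with_trailing_break))   # relaxed accepts it
```

Note that the "*" quantifiers also make the separators between later words optional, so the relaxed pattern is slightly more permissive than the original in general.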

Give this a try, if you haven't already, and let me know how you get on. This issue has already been raised with the development team.

Thanks

------------------------------
Ben Lyons
Senior Product Specialist - Decipher
Blue Prism
UK based
------------------------------
Ben Lyons
Principal Product Specialist - Decipher
SS&C Blue Prism
UK based

Hi Oroel,

I have an update from the development team on this matter.

There is a deliberate difference in how Decipher handles multi-line fields compared with single-line fields, which affects how the regex match is applied. This is by design and supports the extraction of multi-line fields.

There's a simple change you can make to your regex so that it works with a multi-line field: add "\r" before "\n", because the two characters appear together as a carriage return plus line feed.

E.g. ([A-ZÁÉÍÓÚÜÑ]+[\r\n ]+[A-ZÁÉÍÓÚÜÑ]+[\r\n ]*)([\r\n ]*[A-ZÁÉÍÓÚÜÑ]+)*

And using the same example from above.
8997.png
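
The effect of adding "\r" can be reproduced outside Decipher. The following Python sketch is an illustration under assumptions: Python's `re` is used in place of Decipher's engine, a full match is assumed, and the sample name is hypothetical. It compares the earlier pattern with the "\r"-aware one on text whose lines are separated by "\r\n".

```python
import re

# Pattern from the earlier reply: only "\n" and space are separators.
LF_ONLY = re.compile(
    r"([A-ZÁÉÍÓÚÜÑ]+[\n ]+[A-ZÁÉÍÓÚÜÑ]+[\n ]*)([\n ]*[A-ZÁÉÍÓÚÜÑ]+)*"
)
# Pattern from this reply: "\r" is also accepted as a separator.
CRLF_AWARE = re.compile(
    r"([A-ZÁÉÍÓÚÜÑ]+[\r\n ]+[A-ZÁÉÍÓÚÜÑ]+[\r\n ]*)([\r\n ]*[A-ZÁÉÍÓÚÜÑ]+)*"
)

# Hypothetical multi-line value using "\r\n" between lines.
crlf_text = "MARÍA JOSÉ\r\nGARCÍA NÚÑEZ"

# The "\r" is not in the first pattern's character classes, so it cannot
# match the whole value; the second pattern can.
print(LF_ONLY.fullmatch(crlf_text) is not None)
print(CRLF_AWARE.fullmatch(crlf_text) is not None)
```

Because "\n" is still in the character class, the "\r"-aware pattern also keeps matching values whose lines are separated by "\n" alone.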
Let me know how you get on.

Thanks


------------------------------
Ben Lyons
Senior Product Specialist - Decipher
Blue Prism
UK based
------------------------------

Hi Ben,

Thanks for your quick response and your help. I tried the new regex you provided, and it worked.

I still have a question about how the Multiline flag affects extraction.
Does the flag help Decipher select more than one line of data? After training on ~20 documents with the Multiline flag, Decipher still takes only the first line of the data.

How can I force Decipher to always extract more than one line if I know that a specific field is always multiline? One of the fields I want to extract always has three lines. My original idea was to use a regex that includes as many "\n" as I expect the field to have. Any help with this?
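
The idea of a regex with a fixed number of line breaks can be sketched as follows. This is a hypothetical Python illustration, not a confirmed Decipher technique: the pattern `THREE_LINES`, the helper `has_three_lines`, and the sample values are all assumptions, and the sketch only shows that such a pattern can validate a three-line value. Whether a format expression can make Decipher capture more lines is not established in this thread.

```python
import re

# One line: one or more uppercase Spanish words separated by single spaces
# (the same letter set as the pattern earlier in the thread).
WORDS = r"[A-ZÁÉÍÓÚÜÑ]+(?: [A-ZÁÉÍÓÚÜÑ]+)*"

# Exactly three lines: a first line followed by two more, each introduced
# by a line break ("\r?\n" accepts both "\n" and "\r\n").
THREE_LINES = re.compile(rf"{WORDS}(?:\r?\n{WORDS}){{2}}")


def has_three_lines(text: str) -> bool:
    """Return True when the text is exactly three lines of uppercase words."""
    return THREE_LINES.fullmatch(text) is not None


# Hypothetical values.
print(has_three_lines("CALLE MAYOR\r\nZARAGOZA\r\nESPAÑA"))  # three lines
print(has_three_lines("CALLE MAYOR\nZARAGOZA"))              # two lines
```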

In some document types, each line contains different data and I want to keep the "\n" to know which line is which, but in other cases extracting all the data on one line is perfectly fine. Do you still recommend using the Multiline flag in these cases?
9003.png
Thank you very much for your help,

Oroel Ipas.

------------------------------
Oroel Ipas
------------------------------

Hi Oroel,

The multi-line flag mostly changes how the field is displayed in the verification screen and preserves the line breaks in the export. I don't believe it has a significant effect on the training, at least not more so than the region selection by the user.

I'm not sure there is a way to specify a number of lines or force it in any way. Though it might be worth trying to train 4 separate fields and combine them in a 5th field with a formula. Before doing this, export your training data and keep it somewhere safe, then delete the training data in the app as this will speed up the new training.

Thanks

------------------------------
Ben Lyons
Senior Product Specialist - Decipher
Blue Prism
UK based
------------------------------