
DFD definition for multiline REGEX

ArturoGarcia
Level 2
Dear Community,

In my organization we are using Decipher in some projects, and we have run into a gap in our knowledge.

We are trying to extract a phrase from PDFs. This phrase can be split across 2, 3, 4 or 5 lines, depending on the layout.
We can define a regular expression that accounts for this, but it seems that Decipher is not able to apply the regex across multiple lines, so it doesn't recognise our intended phrase.

Examples:

9470.png

9471.png
We need a DFD with a regex that extracts the phrase. Image 1: "DEMARCACIÓN DE CARRETERAS DEL ESTADO EN CATALUÑA". Image 2: "DEMARCACIÓN DE CARRETERAS DEL ESTADO EN CASTILLA Y LEÓN OCCIDENTAL".

We have this regex working in other languages: (DEMARCACION)[\s\n\r]*(DE)[\s\n\r]*(CARRETERAS)[\s\n\r]*(DEL)[\s\n\r]*(ESTADO EN)[\s\n\r]*(CATALUÑA|CASTILLA-LA MANCHA). However, the multi-line limitation in Decipher doesn't allow us to succeed with it.

If you have any idea how to get around this issue, it would help a lot. I have seen other posts about this problem, but they haven't helped in our case.

Best regards, have a nice day!

------------------------------
Arturo Garcia
------------------------------
4 REPLIES

Ben.Lyons1
Staff
Hi Arturo,

Is the phrase always near the heading "DIRECCION GENERAL DE CARRETERAS"? If so, that may be a useful way to locate it without needing the regex.

Your regex also doesn't appear to use the correct characters e.g. "DEMARCACION" should be "DEMARCACIÓN". Have you tried it this way?
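
To illustrate the point about accented characters, here is a minimal sketch in Python. It uses Python's `re` module, not Decipher's own regex engine, so treat it only as a way to check a pattern outside Decipher. The `ocr_text` value and the `strip_accents` helper are hypothetical names introduced for this example.

```python
import re
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks so that an accented letter compares equal to its base letter."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Hypothetical extracted text containing the accented spelling.
ocr_text = "DEMARCACIÓN DE CARRETERAS"

# The unaccented literal does not match the accented text.
plain_match = re.search(r"DEMARCACION", ocr_text)

# Option 1: accept either spelling with a character class.
class_match = re.search(r"DEMARCACI[OÓ]N", ocr_text)

# Option 2: strip accents from the text first, then match the unaccented literal.
normalized_match = re.search(r"DEMARCACION", strip_accents(ocr_text))
```

Note that the second option also turns "Ñ" into "N", so "CATALUÑA" would become "CATALUNA" and the pattern would need to be written without the tilde as well.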

Thanks

------------------------------
Ben Lyons
Senior Product Specialist - Decipher
Blue Prism
UK based
------------------------------
Ben Lyons
Principal Product Specialist - Decipher
SS&C Blue Prism
UK based

Hello Ben,

Thank you for the reply.

Actually yes, "DIRECCION GENERAL DE CARRETERAS" is always near the phrase.

We have tried "DEMARCACIÓN" and all kinds of variations. We have also defined less restrictive regexes, and it looks like Decipher is not able to extract text that is split across lines.

For example, if we create an image that has two lines:

"hello
world" 

And we define a regex that accepts all characters (including spaces and line breaks) and words, it only returns the word "hello", not both.
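
One behaviour that produces this kind of result in many regex engines is that `.` does not match a line break by default. The sketch below shows it in Python's `re` module. Whether Decipher's engine behaves the same way is an assumption, not something confirmed in this thread, and `ocr_text` is a hypothetical stand-in for the two-line image.

```python
import re

# Hypothetical extracted text for the two-line "hello / world" image.
ocr_text = "hello\nworld"

# By default "." does not match a newline, so the match stops at the end of the first line.
default_match = re.search(r".*", ocr_text).group(0)

# With re.DOTALL, "." also matches newlines and the match spans both lines.
dotall_match = re.search(r".*", ocr_text, flags=re.DOTALL).group(0)

# [\s\S] matches any character, newline included, without needing a flag.
class_match = re.search(r"[\s\S]*", ocr_text).group(0)
```

If the tool gives no way to set flags, the `[\s\S]` form is the variant that can be written entirely inside the pattern.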

We await your response. Thank you.




------------------------------
Arturo Garcia
------------------------------

Hi Arturo,

Hmm, that shouldn't be a problem; I've had success using multi-line regex.

I assume you've seen this thread where I demo it's possible? https://community.blueprism.com/discussion/bug-using-regex-in-format-expression#bma20a0e34-b004-41b6-8465-07818380d4cd

Thanks

------------------------------
Ben Lyons
Senior Product Specialist - Decipher
Blue Prism
UK based
------------------------------

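I think a newline is missing after ESTADO: each example shows EN at the start of a new line. Here is your pattern, followed by a version that splits (ESTADO EN) into two groups so a line break is allowed between them: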

(DEMARCACION)[\s\n\r]*(DE)[\s\n\r]*(CARRETERAS)[\s\n\r]*(DEL)[\s\n\r]*(ESTADO EN)[\s\n\r]*(CATALUÑA|CASTILLA-LA MANCHA)

(DEMARCACION)[\s\n\r]*(DE)[\s\n\r]*(CARRETERAS)[\s\n\r]*(DEL)[\s\n\r]*(ESTADO)[\s\n\r]*(EN)[\s\n\r]*(CATALUÑA|CASTILLA-LA MANCHA)
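
The difference between the two patterns can be checked outside Decipher with a short Python sketch. This uses Python's `re` module rather than Decipher's engine, and `ocr_text` is a hypothetical example of extracted text with "EN" at the start of a new line.

```python
import re

# Hypothetical extracted text: "EN" starts a new line, as in the example images.
ocr_text = "DEMARCACION DE\nCARRETERAS DEL ESTADO\nEN CATALUÑA"

# Original pattern: "ESTADO EN" requires a literal space between the two words.
original = (
    r"(DEMARCACION)[\s\n\r]*(DE)[\s\n\r]*(CARRETERAS)[\s\n\r]*(DEL)"
    r"[\s\n\r]*(ESTADO EN)[\s\n\r]*(CATALUÑA|CASTILLA-LA MANCHA)"
)

# Revised pattern: any whitespace, including a line break, is allowed between ESTADO and EN.
revised = (
    r"(DEMARCACION)[\s\n\r]*(DE)[\s\n\r]*(CARRETERAS)[\s\n\r]*(DEL)"
    r"[\s\n\r]*(ESTADO)[\s\n\r]*(EN)[\s\n\r]*(CATALUÑA|CASTILLA-LA MANCHA)"
)

original_match = re.search(original, ocr_text)  # no match on this text
revised_match = re.search(revised, ocr_text)    # matches the whole phrase

# Collapse the line breaks to get the phrase on a single line.
phrase = " ".join(revised_match.group(0).split())
```

In Python, `\s` already includes `\n` and `\r`, so `[\s\n\r]*` behaves the same as `\s*` there.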

------------------------------
Ben Lyons
Senior Product Specialist - Decipher
Blue Prism
UK based
------------------------------