Monday
Hello, everyone,
I’m currently working on automating the processing of invoice documents using Decipher IDP.
However, I’ve encountered an issue where data extraction fails for a specific field on the document.
As shown in the screenshot, a bounding box is drawn around the invoice number, but the data itself isn’t extracted.
Interestingly, if I manually click the Refresh Region button, the correct value is extracted with 100% accuracy.
(In other words, the field is correctly read only when someone intervenes to press the Refresh Region button.)
I’d like to know if there’s a way to resolve this issue.
I defined the field in the DFD as follows:
ID : FT_1_USER_FIELD
Format : Text
Flags : Assignable, Required, AutoCalculate
Format Expression : ^\d{3}-?\d{2}-?\d{5}$ <---- expression for the tax number
Dependent Items : FT_1_USER_FIELD
Formula : STRREPLACE(FT_1_USER_FIELD, " ", "") <---- intended to remove spaces
If anyone has encountered and solved a similar problem, I’d greatly appreciate your advice. Thank you!
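For readers who want to check the format expression outside Decipher, here is a minimal Python sketch of what the definition above is meant to do. It assumes STRREPLACE simply removes every space, and the sample values are made up for illustration; it does not reproduce Decipher's own processing order.

```python
import re

# Format expression copied from the DFD definition above.
TAX_NUMBER = re.compile(r"^\d{3}-?\d{2}-?\d{5}$")


def strip_spaces(value: str) -> str:
    """Assumed equivalent of STRREPLACE(value, " ", ""): drop every space."""
    return value.replace(" ", "")


def is_valid(value: str) -> bool:
    """True if the whole value matches the tax number expression."""
    return TAX_NUMBER.fullmatch(value) is not None


if __name__ == "__main__":
    # Made-up sample values, not taken from a real invoice.
    for raw in ["123-45-67890", "123 -45- 67890", "1234567890"]:
        cleaned = strip_spaces(raw)
        print(repr(raw), is_valid(raw), repr(cleaned), is_valid(cleaned))
```

A value containing spaces fails the expression until the spaces are stripped, which is why the order in which the formula and the format check are applied matters.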
Monday
Hi @Sangjun ,
I would recommend removing the formula and deselecting the Auto-Calculate flag as there is potentially a more efficient way to remove the unwanted space.
You can use the misc parameter "RegexMode" and set it to "2", which uses a fuzzy matching method and can automatically remove unwanted spaces.
You can see in my example where I'm using the expression [A-Z]{2}[0-9]{5} that the space is causing an error.
I then set the Regex Mode to 2 and retried the same document; this time the space is removed by Decipher.
Make sure you test this with other documents, as other characters can also be replaced, e.g. the letter o can be changed to the digit 0.
Also, when using a formula like this, it's best to use the special variable SELF, e.g. STRREPLACE(SELF, " ", "").
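To illustrate why fuzzy cleanup needs testing, here is a rough Python sketch of the general idea. It is not how Decipher implements RegexMode; the function name fuzzy_clean and the look-alike table are hypothetical, chosen only to show how removing spaces and swapping similar characters can change a value.

```python
import re
from typing import Optional

# Example expression from the post above: two letters followed by five digits.
PATTERN = re.compile(r"[A-Z]{2}[0-9]{5}")

# Hypothetical look-alike table (an assumption, not Decipher's actual rules).
LOOKALIKES = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1"})


def fuzzy_clean(raw: str) -> Optional[str]:
    """Return a cleaned value that matches PATTERN, or None if none is found."""
    compact = "".join(raw.split())  # drop all whitespace
    if PATTERN.fullmatch(compact):
        return compact
    # Keep the two-character prefix and swap look-alikes in the numeric tail.
    candidate = compact[:2] + compact[2:].translate(LOOKALIKES)
    return candidate if PATTERN.fullmatch(candidate) else None


if __name__ == "__main__":
    for raw in ["AB 12345", "AB1234O", "AB12"]:
        print(repr(raw), "->", fuzzy_clean(raw))
```

The second sample shows the risk: a letter O in the numeric part is silently turned into a zero, which is helpful for OCR noise but wrong if the letter was genuine.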
Kind Regards
yesterday
Dear Ben Lyons,
Even after applying the method you suggested, the issue has not been resolved.
When I create a new DFD specification and test it with only the StrictPosition=On option, the bounding box is still drawn, but the tax number inside it is still not captured.
The resolution of the PDF file seems fine, and I believe the content is vector-based, so I am unsure why this issue is occurring.
yesterday
In any case, I am now using the method you suggested to remove spaces: no formula, no Auto-Calculate flag, and 'RegexMode=2' set in the misc parameters.
yesterday
Hi @Sangjun ,
I have encountered issues similar to what you mentioned. Below are the steps I usually follow. In some cases, I have observed that Decipher can automatically pick up the fields after performing these steps. You can give it a try.
• Take a backup of the training data.
• Delete the existing training data.
• Train the document again. When starting the training process, please select only the “Assignable” flag and remove the formulas and miscellaneous parameters.
Regards,
Athiban
yesterday
Hi @Sangjun ,
I agree that it might be useful to restart your training as a new DFD will use the same training data (unless the segregation option has been selected).
Sometimes the hyphen read by the PDF extraction or OCR engine is not the same Unicode character as the plain "-" in the expression. It could be worth trying the expression ^\d{3}.?\d{2}.?\d{5}$ , where "." matches any single character.
Though the first step would certainly be to test it without any expression.
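The hyphen point can be checked outside Decipher with a short Python sketch. The sample values below are made up, and the two non-ASCII separators are just examples of characters an extraction engine might return in place of "-".

```python
import re
import unicodedata

STRICT = re.compile(r"^\d{3}-?\d{2}-?\d{5}$")  # original expression
LOOSE = re.compile(r"^\d{3}.?\d{2}.?\d{5}$")   # "." accepts any separator

# Made-up values that look identical on screen but use different separators.
SAMPLES = [
    "123-45-67890",            # ASCII hyphen-minus (U+002D)
    "123\u201145\u201167890",  # non-breaking hyphen (U+2011)
    "123\u201345\u201367890",  # en dash (U+2013)
]

if __name__ == "__main__":
    for text in SAMPLES:
        sep = text[3]
        print(
            f"U+{ord(sep):04X}",
            unicodedata.name(sep),
            "strict:", bool(STRICT.fullmatch(text)),
            "loose:", bool(LOOSE.fullmatch(text)),
        )
```

Only the ASCII variant passes the strict expression, while all three pass the loose one. Printing the code point of the separator in the extracted text is a quick way to confirm whether this is the cause.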
Thanks