02-12-24 01:15 AM
Hello, everyone,
I’m currently working on automating the processing of invoice documents using Decipher IDP.
However, I’ve encountered an issue where data extraction fails for a specific field on the document.
As shown in the screenshot, a bounding box is drawn around the invoice number, but the data itself isn’t extracted.
Interestingly, if I manually click the Refresh Region button, the correct value is extracted with 100% accuracy.
(In other words, the field is correctly read only when someone intervenes to press the Refresh Region button.)
I’d like to know if there’s a way to resolve this issue.
I defined the DFD field as follows:
ID : FT_1_USER_FIELD
Format : Text
Flags : Assignable, Required, AutoCalculate
Format Expression : ^\d{3}-?\d{2}-?\d{5}$ <---- expression for the tax number
Dependent Items : FT_1_USER_FIELD
Formula : STRREPLACE(FT_1_USER_FIELD, " ", "") <---- intended to remove spaces
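For anyone who wants to check the format expression outside Decipher, here is a minimal Python sketch. It assumes Python's `re` module behaves closely enough to Decipher's regex engine for this pattern, which is not confirmed. The sample values are made up, and `strip_spaces` is only a hypothetical stand-in for the STRREPLACE formula.

```python
import re

# Format expression copied from the DFD definition above.
TAX_NUMBER = re.compile(r"^\d{3}-?\d{2}-?\d{5}$")


def strip_spaces(value: str) -> str:
    """Hypothetical stand-in for STRREPLACE(value, " ", "")."""
    return value.replace(" ", "")


# Made-up sample values, not taken from a real invoice.
samples = ["123-45-67890", "1234567890", "123-45- 67890", "123 - 45 - 67890"]

for raw in samples:
    cleaned = strip_spaces(raw)
    # Columns: raw text, matches as read, matches once spaces are removed.
    print(repr(raw), bool(TAX_NUMBER.match(raw)), bool(TAX_NUMBER.match(cleaned)))
```

If the raw text contains a space, the anchored expression rejects it before any clean-up runs. That is one possible reason a value can be located but not accepted, though it is only a guess about what Decipher does internally.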
If anyone has encountered and solved a similar problem, I’d greatly appreciate your advice. Thank you!
02-12-24 08:30 AM
Hi @Sangjun ,
I would recommend removing the formula and deselecting the Auto-Calculate flag as there is potentially a more efficient way to remove the unwanted space.
You can use the misc parameter "RegexMode" and set it to "2", which uses a fuzzy matching method and can automatically remove unwanted spaces.
You can see in my example where I'm using the expression [A-Z]{2}[0-9]{5} that the space is causing an error.
I then set the Regex Mode to 2.
When I retry the same document, Decipher removes the space.
Ensure you test this with other documents as other characters can be replaced e.g. o can be changed to 0.
Also when using a formula like this, it's best to use the special variable SELF e.g. STRREPLACE(SELF, " ", "").
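To illustrate the caveat about character replacement, here is a rough Python sketch of a "loose" matcher. This is not Decipher's RegexMode algorithm, which is not described in this thread. The `loose_match` function and the `LOOKALIKES` table are hypothetical and only show why a fuzzy match needs testing on more than one document.

```python
import re

# Expression from the example above.
PATTERN = re.compile(r"[A-Z]{2}[0-9]{5}")

# Assumed look-alike substitutions, for illustration only.
LOOKALIKES = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1"})


def loose_match(value: str):
    """Return the first candidate that fully matches PATTERN, or None."""
    no_spaces = value.replace(" ", "")
    candidates = [
        value,      # text exactly as read
        no_spaces,  # spaces removed
        # look-alike letters swapped for digits after the two-letter prefix
        no_spaces[:2] + no_spaces[2:].translate(LOOKALIKES),
    ]
    for candidate in candidates:
        if PATTERN.fullmatch(candidate):
            return candidate
    return None


print(loose_match("AB 12345"))  # space removed
print(loose_match("AB 1o345"))  # "o" silently becomes "0"
```

The second call shows the risk. The value now fits the expression, but only because a character was changed, so the result may no longer be what is printed on the document.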
Kind Regards
03-12-24 01:02 AM
Dear Ben Lyons,
Even after applying the method you suggested, the issue has not been resolved.
When I create a new DFD specification and test it with only the StrictPosition=On option, the region with the rectangular border is formed, but the issue of not capturing the tax number inside remains the same.
The resolution of the PDF file seems fine, and I believe the data is of a vector type, but I am unsure why this issue is occurring.
03-12-24 01:27 AM
In any case, I am now using the method you suggested: no formula, no Auto-Calculate flag, and 'RegexMode=2' in the misc parameters to remove the spaces.
03-12-24 01:44 AM
Hi @Sangjun ,
I have encountered issues similar to what you mentioned. Below are the steps I usually follow. In some cases, I have observed that Decipher can automatically pick up the fields after performing these steps. You can give it a try.
• Take a backup of the training data.
• Delete the existing training data.
• Train the document again. When starting the training process, please select only the “Assignable” flag and remove the formulas and miscellaneous parameters.
Regards,
Athiban
03-12-24 08:23 AM
Hi @Sangjun ,
I agree that it might be useful to restart your training as a new DFD will use the same training data (unless the segregation option has been selected).
It can also happen that the hyphen character "-" in the expression is not the same Unicode character as the one returned by the PDF extraction or OCR engine. It could be worth trying this expression: ^\d{3}.?\d{2}.?\d{5}$ .
Though the first step would certainly be to test it without any expression.
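The hyphen point can be checked with a short Python sketch. The sample strings are invented, and Python's `re` module is only assumed to behave like Decipher's engine here. The `describe` helper is hypothetical and simply lists the code point of each separator.

```python
import re
import unicodedata

STRICT = re.compile(r"^\d{3}-?\d{2}-?\d{5}$")  # original expression
LOOSE = re.compile(r"^\d{3}.?\d{2}.?\d{5}$")   # suggested diagnostic expression

# Invented samples that look alike on screen but use different separators.
samples = {
    "hyphen-minus": "123-45-67890",
    "non-breaking hyphen": "123\u201145\u201167890",
    "en dash": "123\u201345\u201367890",
}


def describe(value: str):
    """List code point and Unicode name for every non-digit character."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in value if not ch.isdigit()]


for label, value in samples.items():
    print(label, bool(STRICT.match(value)), bool(LOOSE.match(value)), describe(value))
```

Note that `.?` accepts any single character in the separator positions, including letters, so it is better treated as a diagnostic than as the final expression.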
Thanks
a month ago
I would like to first express my gratitude to Athiban_Mahamathi and Ben Lyons.
Following the advice from both of you, I cleared the DFD options and Training Data and tried again from scratch.
When no classification model is applied, everything works as expected. However, when a classification model covering a mix of three document types is applied, the field still fails to be captured.
I am using the latest version of IDP with Korean set as the language. I'm not sure if it's specifically related to Korean, but at this point, I have no choice but to continue working with the current state.
Once again, I would like to thank both of you for your help.
a month ago
Hi @Sangjun ,
It may be worth looking at setting the language at Document Type level. Language and locale can be set on both the Document Type and the Batch Type; the Batch Type setting acts as the default language for all of its Document Types.
However, if you set a language and locale in the Document Type, it will supersede/override the one set in the Batch Type. This is to allow you to have multiple languages processed within the same Batch Type. This may be affected by the workflow when using a classification model.
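The precedence described here can be pictured with a tiny Python sketch. It only models the rule as stated in this post and is not Decipher code. The `effective_language` function and the language codes are hypothetical.

```python
from typing import Optional


def effective_language(batch_type_language: str,
                       document_type_language: Optional[str] = None) -> str:
    """Document Type setting wins when present; otherwise use the Batch Type default."""
    return document_type_language or batch_type_language


# Illustrative codes only: a Batch Type default with one Document Type override.
print(effective_language("en-GB"))
print(effective_language("en-GB", "ko-KR"))
```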
If it's still not working as expected, please raise a support ticket so that we can investigate further.
Thanks
a month ago - last edited a month ago
@Sangjun , glad to hear that you were able to resolve the issue. Regarding the language setup, I would advise you to try the workaround suggested by Ben.