Question on the confidence score

XavierGruchet · 01-07-22

Hi everyone,

I understand that a confidence score is calculated against each document. And then depending on the confidence score and the threshold, we can in the config either skip the classification verification or the data capture verification. Is that correct?

How can I retrieve the confidence score from Decipher? On which table is it available? Is the confidence score based on the machine learning model of the classification and the machine learning model of the data verification? Or are there different confidence scores available (one for the classification and one for the data verification)?

For the data verification, is there a confidence score that I can retrieve against each captured field?

Thanks

BenLyons · 04-07-22

Hi Xavier,

The confidence scoring is by field, and then summarised at document level. There's a miscellaneous parameter that can be used to adjust the threshold at a field level "FCL=", but unfortunately the raw score is not reportable. This means you may need to carry out an element of trial and error.

There's a new Excel reporting tool due out shortly (currently with QA) which will help give you the outcome for each document and field, which may help. You can also view the document confidence on the Document History Report, 2 = High Confidence, 1 = Low Confidence.

We have some exciting reporting features we aim to add to v2.3 which will help better understand the document training progress.

Thanks

Ben Lyons Senior Product Specialist - Decipher SS&C Blue Prism UK based

XavierGruchet · 04-07-22

Thanks. Just to check my understanding: the confidence you are referring to is calculated at the field level. There is also the CCC (measuring the image-to-text quality) and the FCL (measuring the correct position of the field), and neither is directly accessible for the moment. Is that correct?

Are they available in the application db within the XML structure? In the DocumentCaptureData db, inside the XML, there is a field confidence for each captured item, but the values look odd (64, 50, etc.). Is it interpretable?

How is the global confidence of the document derived from the confidence of each field? How is it decided whether it is high or low confidence? What is the threshold? Is it linked to some captured fields not having the expected format, or to a failed control implemented against the field in the DFD? (Red cell or red value in the data verification interface)

I think this confidence is different from the classification confidence mentioned when creating a new document type. Is there a way to retrieve this classification confidence, and how is it calculated?

Thanks

BenLyons · 05-07-22

Hi Xavier,

Some great questions there and I'll do my best to answer them.

So CCL and FCL mean Character Confidence Limit and Field Confidence Limit respectively. They're calculated as percentages, but not stored or made available in that format. The CCL is calculated by the OCR engine and is considered 'high' if it's 86% or higher.

Decipher then calculates the FCL based on previous training and DFD configuration. High confidence is considered to be 95%+. However this isn't the feedback you see via the red characters in the Data Verification.

The CCL & FCL are combined (unfortunately I can't share the detail behind this) and Low confidence fields (Between 80% - 90%) are then coloured Red. There are developments in this space (targeting v2.3) to improve the clarity and available information.
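To make the two thresholds above concrete, here is a minimal Python sketch. It is not Decipher code: the function and constant names are made up for illustration, and it deliberately keeps the two checks separate because the way Decipher combines CCL and FCL is not public.

```python
# Illustrative only: thresholds as quoted in this post, names are hypothetical.
CCL_HIGH_PCT = 86.0  # OCR character confidence counts as 'high' at 86% or above
FCL_HIGH_PCT = 95.0  # field confidence counts as 'high' at 95% or above


def ccl_is_high(ccl_pct: float) -> bool:
    """Return True when a character confidence percentage meets the 86% limit."""
    return ccl_pct >= CCL_HIGH_PCT


def fcl_is_high(fcl_pct: float) -> bool:
    """Return True when a field confidence percentage meets the 95% limit."""
    return fcl_pct >= FCL_HIGH_PCT


if __name__ == "__main__":
    # The combined score that drives the red colouring is NOT modelled here.
    print(ccl_is_high(86.0), fcl_is_high(94.9))
```

Since the raw percentages are not reportable, this is only useful as a mental model of the limits, not as something you can run against Decipher data.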

On to how it's stored. If you have opted to encrypt your data (as we would of course recommend), the field storing this data is encrypted. If you've opted to have your data unencrypted, you can access this field, however it's not in a clear/easy-to-use format.

I hope my reporting process will help bridge this gap in the near term. Here are a couple of screenshots to give you an idea of what's coming.

I'll discuss the classification confidence with the developers and come back to you.

I hope this helps.

Thanks

Ben

XavierGruchet · 08-07-22

Thanks Ben. Waiting for your feedback on the classification confidence.

I understand that against the batch, there is a ML classification model to classify the different document types. And against a document type, there is a ML capture model to learn the different fields from the dfd. Is that correct?

This data capture confidence that you have described (combination of FCL and CCC) is based on the capture model of the document, is that correct?

But there is something I am confused about: where do I get, at the document level, the different levels you mentioned (green, orange and red)? At the document level, I just get the low or high confidence flag. And by field, at the data capture verification, I can see the red cell or red content.
Is this low/high confidence flag based on the combined score of FCL and CCC, so that, following the threshold declared in the document type (95 by default), the document is flagged as high or low confidence?

But then what is the link with the classification confidence score? Because we can skip the classification verification and data capture verification if it is high confidence. But this high confidence is based on the capture model of the document, nothing to do with the classification model?

Then when is the classification model assessed?

Thanks

BenLyons · 11-07-22

Hi Xavier,

I've just caught up with the development team and they advised that the confidence score can be found in the data table "Document", in the field "Class Info". This has all the info for what occurred during the classification stage, but it's not in the most user-friendly format (as it's not designed for external usage).

The format looks a bit like this:

<?xml version="1.0" encoding="utf-8"?>
<doc_class_info xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" sel_class_name="AP Attachment" sel_class_conf="99995" doc_sep_conf="197368">
<dcp name="A/P Invoice - Goods" prob="4" />
<dcp name="AP Attachment" prob="99995" />
</doc_class_info>

So you can see the selected document type was "AP Attachment", but it assigned probabilities to both available document types, "A/P Invoice - Goods" and "AP Attachment". The probabilities appear to be 4% and 95% respectively.
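If you want to pull these values out programmatically, a rough Python sketch using only the standard library could look like the following. The function name `parse_class_info` is made up for illustration, and how you read the "Class Info" field from the database is left out. The scale of `prob` and `sel_class_conf` is not documented in this thread, so the values are kept as raw integers rather than converted to percentages.

```python
import xml.etree.ElementTree as ET

# Sample copied from the post above; in practice this would come from the
# "Class Info" field of the "Document" table.
CLASS_INFO = """<?xml version="1.0" encoding="utf-8"?>
<doc_class_info xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" sel_class_name="AP Attachment" sel_class_conf="99995" doc_sep_conf="197368">
<dcp name="A/P Invoice - Goods" prob="4" />
<dcp name="AP Attachment" prob="99995" />
</doc_class_info>"""


def parse_class_info(xml_text: str) -> dict:
    """Extract the selected class and the per-class raw scores (hypothetical helper)."""
    # Parse from bytes so the encoding declaration in the XML header is honoured.
    root = ET.fromstring(xml_text.strip().encode("utf-8"))
    return {
        "selected": root.get("sel_class_name"),
        "selected_conf": int(root.get("sel_class_conf")),
        "doc_sep_conf": int(root.get("doc_sep_conf")),
        # One <dcp> element per candidate document type; values are raw integers.
        "candidates": {d.get("name"): int(d.get("prob")) for d in root.findall("dcp")},
    }


if __name__ == "__main__":
    info = parse_class_info(CLASS_INFO)
    print(info["selected"], info["selected_conf"])
    for name, prob in info["candidates"].items():
        print(name, prob)
```

Because the format is internal and not designed for external usage, treat the attribute names as something that could change between versions.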

So you have a classification model, which only determines the document type, the native ML model (rules-based, always on) and the ML Capture model. The latter 2 can both be used for associating a region with a field.

The CCL is determined largely by the OCR engine, as it has "read" the document. The FCL is determined by Decipher's capture client, which uses the DFD, native ML model and ML capture model (if one exists, as they're not mandatory). But neither the FCL nor the CCL is recorded in any model; only the outcomes are.

There is no link between classification and capture training; they are entirely separate models, stages and engines.

You can set the auto-skip verification for the classification and data verification separately (and by Batch Type in v2.2), each will be based on their own confidence scoring mechanisms.

The data used to populate the Excel report is more raw than what is represented in the report; all the colour coding and phrasing is just how I've formatted the report.

I think I've answered everything, but I believe you have a session with Gabi on Wednesday so feel free to ask anything you're unsure about.

Thanks

Ben

XavierGruchet · 11-07-22

Thanks Ben

If I understand correctly, in the batch type, when we decide to skip the classification verification or the data capture verification, the decision is based on the classification scoring or the data capture scoring respectively. For each scoring, there is a threshold that defines whether the document has a high or low confidence.

To my understanding, the threshold to define if a document has a high or low confidence for the classification can be defined in the document type. It is by default 95%.

What is the threshold then to define if a document has a high or low confidence in terms of data capture? Is it just depending on the number of corrected updated fields (in green in your message)?

BenLyons · 12-07-22

Hi Xavier,

There's no single value you can amend to change the high/low threshold for data capture. This is done at a field level with the FCL parameter. If all fields are 'high', then the document is 'high'.
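As a rough sketch of that rule (not Decipher's implementation), the document-level outcome can be thought of as an "all fields high" check. The function name and field names below are hypothetical; the codes 2 and 1 are the High/Low values shown on the Document History Report mentioned earlier in this thread.

```python
from typing import Mapping

# Codes as shown on the Document History Report: 2 = High, 1 = Low.
HIGH_CONFIDENCE = 2
LOW_CONFIDENCE = 1


def document_confidence(field_is_high: Mapping[str, bool]) -> int:
    """Return 2 (high) only if every field is high, otherwise 1 (low)."""
    return HIGH_CONFIDENCE if all(field_is_high.values()) else LOW_CONFIDENCE


if __name__ == "__main__":
    # Hypothetical field names, for illustration only.
    print(document_confidence({"Invoice Number": True, "Total": True}))
    print(document_confidence({"Invoice Number": True, "Total": False}))
```

A single low field is enough to make the whole document low, which is why the tuning happens per field with the FCL parameter rather than with one document-wide setting.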

Thanks

Ben

XavierGruchet · 12-07-22

Thanks Ben
