Tying the Training Data Mapping Used for a Specific Decipher Document ID?

stepher
Level 5

Question:  Within the Decipher database, is there a way to determine which Training Data mapping was used to process a specific Decipher Document?

Background:  We are using Decipher v2.1. Additionally, we have a document supplied by a vendor where the values are contained within a grid. Decipher incorrectly over-scans a specific value, capturing the grid border as a "|" character at the beginning of the value. Coincidentally, the summary report that we generate out of Blue Prism is pipe-delimited, so this causes downstream issues.
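As an interim workaround while the extraction issue is investigated, the stray grid pipe can be stripped from values before the report is assembled. This is an illustrative sketch, not part of Decipher or Blue Prism; the function names are hypothetical:

```python
def sanitize_value(value: str) -> str:
    """Strip a stray grid-border pipe and surrounding whitespace from an
    extracted field value before it enters a pipe-delimited report.
    (Hypothetical helper, not a Decipher API.)"""
    return value.strip().lstrip("|").strip()


def to_pipe_delimited(row: list[str]) -> str:
    """Join sanitized values into one pipe-delimited report line."""
    return "|".join(sanitize_value(v) for v in row)
```

For example, a value extracted as `"| 1,234.56"` becomes `"1,234.56"` before it can corrupt the delimiter structure of the report.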

I also know that the Training Data file is a renamed compressed archive containing the various document mappings. My thought was that if I could determine which specific mapping(s) were used to process the erroneous documents, I could [eventually] remove them from the training data set. Hopefully, this would cause this supplier's documents to stop at Data Verification and allow a more correct mapping to be created.
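Since the export is described as a renamed compressed folder, its contents can be inspected without renaming it, assuming it is a standard zip archive (an assumption; the internal layout may differ between Decipher versions):

```python
import zipfile


def list_training_mappings(archive_path: str) -> list[str]:
    """List the entries inside a Decipher training-data export,
    assumed here to be a renamed zip archive."""
    with zipfile.ZipFile(archive_path) as zf:
        return zf.namelist()
```

Listing the entries first makes it easier to identify which mapping file corresponds to the problem documents before removing anything.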

Is this even possible?

Thanks,
Red

Robert "Red" Stephens, Application Developer, RPA, Sutter Health, Sacramento, CA

BenLyons
Staff

Hi Red,

There's a relatively straightforward way to do this; you just need access to the Decipher database.

Every time the corresponding template in the training data is matched with a document, a field in the database is updated with the date and time of the match. So if you upload a batch with just one of those documents, you can see which template it was matched to.

Check the RegionTemplate table and its LastMatchedOn field. The Id will correspond with the file in the training data export.
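The lookup described above amounts to a simple query: after uploading a batch containing just one of the problem documents, the most recently matched row in RegionTemplate identifies the template. This sketch uses SQLite purely to illustrate the query shape; the real Decipher database is not SQLite, and only the RegionTemplate, Id, and LastMatchedOn names come from the post — the sample Ids and timestamps are invented:

```python
import sqlite3

# Mock schema for illustration; the real table lives in the Decipher database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE RegionTemplate (Id INTEGER, LastMatchedOn TEXT)")
conn.executemany(
    "INSERT INTO RegionTemplate VALUES (?, ?)",
    [(101, "2023-05-01 09:12:00"), (102, "2023-05-02 14:30:00")],
)

# After uploading a batch with one of the problem documents, the template
# with the newest LastMatchedOn is the one that processed it.
row = conn.execute(
    "SELECT Id, LastMatchedOn FROM RegionTemplate "
    "ORDER BY LastMatchedOn DESC LIMIT 1"
).fetchone()
print(row)  # the Id maps to a file name in the training data export
```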
With respect to the actual challenge you're experiencing with the table lines, there have been a few improvements in v2.2 that may help resolve it.

If the document is a PDF with vector data (i.e. the text can be selected by the user), then an update to how this data is used may help. In 2.1 and earlier, the vector data is extracted and often used instead of running the OCR stage. However, when Decipher detects there is information it couldn't read from the vector data, it runs a full OCR of the document and uses that instead of the vector data.

OCR data is more likely to have difficulty with table borders, so you don't get the best experience. In 2.2 these two sources are merged, giving you the best of both worlds and potentially removing the table-border issue.

If that's not the case, there were also updates to region segmentation that improve how table borders are processed, along with general table-processing improvements.

It might be worth trying a local install of v2.2 to see if it helps.

Thanks

Ben


Ben Lyons, Senior Product Specialist - Decipher, SS&C Blue Prism, UK based