Filtering Out of Scope Documents

foehl · ‎12-03-24

I'm working with Decipher to classify documents, but not to extract data. I have created DFDs and document types to cover the valid documents that we want to recognise, and Decipher seems to be handling this pretty well as long as I give it documents that are one of the two form types we're interested in.

Our workflow stream means that there will also be a lot of "junk" accompanying the forms, and we want Decipher to classify these out-of-scope files and reject them, filter them out, or raise an exception for them.

The problem I have is that Decipher currently divides its 100% confidence across all document types, so if it's only 40% sure that a file is a type 1 document, it is by definition 60% sure that it's a type 2 document. But in our scenario, low confidence that a file is type 1 does not mean we can be confident that it is type 2.
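To illustrate the point, here is a minimal sketch. It is not Decipher's actual implementation; the raw scores, the type names and the `min_score` threshold are all made up for illustration. It shows that a classifier which normalises its scores across only the known types will always hand its full 100% to those types, whereas a reject threshold on the raw scores would allow a third "out of scope" outcome.

```python
import math

def softmax(scores):
    """Normalise raw scores so they sum to 1 across the known types only."""
    highest = max(scores.values())
    exps = {name: math.exp(value - highest) for name, value in scores.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}

def classify_with_reject(scores, min_score):
    """Pick the best-scoring type, unless even the best raw score is too weak."""
    best = max(scores, key=scores.get)
    if scores[best] < min_score:
        return "out_of_scope"
    return best

# Hypothetical raw scores for a junk file: it matches neither type well.
junk_scores = {"type_1": -3.2, "type_2": -2.8}

# Closed-set view: the two confidences still sum to 1 (roughly 40% / 60%).
probs = softmax(junk_scores)

# Open-set view: both raw scores are below the threshold, so reject the file.
label = classify_with_reject(junk_scores, min_score=0.0)
```

With these made-up numbers the normalised confidences come out at roughly 40% and 60% even though neither type is a good match, which is the behaviour described above.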

How can we configure Decipher to consider this third type of file? We considered creating a third document type, but trying to train it to identify all other types of file seems like the wrong answer. Does anyone have some guidance on how best to tackle this?

------------------------------
Felix Oehl
Belgium
------------------------------

Ben.Lyons1 · ‎12-03-24

Hi Felix,

Unfortunately, that's just how the classification stage works; I've had the same discussion with the development team. A change has been suggested, but it hasn't been added to the roadmap at this stage.

There are some additional classification methods planned for 2.4 which either don't require training or can be trained in a different way, but these won't be available until later this year (release date TBC).

Thanks

------------------------------
Ben Lyons
Senior Product Specialist - Decipher
SS&C Blue Prism
UK based
------------------------------

foehl · ‎12-03-24

Thanks for the quick response Ben! So the current way of working assumes that the document stream only contains files that match one or another file definition, is that correct?

Since we can't remove the out-of-scope documents from our input stream, what's the best way to deal with them at the moment? Decipher is splitting them out into exception batches fairly accurately, so do we just have to review those batches for false negatives and then delete the rest?

------------------------------
Felix Oehl
Belgium
------------------------------

Ben.Lyons1 · ‎13-03-24

It's tricky, but that might be the case.

Alternatively, you could create an extra document type and train it with examples of the out-of-scope documents. This is clearly going to be difficult if there's no consistency, but it might improve over time if the model is regularly updated with new samples. This would have to be done in the same environment where the classification model was created, as the training isn't migrated between environments. For example, if the model is created in dev, then it would have to be updated in dev (with the specific classification training batches uploaded there).

Another option: if the in-scope documents all contain something that could be trained as a field in the DFD, then documents without that field/data would be held as low confidence.
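A rough sketch of that routing idea, assuming the result for each document has been exported as a plain dictionary. The dictionary layout, the field name `form_reference` and the 0.8 threshold are all hypothetical and are not Decipher's output format.

```python
def route_document(result, required_field="form_reference", min_confidence=0.8):
    """Hold a document for review if the required field is missing or weak."""
    field = result.get("fields", {}).get(required_field)
    # No such field, or an empty value: probably not one of our forms.
    if field is None or not field.get("value"):
        return "hold_for_review"
    # Field found, but the extraction confidence is too low to trust.
    if field.get("confidence", 0.0) < min_confidence:
        return "hold_for_review"
    return "accept"

# Hypothetical results: a real form, and a junk file with no matching field.
form_result = {"fields": {"form_reference": {"value": "A-1042", "confidence": 0.95}}}
junk_result = {"fields": {}}

form_route = route_document(form_result)
junk_route = route_document(junk_result)
```

Anything routed to `hold_for_review` would still need a person to check it for false negatives before it is discarded.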

Thanks

------------------------------
Ben Lyons
Senior Product Specialist - Decipher
SS&C Blue Prism
UK based
------------------------------

Athiban_Mahamathi · ‎13-03-24

Hi Felix,

Instead of handling this in Decipher, you can remove the type 3 documents before pushing to Decipher. I have followed this approach in all my Decipher projects because Decipher can't automatically exception or remove out-of-scope documents. I assume that your type 1 and type 2 documents contain some keywords. You can use normal PDF operations to extract the text (send Ctrl + A and Ctrl + C) and use InStr to filter out the out-of-scope documents before pushing to Decipher via BP code.
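As a minimal sketch of that keyword check, written in Python rather than a Blue Prism code stage: it assumes the text has already been extracted from each file, and the keywords and file names are hypothetical placeholders.

```python
def is_in_scope(text, keywords):
    """Return True if any keyword occurs in the text (case-insensitive)."""
    lowered = text.casefold()
    return any(keyword.casefold() in lowered for keyword in keywords)

def split_by_scope(documents, keywords):
    """Split {file name: extracted text} into in-scope and out-of-scope names."""
    in_scope, out_of_scope = [], []
    for name, text in documents.items():
        target = in_scope if is_in_scope(text, keywords) else out_of_scope
        target.append(name)
    return in_scope, out_of_scope

# Hypothetical keywords that appear on the two form types.
KEYWORDS = ["Claim Form A", "Claim Form B"]

# Hypothetical extracted text for two incoming files.
docs = {
    "form.pdf": "CLAIM FORM A\nName: ...",
    "holiday.pdf": "Greetings from the beach",
}

in_scope, out_of_scope = split_by_scope(docs, KEYWORDS)
```

Only the files in `in_scope` would then be pushed to Decipher. This relies on the file having a text layer to copy from, so scanned images and photos would need a separate text-recognition step first.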

------------------------------
Athiban Mahamathi - https://www.linkedin.com/in/athiban-mahamathi-544a008b/
Technical Consultant,
SimplifyNext PTE LTD,
Singapore
------------------------------

foehl · ‎14-03-24

Hi Athiban, thanks for the response.

The issue is that it's not just PDFs we're putting in; it's also .jpg, .png and .tiff files, etc. Sometimes it's a screenshot from a client's phone showing a photo of the document, so it's more complicated.

The other issue I have with that approach is: If we can identify the type 1 and type 2 documents in order to know to send only them to Decipher, why do we need Decipher? (Bear in mind in our current use case we are not extracting data).

------------------------------
Felix Oehl
Belgium
------------------------------

Athiban_Mahamathi · ‎21-03-24

Hi Felix,

Apologies, I missed the part where you mentioned how you are using Decipher in your use case. Were you able to find a solution? I feel you could try the second approach suggested by Ben.

foehl · ‎02-05-24

In the end, we decided that Decipher didn't match our needs for this use case. We were looking to use it as a sort of 'junk filter' to remove out-of-scope files from the workstream, but Decipher seems to expect that this has already been done. So now we're looking at other possible solutions using AI to detect and separate documents (and photos of documents) from other images and photos.

Thanks @Ben.Lyons1 for the helpful suggestions. They helped us a lot with our POC and evaluation of the product.
