How to just capture raw text?

XavierGruchet · ‎09-11-21

Hello everyone,

We are just starting to use Decipher, and as a starting point we are most interested in extracting the raw text from the pages, without structuring it in Decipher. We only capture keywords to recognise a few items, so we don't need to map all the pages. Is there a way to just capture the text with Decipher? Ideally, for each document, one text string per page.

Thanks

Ben.Lyons1 · ‎10-11-21

Hi Xavier,

Decipher is intended for extracting data into a structured format, which is returned to Blue Prism as a collection. Unfortunately, this means there's no feature that would enable you to extract the full document.

Do you have a use case where this would be helpful or is this just for testing?

Thanks

Ben

Ben Lyons
Principal Product Specialist - Decipher
SS&C Blue Prism
UK based

XavierGruchet · ‎11-11-21

Hey Ben, thanks for your answer. I don't know how to reply to your answer, so I am replying to my own question.
Actually, the documents we want to capture are all combined in one PDF, which contains different types of invoices and forms. Is there a way to flag the document type page by page to teach Decipher, or do I need to extract them up front, feed them separately to Decipher, and then use the classifier on the combined PDF?
Another thing is that we already map the different invoices based on keywords. We can recognise the first page and the following pages of the same invoice: one keyword identifies the first page, and other keywords indicate that the next pages are additional pages of that invoice. How can we teach that to Decipher?
Would it be possible to organise a quick chat to see what is possible in Blue Prism, and maybe to organise a consultancy? You can contact me in private to discuss further. Thanks a lot.
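
For readers following along, the keyword rule described above can be sketched outside Decipher as a few lines of Python. This is only an illustration of the logic, not a Decipher feature: it assumes the text of each page is already available as a string, and the keyword lists and the `group_pages` name are hypothetical placeholders.

```python
from typing import List

# Hypothetical keywords; replace with the phrases found on your own documents.
FIRST_PAGE_KEYWORDS = ["invoice number"]
CONTINUATION_KEYWORDS = ["continued"]


def group_pages(page_texts: List[str]) -> List[List[int]]:
    """Group 0-based page indexes into documents using keyword rules."""
    documents: List[List[int]] = []
    for index, text in enumerate(page_texts):
        lowered = text.lower()
        is_first = any(k in lowered for k in FIRST_PAGE_KEYWORDS)
        is_continuation = any(k in lowered for k in CONTINUATION_KEYWORDS)
        if documents and is_continuation and not is_first:
            # Continuation page: attach it to the document currently open.
            documents[-1].append(index)
        else:
            # First-page keyword, or no keyword at all: start a new document.
            documents.append([index])
    return documents


pages = [
    "Invoice number 123",
    "Continued from previous page",
    "Application form",
    "Invoice number 456",
]
print(group_pages(pages))
```

A page that matches neither list starts its own document here, which is one possible choice; a first-page keyword always wins over a continuation keyword.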

Ben.Lyons1 · ‎12-11-21

Hi Xavier,

The document classification stage can be used to intelligently split documents, but you would need to train a classification model.

However, if your documents are all single pages, you can set the flag when creating the batch to split all files into single-page documents. You will still need a classification model to assign the appropriate Document Type (and DFD), but you will ensure all documents are consistently separated.

If you have the appropriate level of support agreement (which you will likely have if you have Decipher), you can request an Expert Connect session. Reach out to your account manager for more details.

Thanks

Ben

XavierGruchet · ‎12-11-21

Thanks Ben. Actually, our documents can be one page or several pages, it depends. That's why we need to check the type, but also whether the next pages belong to the same document.
To teach a classification model, do I need to feed the different types of documents (which can be one page or several pages) to Decipher separately? For example, for invoice type 1, do I collect all the PDFs of invoice type 1, whether one page or multiple pages, and then send them to Decipher for the learning process? Or can I directly feed Decipher a combined PDF with pages that belong to invoice type 1, pages that belong to invoice type 2, etc., and just flag them?
And if I have different invoice templates, does each belong to a different DFD, or can I define a common DFD with some variants? How close must the different invoice templates be to each other for the variants under the same DFD to work?
Thanks

Ben.Lyons1 · ‎12-11-21

Hi Xavier,

To train a classification model, you will need to upload batches which only contain examples of each respective document type. So you will have a batch of Doc Type A and a separate batch of Doc Type B. These examples will have to have been manually separated.

You use the Decipher action to upload a classification training batch, as it goes through a different process. Don't mark the model for training until all training batches are ready for classification training. Do mark it as extensible so that more than 1 document type can be trained.
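
As a rough illustration of the preparation step, the manually separated examples could be grouped into one batch per document type before uploading. This sketch does not call Decipher at all; the upload itself still goes through the Decipher action in Blue Prism. The file names, type names, and the `build_training_batches` helper are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_training_batches(
    examples: Iterable[Tuple[str, str]],
) -> Dict[str, List[str]]:
    """Return one batch (list of file names) per document type."""
    batches: Dict[str, List[str]] = defaultdict(list)
    for file_name, doc_type in examples:
        # Each file goes only into the batch for its own document type.
        batches[doc_type].append(file_name)
    return dict(batches)


# Hypothetical, manually separated example files with their document type.
examples = [
    ("a.pdf", "Invoice Type 1"),
    ("b.pdf", "Invoice Type 2"),
    ("c.pdf", "Invoice Type 1"),
]
for doc_type, files in build_training_batches(examples).items():
    print(doc_type, files)
```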

Thanks

Ben
