Blue Prism Product

This community covers the core Blue Prism RPA product.

 Extract text from PDF (scanned image)

Maneesh Vemula posted 12-06-2023 21:05

Hi All - I am trying to read the text from a scanned image that is saved in PDF format. Since this is an image and not an editable PDF, I am not able to use some of the existing VBOs from the Digital Exchange to read the text.

One option I came across is the Cloud Vision API skill, which has a 'Document Text Extraction' action, but it only accepts an image as input, so I cannot send the PDF file. An alternative is to take a screenshot and pass it to the API as an image, but I am not sure that is the best approach.

I came across another feature of the Cloud Vision API (https://cloud.google.com/vision/docs/pdf), 'Detect text in files (PDF/TIFF)'; however, it is not available via the Blue Prism skill.

Please let me know of any solutions that you've implemented for this use case!

Thanks in advance!

Harish Mogulluri

Hi Maneesh Vemula,

In general, Textract in AWS and Form Recognizer in Azure will work for the requirement you describe.
There are plenty of other document extraction tools (such as Hyperscience, ABBYY, and the Google Vision API). For some of them, you need to convert the PDF to base64 before extracting the text.
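
As a rough illustration of the base64 step, here is a minimal Python sketch. The helper names `pdf_to_base64` and `build_ocr_request` are made up for this example, and the request layout is an assumption modelled on the Cloud Vision PDF documentation linked in the question, so check it against the current API reference before relying on it. Other services expect different payloads.

```python
import base64
import json
from pathlib import Path


def pdf_to_base64(pdf_path):
    """Read a PDF file and return its raw bytes as a base64 ASCII string."""
    return base64.b64encode(Path(pdf_path).read_bytes()).decode("ascii")


def build_ocr_request(pdf_b64, pages=(1,)):
    """Build a JSON request body for a PDF text-detection call.

    Assumption: the field names follow the Cloud Vision 'files:annotate'
    request format; verify them against the official documentation.
    """
    return json.dumps({
        "requests": [{
            "inputConfig": {"content": pdf_b64, "mimeType": "application/pdf"},
            "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
            "pages": list(pages),
        }]
    })
```

The resulting JSON string could then be posted from a Blue Prism web API or code stage, with authentication handled separately.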

Leonardo Soares
Hello,
You can use Blue Prism Decipher or some other third-party tool to convert the image to text and then extract it.
Regards,