Extract text from PDF (scanned image)

maneesh.vemula1 · ‎06-12-23

Hi All - I am trying to read the text from a scanned image that is saved in PDF format. Since this is an image and not an editable PDF, I am not able to use some of the existing VBOs from the Digital Exchange to read the text.

One of the options I came across is the Cloud Vision API skill, which has an action 'Document Text Extraction'. However, it only accepts an image as input, so I am not able to send the PDF file. An alternative is to take a screenshot of the PDF and pass it to the API as an image, but I am not sure that is the best approach.

I came across another feature in the Cloud Vision API (https://cloud.google.com/vision/docs/pdf), 'Detect text in files (PDF/TIFF)'. However, this is not available via the Blue Prism skill.

Please let me know of any solutions that you've implemented for this use case!

Thanks in advance!

harish.mogulluri · ‎07-12-23

Hi Maneesh Vemula,

In general, Textract in AWS and Form Recognizer in Azure will work for the requirement you describe.
There are plenty of other document extraction tools (such as Hyperscience, ABBYY and the Google Vision API). For some of them you need to convert the PDF to base64 before trying to extract the text.
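
For the base64 step, here is a minimal Python sketch of what the conversion could look like. It only uses the standard library. The function names `pdf_to_base64` and `build_ocr_request` are made up for illustration, and the JSON shape is my understanding of the request body for the 'Detect text in files' feature linked above, so please check it against the Google documentation before relying on it.

```python
import base64
import json
from pathlib import Path
from typing import List


def pdf_to_base64(pdf_path: str) -> str:
    """Read a PDF from disk and return its bytes as a base64 ASCII string."""
    return base64.b64encode(Path(pdf_path).read_bytes()).decode("ascii")


def build_ocr_request(pdf_b64: str, pages: List[int]) -> str:
    """Build a JSON request body that carries the PDF inline (hypothetical helper).

    Assumption: the field names below follow the Cloud Vision file
    annotation request format; verify them against the official docs.
    """
    body = {
        "requests": [
            {
                "inputConfig": {
                    "content": pdf_b64,
                    "mimeType": "application/pdf",
                },
                "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
                "pages": pages,  # 1-based page numbers to process
            }
        ]
    }
    return json.dumps(body)
```

The resulting JSON string would then be sent as the body of an HTTP POST from whatever HTTP or Web API action you have available. The endpoint, authentication and page limits depend on the service, so treat those as things to confirm rather than something this sketch covers.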

-----------------------
If I answered your query, please mark it as the Best Answer.

Harish Mogulluri

LeonardoSQueiroz · ‎07-12-23

Hello,

You can use Blue Prism Decipher or some other third-party tool to convert the image to text and then extract it.

Regards,

Leonardo Soares RPA Developer América/Brazil

SS&C Blue Prism Community
