Ok, so I actually tried something yesterday for the first time regarding this. This was suggested to me by a much smarter person than I am so there's that.
I'm assuming you're using the 'Recognize Text' navigate action. I forget its exact name at the moment, but I mean the one that uses the Google Tesseract OCR engine that ships with Blue Prism now. You can still use that same engine, the same one Blue Prism uses to read the text off the PDF, but without ever opening Adobe and without dealing with any UI.
I've only half proofed this out (the 2nd half below), but I have confirmed this should all work fairly reliably. I cannot attach screenshots from where I am, so I'll just be using text unfortunately.
You need two pieces: (1) a code stage that converts your PDF to an image format such as .PNG or .TIF, and (2) a command-line call to the Tesseract exe in the Blue Prism Automate\Tesseract folder.
For (1), I haven't done this part yet, but it really looks as simple as downloading a DLL or two and copy/pasting some code into a C# code stage. Try here:
https://stackoverflow.com/questions/23905169/how-to-convert-pdf-files-to-image. If I end up proofing this out as well, I'll try to remember to come back and update this post or reply again.
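As a rough illustration of step (1) outside of a code stage, here is a minimal Python sketch that builds a Ghostscript command line to render a PDF to a 300 DPI grayscale TIF. This is a swapped-in technique, not the C# DLL approach from the link above, and I have not tested it inside Blue Prism. It assumes Ghostscript is installed and that its console executable is on the PATH as gswin64c; the function name and the file paths are made up for the example.

```python
import subprocess


def build_ghostscript_command(pdf_path, tif_path, dpi=300, exe="gswin64c"):
    """Build the argument list for rendering a PDF to a grayscale TIF.

    `exe` is an assumption: gswin64c is the usual name of the 64-bit
    Ghostscript console executable on Windows.
    """
    return [
        exe,
        "-dNOPAUSE",                  # do not pause between pages
        "-dBATCH",                    # exit when the input is processed
        "-sDEVICE=tiffgray",          # 8-bit grayscale TIFF output device
        f"-r{dpi}",                   # render resolution in DPI
        f"-sOutputFile={tif_path}",   # where the TIF is written
        str(pdf_path),
    ]


def convert_pdf_to_tif(pdf_path, tif_path, dpi=300, exe="gswin64c"):
    """Run Ghostscript and raise if it reports a failure."""
    cmd = build_ghostscript_command(pdf_path, tif_path, dpi, exe)
    subprocess.run(cmd, check=True)
    return tif_path
```

Calling convert_pdf_to_tif(r"C:\Temp\test.pdf", r"C:\Temp\test.tif") would then produce the TIF to feed into Tesseract, assuming those hypothetical paths exist on your machine.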
For (2), I imagine you could use one of the actions in the 'Utility - Environment' VBO that runs processes in order to launch the Tesseract exe. I did have luck getting it 'working' that way, but Blue Prism receives text on standard error that isn't actually an error. You may see this when you try it. First, though, try it from the command line.
This command (the exe path is in quotes because it contains spaces): "C:\Program Files\Blue Prism Limited\Blue Prism Automate\Tesseract\tesseract-4.0.0" C:\Temp\test.tif C:\Temp\out
Replace the first path 'C:\Temp\test.tif' with wherever your input file is and the second path 'C:\Temp\out' with where you want the output to go. Don't put a .txt extension on the output location because Tesseract will do that.
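If you would rather script the same call than type it, here is a minimal Python sketch of it. The exe path is copied from the command above and may differ on your install. The function names are hypothetical. Because the arguments are passed as a list, the spaces in the path need no extra quoting. Success is judged by the exit code rather than by whether anything arrived on standard error, since Tesseract can write non-error text there.

```python
import subprocess
from pathlib import Path

# Path taken from the post; adjust it if your install differs (assumption).
TESSERACT = (
    r"C:\Program Files\Blue Prism Limited\Blue Prism Automate"
    r"\Tesseract\tesseract-4.0.0"
)


def build_tesseract_command(image_path, output_base, exe=TESSERACT):
    """Return [exe, input image, output base] for a plain OCR run."""
    base = str(output_base)
    if base.lower().endswith(".txt"):
        base = base[:-4]  # Tesseract appends .txt itself
    return [exe, str(image_path), base]


def run_ocr(image_path, output_base, exe=TESSERACT):
    """Run Tesseract and return the path of the text file it should create."""
    cmd = build_tesseract_command(image_path, output_base, exe)
    result = subprocess.run(cmd, capture_output=True, text=True)
    # Text on stderr alone is not treated as a failure; only a non-zero
    # exit code is.
    if result.returncode != 0:
        raise RuntimeError(
            f"Tesseract exited with {result.returncode}: {result.stderr}"
        )
    return Path(cmd[2] + ".txt")
```

For example, run_ocr(r"C:\Temp\test.tif", r"C:\Temp\out") should leave the text in C:\Temp\out.txt.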
Two things to note:
- Supposedly TIF is the best file type to use for this, though PNG/JPG etc. may also work.
- From my minimal research and testing, the image has to be exactly 300 DPI. Converters typically will ask you what DPI you want especially for converting to TIF files.
The text output will go into the output text file, and then you can read that into Blue Prism and do string manipulation on it.
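To give an idea of that last step outside Blue Prism, here is a minimal Python sketch that reads the output file and pulls a value out of it. The helper names and the "Label: value" layout are assumptions for illustration only; your PDF's text will look different, and inside Blue Prism you would do the equivalent with a file-reading action and calculation stages.

```python
from pathlib import Path


def read_ocr_text(output_base):
    """Read the .txt file Tesseract wrote next to the given output base."""
    path = Path(str(output_base) + ".txt")
    # errors="replace" keeps odd bytes from aborting the read
    return path.read_text(encoding="utf-8", errors="replace")


def clean_lines(text):
    """Split into lines, trim whitespace, and drop empty lines.

    Form feed characters, which can appear as page separators, are
    treated as line breaks.
    """
    lines = text.replace("\f", "\n").splitlines()
    return [line.strip() for line in lines if line.strip()]


def find_value(lines, label):
    """Return the text after 'label:' on the first matching line, else None."""
    prefix = label.lower() + ":"
    for line in lines:
        if line.lower().startswith(prefix):
            return line[len(prefix):].strip()
    return None
```

With the output base from the command above, find_value(clean_lines(read_ocr_text(r"C:\Temp\out")), "Total") would return whatever follows "Total:" in the OCR text, or None if no such line exists.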
------------------------------
Dave Morris
3Ci @ Southern Company
Atlanta, GA
------------------------------