Read/get text from an email and also from pdf attachment

MarilynGagarin
Level 4
Hello,
We need to read text from an email body or from a PDF email attachment, such as the invoice number, amount, vendor number, and so forth.
I know some of you have already done this, and I need help with how to do it.
1) How can I get the text value from the email body?
2) For the PDF attachment, I was able to save the attachment to a folder and open the PDF, but how can I get a text value from it, like the Amount? The file is remittance.pdf (text, not an image).
2a) For PDFs, some posts mention OCR. Is this a tool I need to install to get the text value?

I saw a few posts, but they are not clear to me.
I appreciate your help.
Thanks,
Marilyn
(Connecticut)

------------------------------
Marilyn Gagarin
Senior Programmer/Analyst
United Rental, Inc.
America/New_York
------------------------------
16 REPLIES

ewilson
Staff
Hi @MarilynGagarin,

Assuming you're using Outlook/Exchange, you can use the Outlook VBO to collect the content of emails sent to specific mail accounts. There's also a Microsoft Graph connector for Outlook that could be used if you choose. However, the Outlook VBO supports more overall capability than the Graph-based connector at the moment.

To extract information from a PDF there are a few options. You might want to take a look at Decipher, which is Blue Prism's Intelligent Document Processing (IDP) platform. Alternatively, there are various PDF-related assets on the DX. Some are free while others have an associated cost.

There's also an example process available on the DX that shows various examples of extracting information. You might take a look at it if this is all new to you.
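Whichever asset ends up producing the text (email body or PDF), pulling out fields such as the invoice number or amount is then ordinary pattern matching. Here is a minimal sketch of that step in Python, outside Blue Prism, just to show the logic. The labels ("Invoice Nbr", "Vendor Nbr", "Amount") and the sample text are made up for illustration, so the patterns would need adjusting to the wording on the real remittance.

```python
import re
from decimal import Decimal

# Hypothetical labels; adjust them to the wording on the actual document.
PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|Nbr|Number|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE),
    "vendor_number": re.compile(
        r"Vendor\s*(?:No\.?|Nbr|Number|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE),
    "amount": re.compile(
        r"Amount\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}


def extract_fields(text):
    """Return a dict of the fields found in text; missing fields are None."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    if fields["amount"] is not None:
        # Drop thousands separators so the value can be used as a number.
        fields["amount"] = Decimal(fields["amount"].replace(",", ""))
    return fields


# Made-up sample of text already extracted from an email body or PDF.
sample = """Remittance Advice
Vendor Nbr: V-10442
Invoice Nbr: INV-88231
Amount: $1,250.75
"""

result = extract_fields(sample)
print(result)
```

The same regular expressions could probably be reused in a Blue Prism regex action or code stage once the text is available in a data item.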

Cheers,

------------------------------
Eric Wilson
Director, Integrations and Enablement
Blue Prism Digital Exchange
------------------------------

Hi Eric,
Yes, this is all new to me. Thank you for replying.
I downloaded the MS Outlook Business Object - Utility asset. We currently use the MapiEx VBO to send and get emails and to save attachments.
1) I will check this Outlook VBO. We are on BP version 6.3 (we will upgrade this year); I hope it is supported? The listing just says "Blue Prism Supported".

2) I also downloaded the Utility – Strings (extended) asset. I will check this too.

3) For reading text in a PDF, I checked Invoke - Itext Sharp, but our BP version is not supported.

4) I also saw Utility – PDF VBO. Same thing: our BP version is not supported.

5) Also, I tried to download the documents below, but it says "This content is blocked, Contact site owner". Where can I get these documents?
BluePrism MS Outlook VBO User Guide
Blueprism PDF Asset User guide

 

Appreciate all inputs and help.

Thanks,

Marilyn



------------------------------
Marilyn Gagarin
Senior Programmer/Analyst
United Rental, Inc.
America/New_York
------------------------------

@MarilynGagarin,

Those user guides are displayed as PDFs in a dynamic pop-up window (this is not a new window or tab). Do you not see this when you click the links? You do have to be logged into the DX to see these.

26860.png
I believe most of those VBOs will work with 6.3. They probably don't list it as supported because it wasn't specifically tested under that version. You can always try it, and if you run into an issue let us know.

Cheers,


------------------------------
Eric Wilson
Director, Integrations and Enablement
Blue Prism Digital Exchange
------------------------------

Hi Eric,
I made sure I am logged in to the DX, but I still get the same error and am not able to open the documentation/guides. I opened a case with the DX folks, and they provided me a copy while they figure out the cause.
I have a few more questions and need help.
1) I am able to save an email to a .msg file (using Save Email as File). How can I open this .msg file from my local drive and save it as a PDF file?
2) What is Tesseract? How does it work?

Thanks,
Marilyn

------------------------------
Marilyn Gagarin
Senior Programmer/Analyst
United Rental, Inc.
America/New_York
------------------------------

@MarilynGagarin,

I wonder if your browser has any sort of JavaScript execution limitations? That could be part of the issue. You could always try accessing the DX from a personal computer/laptop/iPad and see if you're able to view the documents.

As for your questions:
  1. The latest Outlook VBO, available on the DX, includes an action called Read from MSG. You can use that to open a .msg file. As for saving it as a PDF, that's a little bit more of a chore. There isn't an action on the Outlook VBO that supports this. However, there are alternatives.
    1. You could take the output from the Read from MSG action and create a temporary Word DOCX using the MS Word VBO. That VBO includes an action called ExportPDF. That would allow you to save the email as a PDF to file.
    2. Another option would be to create a separate VBO that automates the Outlook UI. With that, you could basically load the MSG and then call File -> Print on the UI, set the value of Printer to Microsoft Print to PDF, and then click the Print button. You'd also have to account for the Save As file dialog that pops up next, but it would work.
    3. If your Outlook instance has the Default Printer set to Microsoft Print to PDF you could add an additional action to the Outlook VBO that would load the MSG file and then call the PrintOut method of the MailItem object. The issue with this is that it doesn't accept any arguments. It just prints using all default settings. That's why the default printer would have to be Microsoft Print to PDF. The code for this would look something like the snippet that follows this list.
  2. As for Tesseract, it's an open source OCR engine. It's used by Decipher as part of its intelligent document processing capabilities I believe. I'm not a Decipher expert unfortunately. ☹ - UPDATE: Looks like Tesseract is also used as part of Surface Automation in Blue Prism. 
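' VB.NET sketch for a code stage; MSGFilePath is assumed to hold the full path of the .msg file.
Dim app = CreateObject("Outlook.Application")
Dim item As Outlook.MailItem = app.CreateItemFromTemplate(MSGFilePath)

Try
  ' PrintOut takes no arguments, so the item goes to the default printer.
  item.PrintOut()
Finally
  ' Release the COM references whether or not printing succeeded.
  item = Nothing
  app = Nothing
End Try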

Cheers,

------------------------------
Eric Wilson
Director, Integrations and Enablement
Blue Prism Digital Exchange
------------------------------

Hi Eric,
1) Yes, I was able to view the documents by accessing them from my personal laptop. Thank you!
2) I was able to use the Read from MSG action and save the result to a Word doc, but when I used the ExportPDF action I got an error; see below.
  Note: For the Filename input I tried leaving it blank and also giving a filename with the .pdf extension, but I still get this "bad filename" error.
  Note: [BodyFile] is the Word doc named BodyTest.doc where the body from the .msg file was saved.
Can you advise and tell me what I am missing here? I appreciate your help.

26863.png
26864.png

------------------------------
Marilyn Gagarin
Senior Programmer/Analyst
United Rental, Inc.
America/New_York
------------------------------

@MarilynGagarin,

You still have the document opened in the Word VBO when you attempt to export it to PDF, correct? If you've closed the DOCX before exporting it that would cause a problem. Also, on the filename, you do need to include the .pdf extension.

Here's a screenshot of the properties on my test:
26866.png
In this case [handle] is of course the numeric handle of the Word instance started with the call to Create Instance. [Document Name] is the value (i.e. TableTest.docx) that was returned to me from a call to Open Document. In your case, you should have gotten that name as output from a call to Save As I imagine (when you first saved the DOCX to file).

Cheers,


------------------------------
Eric Wilson
Director, Integrations and Enablement
Blue Prism Digital Exchange
------------------------------

Hi Eric,
I need advice, and it would help if you could provide more details or samples. Thank you, I appreciate it.
A. Choosing alternative #1: The page below worked, but I am having some issues:
1) When the MSG is saved to a Word .doc file, the table formatting in the output is not the same as when you manually print it to PDF; see the screenshots below.
Note: The output (body) from Read from MSG is saved to a text file named BodyTest.doc or BodyTest.docx. Is this how you save it to a Word doc?

2) When the MSG is saved to a Word .docx file, the file is corrupted and can't be opened.
Note: I'm not sure it matters whether the file is saved as .doc rather than .docx, as long as the PDF output matches the .msg formatting, especially for the tables.

B. Choosing alternative #2: Is this the same as doing it manually? That is, open the .msg file, then send Ctrl-P or spy on File, then spy on the Print box, send ENTER, and enter the PDF filename. How would I open/load the MSG? Is there an action/page that does this in MS Outlook VBO - Extended? Once I know that, I can do the rest. Can you give an example or prototype?
26871.png
This is the .MSG file
[IMAGE DELETED, CONTAINS POTENTIAL CONFIDENTIAL INFORMATION]

----If the .MSG file body output is saved as a .doc file, the data in the table wraps onto 2 lines; see the Jobname column----
[IMAGE DELETED, CONTAINS POTENTIAL CONFIDENTIAL INFORMATION]

----When doing this manually (open or click the .msg file, hit Ctrl-P, click Print (defaulted to Microsoft Print to PDF), then enter the PDF filename), the PDF table looks the same as in the .MSG file (see below)----

[IMAGE DELETED, CONTAINS POTENTIAL CONFIDENTIAL INFORMATION]

------------------------------
Marilyn Gagarin
Senior Programmer/Analyst
United Rental, Inc.
America/New_York
------------------------------

Hi @MarilynGagarin,

After thinking about this a bit more, I'm surprised #1 worked at all. When I was thinking about the body of the email I was thinking about plain text, but as you've shown it's really HTML/RTF. What did you actually do in the Save Body step?

For option 2 there's no standard VBO for automating the UI of Outlook and I think I know why now. I tried to build a quick example app model for you, but it seems the UI of Outlook is not easily spied. ☹ So while the tables aren't quite formatted the exact same way in your #1 solution, I think you may have to accept it to get what you want.

I did come across some articles online that talk about saving the original email to either the HTML or MHT (also HTML) format instead of MSG. Those files can then be opened directly with Word and exported to PDF. Do you have flexibility in what format your messages are saved to file in?
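If the messages can be saved as HTML, and the end goal is the values rather than the PDF itself, the table cells could also be read straight from the markup. Below is a minimal sketch using Python's standard html.parser, outside Blue Prism. The sample HTML and column names are invented for illustration (only "Jobname" comes from the screenshots described above), and a real Outlook HTML body will be messier than this.

```python
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse whitespace so text wrapped over two lines becomes one value.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up email body; the Jobname cell is deliberately wrapped over two lines.
html_body = """
<table>
  <tr><th>Invoice</th><th>Jobname</th><th>Amount</th></tr>
  <tr><td>INV-88231</td><td>Main Street
      Bridge Repair</td><td>1,250.75</td></tr>
</table>
"""

parser = TableTextParser()
parser.feed(html_body)
header, *records = parser.rows
rows = [dict(zip(header, record)) for record in records]
print(rows)
```

Because whitespace inside each cell is collapsed, a value that wraps onto a second line still comes back as a single string.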

Cheers,

------------------------------
Eric Wilson
Director, Integrations and Enablement
Blue Prism Digital Exchange
------------------------------