Spy or parse HTML?

PawelMrozik
Level 3
Hello All,

I am new to the community and just getting my feet wet in the Blue Prism space. I managed to complete the Foundation Training and I have been working on implementing an RPA solution and have come across a snag.

I had a look at one of the threads to get more insight, but the more approaches I try, the less sure I am which one to use.

Here is a simplified summary of the process:

1. An email needs to be pulled from Outlook based on the Sender and Subject
2. The email is in HTML format and is always the same, apart from the table entries
3. There are five table data elements that are of interest which need to be extracted

HTML Parsing

This is one of the approaches I have attempted.

1. I used the MS Outlook VBO to pull these emails into a collection and subsequently extract the email Body for parsing
2. The parsing is not pretty. I am essentially using the InStr and Mid functions to extract the data I need from the body. It would be best if I were able to properly parse HTML tables.

Even though I've mentioned that the HTML email is pretty much the same all the time, I would need to go through a batch of at least 100 for proper testing. I don't really like the InStr/Mid method; it just doesn't sit right.

I know the HTML Agility Pack is mentioned in the thread above, but it seems that I would need additional permissions to be able to import it at work. 
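
For illustration, here is a minimal sketch of what proper table parsing could look like without any third-party library. It is written in Python using only the standard `html.parser` module, so it is not Blue Prism code and would need porting to a code stage. The names `TableCellParser` and `extract_table_rows` and the sample body are hypothetical and not taken from the actual emails.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace so values compare cleanly.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_table_rows(html_body):
    """Return the table contents of an HTML email body as a list of rows."""
    parser = TableCellParser()
    parser.feed(html_body)
    parser.close()
    return parser.rows


# Hypothetical email body; the real layout is not shown in this thread.
SAMPLE_BODY = """
<html><body>
<p>Order summary</p>
<table>
  <tr><th>Field</th><th>Value</th></tr>
  <tr><td>Order ID</td><td> 12345 </td></tr>
  <tr><td>Customer</td><td>Jane Doe</td></tr>
</table>
</body></html>
"""

rows = extract_table_rows(SAMPLE_BODY)
```

Because the parser keys on the table structure rather than on character offsets, small changes in the surrounding markup should not shift the extracted values the way they can with InStr and Mid.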

Spying

The other approach, which was mentioned in the thread above, is this:

1. Save the email as HTML
2. Open the HTML file in a browser window
3. Use spy mode to pull the data

Although not as efficient, I found this to be the far simpler and more straightforward approach. The problem I came across is that the Outlook VBO does not allow me to save the email as HTML; it saves it as an MSG file.

Once I have all the .MSG files saved, I would need to:

1. Reopen each file
2. Save as HTML
3. Open each HTML file in a browser window
4. Spy

For those who are far more experienced, how would you personally approach this?

Thank you in advance,

------------------------------
Pawel Mrozik
------------------------------
3 REPLIES

JohnMorgan
Level 2
You have a few more options...

  1. If the HTML is really XHTML, you can parse it as XML.
  2. If the HTML is not XHTML, but uses limited tags that are not self-closing, you can convert those tags to their self-closing form and parse the HTML as XML.
    1. I have had good success doing this. I just wrote a utility VBO with actions to self-close the necessary tags.
  3. You can use regular expressions to extract the data you need. This is not so good for reading tabular data, but it can work.
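
Options 2 and 3 above can be sketched roughly as follows. This is a Python illustration only, not the utility VBO described above: the function names, the list of tags in `VOID_TAGS`, and the sample table are all assumptions. Note that an XML parser will still reject HTML-only entities such as `&nbsp;` and tags with mismatched case, so real emails may need further cleanup.

```python
import re
import xml.etree.ElementTree as ET

# Tags that HTML allows without a closing tag; extend to suit your emails.
VOID_TAGS = ("br", "hr", "img", "meta", "link", "input")


def self_close_void_tags(html):
    """Option 2, step 1: rewrite <br> or <img src="x"> as <br/> or <img src="x"/>."""
    pattern = re.compile(
        r"<(%s)\b([^<>]*?)\s*/?>" % "|".join(VOID_TAGS), re.IGNORECASE
    )
    return pattern.sub(
        lambda m: "<%s%s/>" % (m.group(1).lower(), m.group(2)), html
    )


def cells_via_xml(html_fragment):
    """Option 2, step 2: parse the cleaned fragment as XML, one list per <tr>."""
    root = ET.fromstring(self_close_void_tags(html_fragment))
    return [
        [" ".join(" ".join(cell.itertext()).split()) for cell in row.findall("td")]
        for row in root.iter("tr")
    ]


def cells_via_regex(html_fragment):
    """Option 3: pull <td> contents with a regex (flat list, no row grouping)."""
    cell_re = re.compile(r"<td\b[^>]*>(.*?)</td>", re.IGNORECASE | re.DOTALL)
    tag_re = re.compile(r"<[^>]+>")
    return [
        " ".join(tag_re.sub(" ", raw).split())
        for raw in cell_re.findall(html_fragment)
    ]


# Hypothetical fragment: valid HTML, but not valid XML until <br> is closed.
SAMPLE_TABLE = """<table>
<tr><td>Order ID</td><td>12345</td></tr>
<tr><td>Notes</td><td>Line one<br>Line two</td></tr>
</table>"""

xml_rows = cells_via_xml(SAMPLE_TABLE)
regex_cells = cells_via_regex(SAMPLE_TABLE)
```

The XML route keeps the row structure, which is why it tends to suit tabular data better; the regex route returns a flat list of cell values and would break on nested tables.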


------------------------------
John Morgan
------------------------------

Thank you for your reply John, it is greatly appreciated.

I took another look at what I have been working with and had a eureka moment when I found that it was XHTML, so I ran the code through an online parser to see what it would look like.

Then I pulled the emails using the VBO once again. I don't know whether it has to do with the VBO itself or something else, but this time I got plain text in the Body field of my collection. With the help of some Calculation stages, I was able to pull everything I needed.

Thank you once again, your suggestions will definitely come in handy in the future.

------------------------------
Pawel Mrozik
------------------------------

Hi Pawel,

Glad you solved it, and great help from the community.

Jack

------------------------------
Jack Look
Sr Product Consultant
Blue Prism
------------------------------