<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic RE: Spy or parse HTML? in Product Forum</title>
    <link>https://community.blueprism.com/t5/Product-Forum/Spy-or-parse-HTML/m-p/55485#M9817</link>
    <description>You have a few more options...&lt;BR /&gt;&lt;BR /&gt;
&lt;OL&gt;
&lt;LI&gt;If the HTML is really XHTML, you can parse it as XML.&lt;/LI&gt;
&lt;LI&gt;If the HTML is not XHTML, but uses only a limited set of tags that are left unclosed, you can convert those tags to their self-closing form and then parse the HTML as XML.
&lt;OL&gt;
&lt;LI&gt;I have had good success doing this. I just wrote a utility VBO with actions to self-close the necessary tags.&lt;/LI&gt;
&lt;/OL&gt;
&lt;/LI&gt;
&lt;LI&gt;You can use regular expressions to extract the data you need. This is not so good for reading tabular data, but it can work.&lt;/LI&gt;
&lt;/OL&gt;&lt;BR /&gt;&lt;BR /&gt;------------------------------&lt;BR /&gt;John Morgan&lt;BR /&gt;------------------------------&lt;BR /&gt;</description>
    <pubDate>Tue, 08 Mar 2022 14:36:00 GMT</pubDate>
    <dc:creator>JohnMorgan</dc:creator>
    <dc:date>2022-03-08T14:36:00Z</dc:date>
    <item>
      <title>Spy or parse HTML?</title>
      <link>https://community.blueprism.com/t5/Product-Forum/Spy-or-parse-HTML/m-p/55484#M9816</link>
      <description>Hello All,&lt;BR /&gt;&lt;BR /&gt;I am new to the community and just getting my feet wet in the Blue Prism space. I completed the Foundation Training and have been implementing an RPA solution, but I have come across a snag.&lt;BR /&gt;&lt;BR /&gt;I had a look at &lt;A href="https://community.blueprism.com/communities/community-home/digestviewer/viewthread?MessageKey=e31b25a6-5827-457b-90cf-c66130ff9802&amp;amp;CommunityKey=9efa2ecd-62b0-458a-b356-9b64643dccc5&amp;amp;tab=digestviewer" target="_blank" rel="noopener"&gt;one of the threads&lt;/A&gt; to get more insight, but as I try different approaches, I am unsure which one to use.&lt;BR /&gt;&lt;BR /&gt;Here is a simplified summary of the process:&lt;BR /&gt;&lt;BR /&gt;1. An email needs to be pulled from Outlook based on the&amp;nbsp;&lt;EM&gt;Sender &lt;/EM&gt;and &lt;EM&gt;Subject&lt;BR /&gt;&lt;/EM&gt;2. The email is in&amp;nbsp;&lt;EM&gt;HTML &lt;/EM&gt;format and is always the same, apart from the table entries&lt;BR /&gt;3. Five table data elements of interest need to be extracted&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;HTML Parsing&lt;/STRONG&gt;&lt;BR /&gt;&lt;BR /&gt;This is one of the approaches I have attempted.&lt;BR /&gt;&lt;BR /&gt;1. I used the MS Outlook VBO to pull these emails into a collection and subsequently extract the email Body for parsing&lt;BR /&gt;2. The parsing is not pretty. I am essentially using the InStr and Mid functions to extract the data I need from the body. It would be best if I were able to properly parse HTML tables.&lt;BR /&gt;&lt;BR /&gt;Even though I've mentioned that the HTML email is pretty much the same all the time, I would need to go through a batch of at least 100 for proper testing. 
I don't really like the above method; it just doesn't sit right.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;I know the HTML Agility Pack is mentioned in the thread above, but it seems that I would need additional permissions to be able to import it at work.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;Spying&lt;/STRONG&gt;&lt;BR /&gt;&lt;BR /&gt;The other approach, which was mentioned in the thread above, is this:&lt;BR /&gt;&lt;BR /&gt;1. Save the email as HTML&lt;BR /&gt;2. Open the HTML file in a browser window&lt;BR /&gt;3. Use spy mode to pull the data&lt;BR /&gt;&lt;BR /&gt;Although not as efficient, I found this to be the far simpler and more straightforward approach. The problem I came across is that the Outlook VBO does not allow me to save the email as HTML; it saves it as an MSG file.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;Once I have all the .MSG files saved, I would need to:&lt;BR /&gt;&lt;BR /&gt;1. Reopen each file&lt;BR /&gt;2. Save as HTML&lt;BR /&gt;3. Open each HTML file in a browser window&lt;BR /&gt;4. Spy&lt;BR /&gt;&lt;BR /&gt;For those who are far more experienced, how would you personally approach this?&lt;BR /&gt;&lt;BR /&gt;Thank you in advance,&lt;BR /&gt;&lt;BR /&gt;------------------------------&lt;BR /&gt;Pawel Mrozik&lt;BR /&gt;------------------------------&lt;BR /&gt;</description>
      <pubDate>Tue, 08 Mar 2022 12:34:00 GMT</pubDate>
      <guid>https://community.blueprism.com/t5/Product-Forum/Spy-or-parse-HTML/m-p/55484#M9816</guid>
      <dc:creator>PawelMrozik</dc:creator>
      <dc:date>2022-03-08T12:34:00Z</dc:date>
    </item>
    <item>
      <title>RE: Spy or parse HTML?</title>
      <link>https://community.blueprism.com/t5/Product-Forum/Spy-or-parse-HTML/m-p/55485#M9817</link>
      <description>You have a few more options...&lt;BR /&gt;&lt;BR /&gt;
&lt;OL&gt;
&lt;LI&gt;If the HTML is really XHTML, you can parse it as XML.&lt;/LI&gt;
&lt;LI&gt;If the HTML is not XHTML, but uses only a limited set of tags that are left unclosed, you can convert those tags to their self-closing form and then parse the HTML as XML.
&lt;OL&gt;
&lt;LI&gt;I have had good success doing this. I just wrote a utility VBO with actions to self-close the necessary tags.&lt;/LI&gt;
&lt;/OL&gt;
&lt;/LI&gt;
&lt;LI&gt;You can use regular expressions to extract the data you need. This is not so good for reading tabular data, but it can work.&lt;/LI&gt;
&lt;/OL&gt;&lt;BR /&gt;&lt;BR /&gt;------------------------------&lt;BR /&gt;John Morgan&lt;BR /&gt;------------------------------&lt;BR /&gt;</description>
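      <![CDATA[
The second option above (self-close the unclosed tags, then parse as XML) could look roughly like the following. This is only a sketch in Python, not the utility VBO mentioned in the reply, which is not shown in the thread. The tag list VOID_TAGS, the function names, and the sample email are all hypothetical, and it assumes the markup is otherwise well-formed and contains no HTML-only entities such as the non-breaking-space entity, which a plain XML parser would reject.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical list of unclosed tags expected in the email; extend as needed.
VOID_TAGS = ("br", "hr", "img")

# Matches an opening void tag, with or without attributes or a trailing slash.
VOID_TAG_PATTERN = re.compile(
    r"<(" + "|".join(VOID_TAGS) + r")\b([^<>]*?)\s*/?>",
    re.IGNORECASE,
)


def self_close_void_tags(html):
    """Rewrite tags such as <br> into their self-closing form <br />."""
    return VOID_TAG_PATTERN.sub(
        lambda m: f"<{m.group(1)}{m.group(2)} />", html
    )


def extract_cells(html):
    """Parse the repaired markup as XML and return the text of every <td>."""
    root = ET.fromstring(self_close_void_tags(html))
    return ["".join(td.itertext()).strip() for td in root.iter("td")]


# Made-up email body, used only to exercise the functions above.
sample = (
    "<html><body><p>Order summary<br></p>"
    "<table><tr><td>Order</td><td>12345</td></tr>"
    "<tr><td>Status</td><td>Shipped<br></td></tr></table>"
    "</body></html>"
)

if __name__ == "__main__":
    print(extract_cells(sample))
```

Because XML is case-sensitive, this only works if each opening tag and its closing tag use the same case. If the email body is not close to well-formed, a real HTML parser is the safer route.
      ]]>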
      <pubDate>Tue, 08 Mar 2022 14:36:00 GMT</pubDate>
      <guid>https://community.blueprism.com/t5/Product-Forum/Spy-or-parse-HTML/m-p/55485#M9817</guid>
      <dc:creator>JohnMorgan</dc:creator>
      <dc:date>2022-03-08T14:36:00Z</dc:date>
    </item>
    <item>
      <title>RE: Spy or parse HTML?</title>
      <link>https://community.blueprism.com/t5/Product-Forum/Spy-or-parse-HTML/m-p/55486#M9818</link>
      <description>Thank you for your reply, John; it is greatly appreciated.&lt;BR /&gt;&lt;BR /&gt;I started looking at what I have been working with and had a eureka moment when I found XHTML and ran the code through an online parser to see what it would look like.&lt;BR /&gt;&lt;BR /&gt;Then I pulled the emails using the VBO once again. I don't know whether it has to do with the VBO itself or something else, but I was able to get plain text in the Body field of my collection. With the help of some Calculation stages, I was able to pull everything I needed.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;Thank you once again; your suggestions will definitely come in handy in the future.&lt;BR /&gt;&lt;BR /&gt;------------------------------&lt;BR /&gt;Pawel Mrozik&lt;BR /&gt;------------------------------&lt;BR /&gt;</description>
      <pubDate>Tue, 08 Mar 2022 23:04:00 GMT</pubDate>
      <guid>https://community.blueprism.com/t5/Product-Forum/Spy-or-parse-HTML/m-p/55486#M9818</guid>
      <dc:creator>PawelMrozik</dc:creator>
      <dc:date>2022-03-08T23:04:00Z</dc:date>
    </item>
    <item>
      <title>RE: Spy or parse HTML?</title>
      <link>https://community.blueprism.com/t5/Product-Forum/Spy-or-parse-HTML/m-p/55487#M9819</link>
      <description>Hi Pawel.&lt;BR /&gt;&lt;BR /&gt;Glad you solved it, and great to see the help from the community.&lt;BR /&gt;&lt;BR /&gt;Jack&lt;BR /&gt;&lt;BR /&gt;------------------------------&lt;BR /&gt;Jack Look&lt;BR /&gt;Sr Product Consultant&lt;BR /&gt;Blue Prism&lt;BR /&gt;------------------------------&lt;BR /&gt;</description>
      <pubDate>Wed, 09 Mar 2022 15:18:00 GMT</pubDate>
      <guid>https://community.blueprism.com/t5/Product-Forum/Spy-or-parse-HTML/m-p/55487#M9819</guid>
      <dc:creator>lookman</dc:creator>
      <dc:date>2022-03-09T15:18:00Z</dc:date>
    </item>
  </channel>
</rss>

