<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic RE: Best Solution to Read PDF in Product Forum</title>
    <link>https://community.blueprism.com/t5/Product-Forum/Best-Solution-to-Read-PDF/m-p/82316#M33767</link>
    <description>Hi Cohen,&lt;BR /&gt;&lt;BR /&gt;Since it is an image PDF, C# text-extraction code such as iTextSharp will not work, and since OCR does not suit you, you should explore the assets on the Blue Prism Digital Exchange.&lt;BR /&gt;&lt;BR /&gt;&lt;A href="https://digitalexchange.blueprism.com/dx/" target="_blank" rel="noopener"&gt;https://digitalexchange.blueprism.com/dx/&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;There are many assets for reading PDFs, such as ABBYY FlexiCapture and Rossum, but most of them are paid.&lt;BR /&gt;&lt;BR /&gt;Another way is through Python: once the Python code is developed, you can call it from Blue Prism to perform the PDF reading.&lt;BR /&gt;&lt;BR /&gt;Thanks,&lt;BR /&gt;&lt;BR /&gt;------------------------------&lt;BR /&gt;Gerald J&lt;BR /&gt;Automation Engineer&lt;BR /&gt;10xds&lt;BR /&gt;Kerala/Kochi&lt;BR /&gt;+91-9159842805&lt;BR /&gt;------------------------------&lt;BR /&gt;</description>
    <pubDate>Wed, 26 Feb 2020 11:22:00 GMT</pubDate>
    <dc:creator>Gerald_J</dc:creator>
    <dc:date>2020-02-26T11:22:00Z</dc:date>
    <item>
      <title>Best Solution to Read PDF</title>
      <link>https://community.blueprism.com/t5/Product-Forum/Best-Solution-to-Read-PDF/m-p/82313#M33764</link>
      <description>Guys,&lt;BR /&gt;&lt;BR /&gt;What is the best way to read data from a PDF file?&lt;BR /&gt;&lt;BR /&gt;I looked for a C# code stage to do this, but found mostly old material based on iTextSharp. As far as I can tell, &lt;CODE&gt;using iTextSharp.text.pdf.parser;&lt;/CODE&gt; is no longer used, so this code would not work anymore:&lt;BR /&gt;
&lt;PRE class="language-csharp"&gt;&lt;SPAN class="token keyword"&gt;using&lt;/SPAN&gt; iTextSharp&lt;SPAN class="token punctuation"&gt;.&lt;/SPAN&gt;text&lt;SPAN class="token punctuation"&gt;.&lt;/SPAN&gt;pdf&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;
&lt;SPAN class="token keyword"&gt;using&lt;/SPAN&gt; iTextSharp&lt;SPAN class="token punctuation"&gt;.&lt;/SPAN&gt;text&lt;SPAN class="token punctuation"&gt;.&lt;/SPAN&gt;pdf&lt;SPAN class="token punctuation"&gt;.&lt;/SPAN&gt;parser&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;
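&lt;SPAN class="token keyword"&gt;using&lt;/SPAN&gt; System&lt;SPAN class="token punctuation"&gt;.&lt;/SPAN&gt;IO&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;
&lt;SPAN class="token keyword"&gt;using&lt;/SPAN&gt; System&lt;SPAN class="token punctuation"&gt;.&lt;/SPAN&gt;Text&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;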

&lt;SPAN class="token keyword"&gt;public&lt;/SPAN&gt; &lt;SPAN class="token keyword"&gt;string&lt;/SPAN&gt; &lt;SPAN class="token function"&gt;ReadPdfFile&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token keyword"&gt;string&lt;/SPAN&gt; fileName&lt;SPAN class="token punctuation"&gt;)&lt;/SPAN&gt;
&lt;SPAN class="token punctuation"&gt;{&lt;/SPAN&gt;
    StringBuilder text &lt;SPAN class="token operator"&gt;=&lt;/SPAN&gt; &lt;SPAN class="token keyword"&gt;new&lt;/SPAN&gt; &lt;SPAN class="token class-name"&gt;StringBuilder&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;

    &lt;SPAN class="token keyword"&gt;if&lt;/SPAN&gt; &lt;SPAN class="token punctuation"&gt;(&lt;/SPAN&gt;File&lt;SPAN class="token punctuation"&gt;.&lt;/SPAN&gt;&lt;SPAN class="token function"&gt;Exists&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;(&lt;/SPAN&gt;fileName&lt;SPAN class="token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;)&lt;/SPAN&gt;
    &lt;SPAN class="token punctuation"&gt;{&lt;/SPAN&gt;
        PdfReader pdfReader &lt;SPAN class="token operator"&gt;=&lt;/SPAN&gt; &lt;SPAN class="token keyword"&gt;new&lt;/SPAN&gt; &lt;SPAN class="token class-name"&gt;PdfReader&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;(&lt;/SPAN&gt;fileName&lt;SPAN class="token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;

        &lt;SPAN class="token keyword"&gt;for&lt;/SPAN&gt; &lt;SPAN class="token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token keyword"&gt;int&lt;/SPAN&gt; page &lt;SPAN class="token operator"&gt;=&lt;/SPAN&gt; &lt;SPAN class="token number"&gt;1&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt; page &lt;SPAN class="token operator"&gt;&amp;lt;=&lt;/SPAN&gt; pdfReader&lt;SPAN class="token punctuation"&gt;.&lt;/SPAN&gt;NumberOfPages&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt; page&lt;SPAN class="token operator"&gt;++&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;)&lt;/SPAN&gt;
        &lt;SPAN class="token punctuation"&gt;{&lt;/SPAN&gt;
            ITextExtractionStrategy strategy &lt;SPAN class="token operator"&gt;=&lt;/SPAN&gt; &lt;SPAN class="token keyword"&gt;new&lt;/SPAN&gt; &lt;SPAN class="token class-name"&gt;SimpleTextExtractionStrategy&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;
            &lt;SPAN class="token keyword"&gt;string&lt;/SPAN&gt; currentText &lt;SPAN class="token operator"&gt;=&lt;/SPAN&gt; PdfTextExtractor&lt;SPAN class="token punctuation"&gt;.&lt;/SPAN&gt;&lt;SPAN class="token function"&gt;GetTextFromPage&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;(&lt;/SPAN&gt;pdfReader&lt;SPAN class="token punctuation"&gt;,&lt;/SPAN&gt; page&lt;SPAN class="token punctuation"&gt;,&lt;/SPAN&gt; strategy&lt;SPAN class="token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;

            currentText &lt;SPAN class="token operator"&gt;=&lt;/SPAN&gt; Encoding&lt;SPAN class="token punctuation"&gt;.&lt;/SPAN&gt;UTF8&lt;SPAN class="token punctuation"&gt;.&lt;/SPAN&gt;&lt;SPAN class="token function"&gt;GetString&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;(&lt;/SPAN&gt;ASCIIEncoding&lt;SPAN class="token punctuation"&gt;.&lt;/SPAN&gt;&lt;SPAN class="token function"&gt;Convert&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;(&lt;/SPAN&gt;Encoding&lt;SPAN class="token punctuation"&gt;.&lt;/SPAN&gt;Default&lt;SPAN class="token punctuation"&gt;,&lt;/SPAN&gt; Encoding&lt;SPAN class="token punctuation"&gt;.&lt;/SPAN&gt;UTF8&lt;SPAN class="token punctuation"&gt;,&lt;/SPAN&gt; Encoding&lt;SPAN class="token punctuation"&gt;.&lt;/SPAN&gt;Default&lt;SPAN class="token punctuation"&gt;.&lt;/SPAN&gt;&lt;SPAN class="token function"&gt;GetBytes&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;(&lt;/SPAN&gt;currentText&lt;SPAN class="token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;
            text&lt;SPAN class="token punctuation"&gt;.&lt;/SPAN&gt;&lt;SPAN class="token function"&gt;Append&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;(&lt;/SPAN&gt;currentText&lt;SPAN class="token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;
        &lt;SPAN class="token punctuation"&gt;}&lt;/SPAN&gt;
        pdfReader&lt;SPAN class="token punctuation"&gt;.&lt;/SPAN&gt;&lt;SPAN class="token function"&gt;Close&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;
    &lt;SPAN class="token punctuation"&gt;}&lt;/SPAN&gt;
    &lt;SPAN class="token keyword"&gt;return&lt;/SPAN&gt; text&lt;SPAN class="token punctuation"&gt;.&lt;/SPAN&gt;&lt;SPAN class="token function"&gt;ToString&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;
&lt;SPAN class="token punctuation"&gt;}&lt;/SPAN&gt;&lt;/PRE&gt;
Thank you in advance!&lt;BR /&gt;&lt;BR /&gt;------------------------------&lt;BR /&gt;Cohen&lt;BR /&gt;RPA Developer&lt;BR /&gt;&lt;BR /&gt;Romania&lt;BR /&gt;------------------------------&lt;BR /&gt;</description>
      <pubDate>Tue, 25 Feb 2020 14:29:00 GMT</pubDate>
      <guid>https://community.blueprism.com/t5/Product-Forum/Best-Solution-to-Read-PDF/m-p/82313#M33764</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2020-02-25T14:29:00Z</dc:date>
    </item>
    <item>
      <title>RE: Best Solution to Read PDF</title>
      <link>https://community.blueprism.com/t5/Product-Forum/Best-Solution-to-Read-PDF/m-p/82314#M33765</link>
      <description>Hi Cohen,&lt;BR /&gt;&lt;BR /&gt;What kind of PDF file are you using?&lt;BR /&gt;&lt;BR /&gt;Is it readable (can you press Ctrl+A, then Ctrl+C, and paste the text into a text file)?&lt;BR /&gt;&lt;BR /&gt;Or&lt;BR /&gt;&lt;BR /&gt;Is it a scanned document?&lt;BR /&gt;&lt;BR /&gt;Or&lt;BR /&gt;&lt;BR /&gt;Is it an image converted to PDF?&lt;BR /&gt;&lt;BR /&gt;Thanks,&lt;BR /&gt;&lt;BR /&gt;------------------------------&lt;BR /&gt;Gerald J&lt;BR /&gt;Automation Engineer&lt;BR /&gt;10xds&lt;BR /&gt;Kerala/Kochi&lt;BR /&gt;+91-9159842805&lt;BR /&gt;------------------------------&lt;BR /&gt;</description>
      <pubDate>Wed, 26 Feb 2020 05:37:00 GMT</pubDate>
      <guid>https://community.blueprism.com/t5/Product-Forum/Best-Solution-to-Read-PDF/m-p/82314#M33765</guid>
      <dc:creator>Gerald_J</dc:creator>
      <dc:date>2020-02-26T05:37:00Z</dc:date>
    </item>
    <item>
      <title>RE: Best Solution to Read PDF</title>
      <link>https://community.blueprism.com/t5/Product-Forum/Best-Solution-to-Read-PDF/m-p/82315#M33766</link>
      <description>Someone told me that DLL is wicked, but it seems not. It can only copy text from a PDF, and my PDFs are images of invoices etc., so you cannot copy text from a picture embedded in a PDF &lt;span class="lia-unicode-emoji" title=":grinning_face_with_smiling_eyes:"&gt;😄&lt;/span&gt;&lt;BR /&gt;&lt;BR /&gt;Besides OCR, do you guys have other solutions? OCR seems to be very slow.&lt;BR /&gt;&lt;BR /&gt;------------------------------&lt;BR /&gt;Cohen&lt;BR /&gt;RPA Developer&lt;BR /&gt;&lt;BR /&gt;Romania&lt;BR /&gt;------------------------------&lt;BR /&gt;</description>
      <pubDate>Wed, 26 Feb 2020 10:54:00 GMT</pubDate>
      <guid>https://community.blueprism.com/t5/Product-Forum/Best-Solution-to-Read-PDF/m-p/82315#M33766</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2020-02-26T10:54:00Z</dc:date>
    </item>
    <item>
      <title>RE: Best Solution to Read PDF</title>
      <link>https://community.blueprism.com/t5/Product-Forum/Best-Solution-to-Read-PDF/m-p/82316#M33767</link>
      <description>Hi Cohen,&lt;BR /&gt;&lt;BR /&gt;Since it is an image PDF, C# text-extraction code such as iTextSharp will not work, and since OCR does not suit you, you should explore the assets on the Blue Prism Digital Exchange.&lt;BR /&gt;&lt;BR /&gt;&lt;A href="https://digitalexchange.blueprism.com/dx/" target="_blank" rel="noopener"&gt;https://digitalexchange.blueprism.com/dx/&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;There are many assets for reading PDFs, such as ABBYY FlexiCapture and Rossum, but most of them are paid.&lt;BR /&gt;&lt;BR /&gt;Another way is through Python: once the Python code is developed, you can call it from Blue Prism to perform the PDF reading.&lt;BR /&gt;&lt;BR /&gt;Thanks,&lt;BR /&gt;&lt;BR /&gt;------------------------------&lt;BR /&gt;Gerald J&lt;BR /&gt;Automation Engineer&lt;BR /&gt;10xds&lt;BR /&gt;Kerala/Kochi&lt;BR /&gt;+91-9159842805&lt;BR /&gt;------------------------------&lt;BR /&gt;</description>
      <pubDate>Wed, 26 Feb 2020 11:22:00 GMT</pubDate>
      <guid>https://community.blueprism.com/t5/Product-Forum/Best-Solution-to-Read-PDF/m-p/82316#M33767</guid>
      <dc:creator>Gerald_J</dc:creator>
      <dc:date>2020-02-26T11:22:00Z</dc:date>
    </item>
    <item>
      <title>RE: Best Solution to Read PDF</title>
      <link>https://community.blueprism.com/t5/Product-Forum/Best-Solution-to-Read-PDF/m-p/82317#M33768</link>
      <description>As Gerald suggested, you may want to look at the OCR solutions on the DX, such as those that let you send a PDF to an API endpoint and receive the extracted text back.&lt;BR /&gt;&lt;BR /&gt;Incidentally, since you are dealing with invoices in particular, you may be interested in Decipher when it comes out this year.&lt;BR /&gt;&lt;BR /&gt;Also, have a look at this other post, where I gave an alternative solution that uses the same Tesseract engine exe as Blue Prism OCR but runs from the command line instead of needing to open PDFs on the screen:&amp;nbsp;&lt;A href="https://community.blueprism.com/communities/community-home/digestviewer/viewthread?GroupId=385&amp;amp;MessageKey=f614e7cc-0195-4238-8acc-35d91d11df97&amp;amp;CommunityKey=1e516cfe-4d1f-4de9-a9eb-58d15bf38c81&amp;amp;tab=digestviewer&amp;amp;ReturnUrl=%2fcommunities%2fallrecentposts&amp;amp;SuccessMsg=Thank%20you%20for%20submitting%20your%20message." target="_blank" rel="noopener"&gt;Another post recently about using OCR on PDFs&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;------------------------------&lt;BR /&gt;Dave Morris&lt;BR /&gt;3Ci @ Southern Company&lt;BR /&gt;Atlanta, GA&lt;BR /&gt;------------------------------&lt;BR /&gt;</description>
      <pubDate>Wed, 26 Feb 2020 12:03:00 GMT</pubDate>
      <guid>https://community.blueprism.com/t5/Product-Forum/Best-Solution-to-Read-PDF/m-p/82317#M33768</guid>
      <dc:creator>david.l.morris</dc:creator>
      <dc:date>2020-02-26T12:03:00Z</dc:date>
    </item>
    <item>
      <title>RE: Best Solution to Read PDF</title>
      <link>https://community.blueprism.com/t5/Product-Forum/Best-Solution-to-Read-PDF/m-p/82318#M33769</link>
      <description>Dave, I have used Tesseract, and this library is not fast.&lt;BR /&gt;&lt;BR /&gt;------------------------------&lt;BR /&gt;Cohen&lt;BR /&gt;RPA Developer&lt;BR /&gt;&lt;BR /&gt;Romania&lt;BR /&gt;------------------------------&lt;BR /&gt;</description>
      <pubDate>Wed, 26 Feb 2020 12:10:00 GMT</pubDate>
      <guid>https://community.blueprism.com/t5/Product-Forum/Best-Solution-to-Read-PDF/m-p/82318#M33769</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2020-02-26T12:10:00Z</dc:date>
    </item>
    <item>
      <title>RE: Best Solution to Read PDF</title>
      <link>https://community.blueprism.com/t5/Product-Forum/Best-Solution-to-Read-PDF/m-p/82319#M33770</link>
      <description>Thank you, Gerald! I heard ABBYY is wicked! I will see if they want to pay for it.&lt;BR /&gt;&lt;BR /&gt;Thanks for your support!&lt;BR /&gt;&lt;BR /&gt;------------------------------&lt;BR /&gt;Cohen&lt;BR /&gt;RPA Developer&lt;BR /&gt;&lt;BR /&gt;Romania&lt;BR /&gt;------------------------------&lt;BR /&gt;</description>
      <pubDate>Wed, 26 Feb 2020 12:15:00 GMT</pubDate>
      <guid>https://community.blueprism.com/t5/Product-Forum/Best-Solution-to-Read-PDF/m-p/82319#M33770</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2020-02-26T12:15:00Z</dc:date>
    </item>
    <item>
      <title>RE: Best Solution to Read PDF</title>
      <link>https://community.blueprism.com/t5/Product-Forum/Best-Solution-to-Read-PDF/m-p/82320#M33771</link>
      <description>You're welcome, Cohen. You can try other vendors if ABBYY has problems.&lt;BR /&gt;&lt;BR /&gt;------------------------------&lt;BR /&gt;Gerald J&lt;BR /&gt;Automation Engineer&lt;BR /&gt;10xds&lt;BR /&gt;Kerala/Kochi&lt;BR /&gt;+91-9159842805&lt;BR /&gt;------------------------------&lt;BR /&gt;</description>
      <pubDate>Wed, 26 Feb 2020 12:23:00 GMT</pubDate>
      <guid>https://community.blueprism.com/t5/Product-Forum/Best-Solution-to-Read-PDF/m-p/82320#M33771</guid>
      <dc:creator>Gerald_J</dc:creator>
      <dc:date>2020-02-26T12:23:00Z</dc:date>
    </item>
  </channel>
</rss>

