<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic RE: extract data from pdf in Product Forum</title>
    <link>https://community.blueprism.com/t5/Product-Forum/extract-data-from-pdf/m-p/90090#M40377</link>
    <description>&lt;P&gt;Hello Lucia,&lt;BR /&gt;In addition to what others have pointed out, I want to mention that you can find &lt;A href="https://digitalexchange.blueprism.com/dx/entry/3439/solution/pdfpig"&gt;pdfpig&lt;/A&gt; on the DX (Digital Exchange); it is an asset for reading PDFs.&lt;/P&gt;
&lt;DIV class="flex flex-grow flex-col max-w-full"&gt;
&lt;DIV data-message-author-role="assistant" data-message-id="dbbeeed7-b24e-454f-b0c7-7cb01890aaba" class="min-h-[20px] text-message flex flex-col items-start gap-3 whitespace-pre-wrap break-words [.text-message+&amp;amp;]:mt-5 overflow-x-auto"&gt;
&lt;DIV class="markdown prose w-full break-words dark:prose-invert light"&gt;
&lt;P&gt;In our team, we follow a "rule" regarding PDFs: if a PDF is digital, structured, and has selectable text, it can be automated by extracting the text and applying regular expressions. Extracting a value is very easy if you know the text that comes immediately before and after it.&lt;/P&gt;
&lt;P&gt;If a PDF is unstructured or is a scanned image, it can still be automated, but with a different approach, such as Decipher. However, you have to train those models to reach a good level of accuracy.&lt;/P&gt;
&lt;P&gt;Regards!&lt;/P&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P&gt;&lt;/P&gt;&lt;BR /&gt;&lt;BR /&gt;------------------------------&lt;BR /&gt;Daniel Sanhueza&lt;BR /&gt;RPA Professional Developer&lt;BR /&gt;Deloitte&lt;BR /&gt;America/Santiago&lt;BR /&gt;------------------------------&lt;BR /&gt;</description>
    <pubDate>Sun, 03 Mar 2024 23:33:00 GMT</pubDate>
    <dc:creator>Daniel_Sanhueza</dc:creator>
    <dc:date>2024-03-03T23:33:00Z</dc:date>
    <item>
      <title>extract data from pdf</title>
      <link>https://community.blueprism.com/t5/Product-Forum/extract-data-from-pdf/m-p/90087#M40374</link>
      <description>&lt;P&gt;Hi, I'm learning and I want to know:&lt;BR /&gt;&lt;BR /&gt;How can I extract the text from a PDF file?&lt;BR /&gt;&lt;BR /&gt;Thanks so much!&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;&lt;BR /&gt;------------------------------&lt;BR /&gt;Lucia Lisdero&lt;BR /&gt;------------------------------&lt;/P&gt;</description>
      <pubDate>Fri, 01 Mar 2024 18:06:00 GMT</pubDate>
      <guid>https://community.blueprism.com/t5/Product-Forum/extract-data-from-pdf/m-p/90087#M40374</guid>
      <dc:creator>lulis</dc:creator>
      <dc:date>2024-03-01T18:06:00Z</dc:date>
    </item>
    <item>
      <title>RE: extract data from pdf</title>
      <link>https://community.blueprism.com/t5/Product-Forum/extract-data-from-pdf/m-p/90088#M40375</link>
      <description>&lt;P&gt;Hello Lucia,&lt;/P&gt;
&lt;P&gt;You can try using a Python script to read the PDF file.&lt;/P&gt;
&lt;P&gt;The following library can be used to read the file:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Pytesseract&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;Command to install the library: pip install pytesseract (pytesseract is a wrapper, so the Tesseract OCR engine itself must also be installed on the machine).&lt;/P&gt;
&lt;P&gt;Once Python and the required libraries are installed, follow the steps below.&lt;/P&gt;
&lt;P&gt;Step 1: Open Command Prompt and type the command below.&lt;/P&gt;
&lt;P&gt;python textreader.py -f dictionary.pdf&lt;/P&gt;
&lt;P&gt;Here, textreader.py is the Python script file name and dictionary.pdf is the input file.&lt;/P&gt;
&lt;P&gt;Note: If you are automating this task with Blue Prism or any other RPA tool, launch Command Prompt from Blue Prism and pass the same command using a Write action or Global Send Keys (GSK).&lt;/P&gt;
&lt;P&gt;Once you run the command, two text files (.txt) are created in the same location as the input file, for example image_data and text_data.&lt;/P&gt;
&lt;P&gt;If you are reading an image file, the extracted data goes to the image_data file; otherwise, all the data is extracted to the text_data file.&lt;/P&gt;
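&lt;P&gt;As a rough sketch of the command-line side only, a script such as textreader.py could parse the -f option and pick the output file along the following lines. This is an assumption about how such a script might be organised, not the actual script: the --file long option, the list of image suffixes, and the helper names parse_args and output_path are all hypothetical, and the OCR or text extraction step itself is left out.&lt;/P&gt;
&lt;PRE&gt;
```python
import argparse
from pathlib import Path

# Assumption: these suffixes are treated as image files; adjust as needed.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def parse_args(argv=None):
    """Parse the -f option, as in: python textreader.py -f dictionary.pdf"""
    parser = argparse.ArgumentParser(
        description="Extract text from a PDF or image file."
    )
    # --file is a hypothetical long form of the -f option shown in the post.
    parser.add_argument("-f", "--file", required=True, type=Path,
                        help="input file (PDF or image)")
    return parser.parse_args(argv)


def output_path(input_file):
    """Pick image_data.txt or text_data.txt next to the input file."""
    is_image = input_file.suffix.lower() in IMAGE_SUFFIXES
    name = "image_data.txt" if is_image else "text_data.txt"
    return input_file.parent / name


# Passing the arguments explicitly mirrors the command line from Step 1.
args = parse_args(["-f", "dictionary.pdf"])
print(output_path(args.file))
```
&lt;/PRE&gt;
&lt;P&gt;In a real script, the extracted text would then be written to the path returned by output_path.&lt;/P&gt;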
&lt;DIV class="media" style="overflow: hidden; zoom: 1;"&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="24153.jpg"&gt;&lt;img src="https://community.blueprism.com/t5/image/serverpage/image-id/24284iC2AB552A20B4B080/image-size/large?v=v2&amp;amp;px=999" role="button" title="24153.jpg" alt="24153.jpg" /&gt;&lt;/span&gt;
&lt;DIV class="media" style="overflow: hidden; zoom: 1;"&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="24154.jpg"&gt;&lt;img src="https://community.blueprism.com/t5/image/serverpage/image-id/24286i2BB98D44607F7AE7/image-size/large?v=v2&amp;amp;px=999" role="button" title="24154.jpg" alt="24154.jpg" /&gt;&lt;/span&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="24155.jpg"&gt;&lt;img src="https://community.blueprism.com/t5/image/serverpage/image-id/24287i85F2033C02E9FDD3/image-size/large?v=v2&amp;amp;px=999" role="button" title="24155.jpg" alt="24155.jpg" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Please vote for this answer if it helped you read the PDF file. &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;&lt;BR /&gt;&lt;BR /&gt;------------------------------&lt;BR /&gt;Ramesh Ravi&lt;BR /&gt;------------------------------&lt;BR /&gt;</description>
      <pubDate>Sun, 03 Mar 2024 09:36:00 GMT</pubDate>
      <guid>https://community.blueprism.com/t5/Product-Forum/extract-data-from-pdf/m-p/90088#M40375</guid>
      <dc:creator>ramesh.ravi</dc:creator>
      <dc:date>2024-03-03T09:36:00Z</dc:date>
    </item>
    <item>
      <title>RE: extract data from pdf</title>
      <link>https://community.blueprism.com/t5/Product-Forum/extract-data-from-pdf/m-p/90089#M40376</link>
      <description>&lt;DIV&gt;&lt;SPAN&gt;Hello,&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;There is Decipher, a dedicated solution for reading documents and PDFs: &lt;A href="https://bpdocs.blueprism.com/decipher/user-guide/getting-started.htm" target="test_blank"&gt;https://bpdocs.blueprism.com/decipher/user-guide/getting-started.htm&lt;/A&gt;&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;Regards,&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;BR /&gt;------------------------------&lt;BR /&gt;Leonardo Soares&lt;BR /&gt;RPA Developer Tech Leader&lt;BR /&gt;América/Brazil&lt;BR /&gt;------------------------------&lt;BR /&gt;</description>
      <pubDate>Sun, 03 Mar 2024 18:43:00 GMT</pubDate>
      <guid>https://community.blueprism.com/t5/Product-Forum/extract-data-from-pdf/m-p/90089#M40376</guid>
      <dc:creator>LeonardoSQueiroz</dc:creator>
      <dc:date>2024-03-03T18:43:00Z</dc:date>
    </item>
    <item>
      <title>RE: extract data from pdf</title>
      <link>https://community.blueprism.com/t5/Product-Forum/extract-data-from-pdf/m-p/90090#M40377</link>
      <description>&lt;P&gt;Hello Lucia,&lt;BR /&gt;In addition to what others have pointed out, I want to mention that you can find &lt;A href="https://digitalexchange.blueprism.com/dx/entry/3439/solution/pdfpig"&gt;pdfpig&lt;/A&gt; on the DX (Digital Exchange); it is an asset for reading PDFs.&lt;/P&gt;
&lt;DIV class="flex flex-grow flex-col max-w-full"&gt;
&lt;DIV data-message-author-role="assistant" data-message-id="dbbeeed7-b24e-454f-b0c7-7cb01890aaba" class="min-h-[20px] text-message flex flex-col items-start gap-3 whitespace-pre-wrap break-words [.text-message+&amp;amp;]:mt-5 overflow-x-auto"&gt;
&lt;DIV class="markdown prose w-full break-words dark:prose-invert light"&gt;
&lt;P&gt;In our team, we follow a "rule" regarding PDFs: if a PDF is digital, structured, and has selectable text, it can be automated by extracting the text and applying regular expressions. Extracting a value is very easy if you know the text that comes immediately before and after it.&lt;/P&gt;
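&lt;P&gt;A minimal sketch of that before/after-text idea in Python, assuming the PDF text has already been extracted into a string. The helper name extract_between and the sample invoice text are hypothetical and only serve as an illustration.&lt;/P&gt;
&lt;PRE&gt;
```python
import re


def extract_between(text, pre, post):
    """Return the text found between the markers pre and post, or None."""
    # re.escape keeps characters such as "." or "(" in the markers literal.
    pattern = re.escape(pre) + r"\s*(.*?)\s*" + re.escape(post)
    # DOTALL lets the captured value span line breaks if it has to.
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1) if match else None


# Hypothetical text, as it might come back from a digital PDF.
sample = "Invoice Number: INV-0042\nInvoice Date: 2024-03-01\nTotal: 150.00 USD"

invoice_number = extract_between(sample, "Invoice Number:", "Invoice Date:")
total = extract_between(sample, "Total:", "USD")
print(invoice_number)
print(total)
```
&lt;/PRE&gt;
&lt;P&gt;The non-greedy group stops at the first occurrence of the text that follows the value, so both markers should be specific enough to be unique in the document.&lt;/P&gt;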
&lt;P&gt;If a PDF is unstructured or is a scanned image, it can still be automated, but with a different approach, such as Decipher. However, you have to train those models to reach a good level of accuracy.&lt;/P&gt;
&lt;P&gt;Regards!&lt;/P&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P&gt;&lt;/P&gt;&lt;BR /&gt;&lt;BR /&gt;------------------------------&lt;BR /&gt;Daniel Sanhueza&lt;BR /&gt;RPA Professional Developer&lt;BR /&gt;Deloitte&lt;BR /&gt;America/Santiago&lt;BR /&gt;------------------------------&lt;BR /&gt;</description>
      <pubDate>Sun, 03 Mar 2024 23:33:00 GMT</pubDate>
      <guid>https://community.blueprism.com/t5/Product-Forum/extract-data-from-pdf/m-p/90090#M40377</guid>
      <dc:creator>Daniel_Sanhueza</dc:creator>
      <dc:date>2024-03-03T23:33:00Z</dc:date>
    </item>
  </channel>
</rss>

