topic RE: extract data from pdf in Product Forum

extract data from pdf

lulis — Fri, 01 Mar 2024 18:06:00 GMT

Hi, I'm learning and I want to know:

How can I extract the text from a PDF file?

Thanks so much

------------------------------
Lucia Lisdero
------------------------------

RE: extract data from pdf

ramesh.ravi — Sun, 03 Mar 2024 09:36:00 GMT

Hello Lucia,

You can use a Python script to read the PDF file.

The following library can be used to read the file:

pytesseract

Command to install the library: pip install pytesseract

Note that pytesseract is a Python wrapper, so the Tesseract OCR engine must also be installed on the machine.

Once Python is installed and the required libraries are available, follow the steps below.

Step 1: Open Command Prompt and type the command below.

python textreader.py -f dictionary.pdf

Here, textreader.py is the Python script file name and dictionary.pdf is the input file.

Note: If you are automating this task with Blue Prism or any other RPA tool, launch Command Prompt from Blue Prism and pass the same command using a Write action or Global Send Keys (GSK).

Once you run the command, two text files (.txt) are created in the same location as the input file, for example image_data and text_data.

If you are reading an image file, the extracted data goes to the image_data file; otherwise, all the data is extracted to the text_data file.
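
The textreader.py script itself is not posted in this thread, so the following is only a rough sketch of what such a script could look like, not the actual file. It assumes the -f option from the command above, writes image_data.txt or text_data.txt next to the input file, and uses pytesseract (with Pillow) for images. Using pypdf to read the text layer of a PDF is an assumption on my side, as is the list of image extensions.

```python
import argparse
from pathlib import Path

# Assumption: extensions treated as image input; the thread does not list them.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def parse_args(argv=None):
    """Parse the -f option used in the command above."""
    parser = argparse.ArgumentParser(
        description="Extract text from a PDF or image file."
    )
    parser.add_argument("-f", "--file", type=Path, required=True,
                        help="input PDF or image file")
    return parser.parse_args(argv)


def is_image(path):
    """True if the file extension looks like an image format."""
    return path.suffix.lower() in IMAGE_SUFFIXES


def output_path_for(input_path):
    """Output goes next to the input: image_data.txt or text_data.txt."""
    name = "image_data.txt" if is_image(input_path) else "text_data.txt"
    return input_path.with_name(name)


def extract_text(input_path):
    """Return text from an image (via OCR) or from a PDF's text layer."""
    if is_image(input_path):
        # Needs pytesseract, Pillow, and the Tesseract OCR engine itself.
        import pytesseract
        from PIL import Image
        with Image.open(input_path) as image:
            return pytesseract.image_to_string(image)
    # Assumption: pypdf is not named in the thread; it only reads
    # PDFs that already contain selectable text.
    from pypdf import PdfReader
    reader = PdfReader(str(input_path))
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def main(argv=None):
    args = parse_args(argv)
    out = output_path_for(args.file)
    out.write_text(extract_text(args.file), encoding="utf-8")
    print(f"Wrote {out}")


if __name__ == "__main__":
    main()
```

A scanned PDF (pages stored as images) would need its pages converted to images before OCR, which this sketch does not cover.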

Please vote for this answer if it helped you read the PDF file. 🙂

------------------------------
Ramesh Ravi
------------------------------

RE: extract data from pdf

LeonardoSQueiroz — Sun, 03 Mar 2024 18:43:00 GMT

Hello,

There is also Decipher, a solution built specifically for reading documents and PDFs: https://bpdocs.blueprism.com/decipher/user-guide/getting-started.htm

Regards,

------------------------------
Leonardo Soares
RPA Developer Tech Leader
América/Brazil
------------------------------

RE: extract data from pdf

Daniel_Sanhueza — Sun, 03 Mar 2024 23:33:00 GMT

Hello Lucia,
In addition to what others have pointed out, I want to mention that you can find PdfPig on the Digital Exchange (DX); it is an asset for reading PDFs.

In our team, we follow a "rule" regarding PDFs: if a PDF is digital, structured, and has selectable text, it can be automated by extracting the text and applying regular expressions. Extracting a value is very easy if you know the text that comes immediately before and after it.
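
As a small illustration of the pre/post text idea, here is a minimal Python sketch. The helper name extract_between and the sample invoice text are made up for the example; in practice the text would come from whatever tool extracted it from the PDF.

```python
import re


def extract_between(text, before, after):
    """Return the text found between two known labels, or None if absent."""
    # re.escape keeps characters such as "." or "(" in the labels literal.
    pattern = re.escape(before) + r"\s*(.*?)\s*" + re.escape(after)
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1) if match else None


# Hypothetical text, as it might come out of a digital PDF.
sample = "Invoice Number: INV-001\nDate: 2024-03-01\nTotal: 150.00 USD\nThank you"

print(extract_between(sample, "Invoice Number:", "Date:"))  # INV-001
print(extract_between(sample, "Total:", "USD"))             # 150.00
```

The non-greedy (.*?) stops at the first occurrence of the post text, which is usually what you want when the same label appears more than once in a document.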

If a PDF is unstructured or is an image, it can still be automated, but with a different approach, such as Decipher. However, you have to train those models to reach a good level of accuracy.

Regards!

------------------------------
Daniel Sanhueza
RPA Professional Developer
Deloitte
America/Santiago
------------------------------