01-03-24 06:06 PM
Hi, I'm learning and I want to know:
How can I extract the text from a pdf file?
Thanks so much
------------------------------
Lucia Lisdero
------------------------------
03-03-24 09:36 AM
Hello Lucia,
You can try using a Python script to read the PDF file.
The library below can be used to read the file.
Command to install the library: pip install pytesseract
Once Python is installed and the required libraries are imported, please follow the steps below.
Step 1: Open Command Prompt and type the command below.
python textreader.py -f dictionary.pdf
Here, textreader.py is the Python script file name and dictionary.pdf is the input file.
Note: If you are automating this task using Blue Prism or any other RPA tool, launch Command Prompt via Blue Prism and pass the same command as above using a write action or Global Send Keys (GSK).
Once you run the command, two text files (.txt) are created in the same location as the input file, for example image_data and text_data.
If you are reading an image file, the extracted data goes to the image_data file; otherwise, all the data is extracted to the text_data file.
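The post does not include textreader.py itself, so the following is only a rough sketch of how such a script might look, not the actual code. It assumes pypdf for PDFs with a selectable text layer and Pillow plus pytesseract for image files (pytesseract also needs the Tesseract OCR engine installed separately). The function names and the list of image suffixes are made up for illustration.

```python
import argparse
from pathlib import Path

# Hypothetical list of suffixes treated as image input.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def output_path(input_file):
    """Pick image_data.txt or text_data.txt next to the input file."""
    source = Path(input_file)
    is_image = source.suffix.lower() in IMAGE_SUFFIXES
    return source.with_name("image_data.txt" if is_image else "text_data.txt")


def read_image(path):
    """OCR an image file. Third-party imports stay local to this function."""
    import pytesseract          # wrapper around the Tesseract OCR engine
    from PIL import Image       # Pillow
    return pytesseract.image_to_string(Image.open(path))


def read_pdf(path):
    """Read the text layer of a PDF (assumption: pypdf, not named in the post)."""
    from pypdf import PdfReader
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def main(argv=None):
    parser = argparse.ArgumentParser(description="Extract text from a PDF or image.")
    parser.add_argument("-f", "--file", help="input PDF or image file")
    args = parser.parse_args(argv)
    if not args.file:
        parser.print_usage()
        return
    target = output_path(args.file)
    if target.name == "image_data.txt":
        text = read_image(args.file)
    else:
        text = read_pdf(args.file)
    target.write_text(text, encoding="utf-8")
    print(f"Wrote {target}")


if __name__ == "__main__":
    main()
```

A scanned PDF has no text layer, so read_pdf would return little or nothing for it; its pages would have to be converted to images and passed through OCR first.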
Please vote for this answer if it helped you read the PDF file. 🙂
03-03-24 06:43 PM
03-03-24 11:33 PM
Hello Lucia,
In addition to what others have pointed out, I want to mention that you can find PdfPig on the DX (Digital Exchange); it is an asset for reading PDFs.
In our team, we follow a "rule" regarding PDFs: if a PDF is digital, structured, and you can select the text, it can be automated by obtaining the data and using regular expressions. It is very easy to extract a value if you know the text that comes immediately before and after it.
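As a small illustration of the "text before and after" idea, here is a minimal sketch in Python. The helper name extract_between and the sample invoice text are hypothetical; in practice the text would come from whatever tool reads the PDF.

```python
import re


def extract_between(text, before, after):
    """Return the text between the literal markers `before` and `after`, or None."""
    # re.escape makes the markers literal; (.*?) captures as little as possible.
    pattern = re.escape(before) + r"\s*(.*?)\s*" + re.escape(after)
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1) if match else None


# Hypothetical text, as it might come out of a PDF with selectable text.
sample = "Invoice Number: INV-00123\nInvoice Date: 01/03/2024\nTotal Due: 1,250.00 USD"

invoice_number = extract_between(sample, "Invoice Number:", "Invoice Date:")
total_due = extract_between(sample, "Total Due:", "USD")
print(invoice_number, total_due)
```

The same pattern can be used in any tool that supports regular expressions; only the two marker strings change per field.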
If a PDF is unstructured or is an image, it can still be automated, but with a different approach, such as Decipher. However, you have to train those models to achieve good accuracy.
Regards!