Best Solution to Read PDF

Anonymous
Not applicable
Guys,

What is the best solution for reading data from a PDF file?

I tried to find C# code for a code stage to do that, but mostly found old material about iTextSharp. The using iTextSharp.text.pdf.parser; namespace also seems to be no longer available, so this code wouldn't work anymore:
using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Returns the embedded text of every page; image-only (scanned) pages yield no text.
public string ReadPdfFile(string fileName)
{
    StringBuilder text = new StringBuilder();

    if (File.Exists(fileName))
    {
        PdfReader pdfReader = new PdfReader(fileName);
        try
        {
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                // Use a fresh strategy per page so text does not carry over between pages.
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                // GetTextFromPage already returns a .NET string, so no encoding round trip is needed.
                text.Append(PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy));
            }
        }
        finally
        {
            // Release the file handle even if extraction throws.
            pdfReader.Close();
        }
    }
    return text.ToString();
}
Thank you in advance!


------------------------------
Cohen
RPA Developer

Romania
------------------------------

Gerald_J
Level 5
Hi Cohen,

What kind of PDF file are you using?

Is it readable (can you press Ctrl+A, then Ctrl+C, and paste the text into a text file)?

Or

Is it a scanned document?

Or

Is it an image converted to PDF?

Thanks,

------------------------------
Gerald J
Automation Engineer
10xds
Kerala/Kochi
+91-9159842805
------------------------------
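
Gerald's manual check (select all, copy, paste) can also be automated. The following is a minimal sketch, not a definitive implementation: it assumes the third-party pypdf package is installed for the extraction step, and the names extract_page_texts, classify_pdf, and the min_chars threshold of 20 are hypothetical choices made for illustration.

```python
def extract_page_texts(path):
    """Return the embedded text of each page (assumes pypdf is installed)."""
    from pypdf import PdfReader  # third-party; imported lazily on purpose

    reader = PdfReader(path)
    # extract_text() can return an empty result for image-only pages.
    return [page.extract_text() or "" for page in reader.pages]


def classify_pdf(page_texts, min_chars=20):
    """Classify a PDF from its per-page text.

    "text"  : every page has a usable text layer
    "image" : no page has one (scanned or image-only, so OCR is needed)
    "mixed" : some pages have text, others do not
    "empty" : the document has no pages
    """
    if not page_texts:
        return "empty"
    pages_with_text = sum(
        1 for text in page_texts if len(text.strip()) >= min_chars
    )
    if pages_with_text == len(page_texts):
        return "text"
    if pages_with_text == 0:
        return "image"
    return "mixed"
```

Calling classify_pdf(extract_page_texts("invoice.pdf")) would then tell a process whether plain text extraction is enough or whether the file has to go through OCR. The threshold is a guess; scanned pages sometimes carry a few stray characters, so it should be tuned on real documents.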

Anonymous
Not applicable
Someone told me that DLL is wicked, but it seems not to be. It can only copy text out of a PDF, and my PDFs are images of invoices and the like, so there is no text to copy from a picture embedded in a PDF 😄

Instead of OCR, do you guys have other solutions? OCR seems to be very slow.

------------------------------
Cohen
RPA Developer

Romania
------------------------------

Hi Cohen,

Since it's an image PDF, C# code that extracts text won't work, and since OCR is not suitable for you, you will need to explore the assets on the Blue Prism Digital Exchange.

https://digitalexchange.blueprism.com/dx/

There are many assets for reading PDFs, such as ABBYY FlexiCapture, Rossum, etc., but most of them are paid.

The other way is through Python: once the Python code is developed, you can call it from Blue Prism to perform the PDF reading.

Thanks,




------------------------------
Gerald J
Automation Engineer
10xds
Kerala/Kochi
+91-9159842805
------------------------------
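
To illustrate the Python route Gerald mentions, here is a rough sketch of a script that a Blue Prism process could start from the command line and whose standard output it could read back. Nothing here comes from the thread: the JSON shape, the names build_result and main, and the pluggable extract function are all hypothetical, and how the script is launched from Blue Prism is left open.

```python
import argparse
import json


def build_result(pdf_path, text, error=None):
    """Serialise the outcome as one JSON line that the caller can parse."""
    return json.dumps(
        {"file": str(pdf_path), "ok": error is None, "text": text, "error": error}
    )


def _not_implemented(pdf_path):
    # Placeholder: plug in the real extraction (text layer, OCR, vendor API).
    raise NotImplementedError("no extractor configured")


def main(argv, extract=_not_implemented):
    """Parse the arguments, run the extractor, print JSON, return an exit code."""
    parser = argparse.ArgumentParser(description="Extract text from a PDF")
    parser.add_argument("pdf_path", help="path of the PDF to read")
    args = parser.parse_args(argv)
    try:
        text = extract(args.pdf_path)
    except Exception as exc:  # report any failure to the caller as data
        print(build_result(args.pdf_path, "", str(exc)))
        return 1
    print(build_result(args.pdf_path, text))
    return 0

# When saved as a script, finish with:
#     import sys
#     raise SystemExit(main(sys.argv[1:]))
```

Returning a non-zero exit code together with an "error" field lets the calling process tell a failed extraction apart from a PDF that simply contains no text.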

As Gerald suggested, you may want to look at the OCR solutions on the DX, for example ones that let you send a PDF to an API endpoint and receive the extracted text back.

Incidentally, since you are dealing with invoices in particular, you may be interested in Decipher when that comes out this year.

Also, take a look at this other post, where I gave an alternative solution that uses the same Tesseract engine executable as Blue Prism OCR but runs from the command line, so the PDFs do not need to be opened on screen: Another post recently about using OCR on PDFs

------------------------------
Dave Morris
3Ci @ Southern Company
Atlanta, GA
------------------------------
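
Dave's linked post is not reproduced here, so the following is only a sketch of what a command-line Tesseract run might look like, not his exact method. It assumes the tesseract executable and Poppler's pdftoppm tool are both installed and on the PATH; Tesseract reads images rather than PDFs, so each page is rendered to PNG first. The function names, the 300 DPI default, and the "page" file prefix are illustrative choices.

```python
import subprocess
import tempfile
from pathlib import Path


def rasterize_command(pdf_path, out_prefix, dpi=300):
    """Build the pdftoppm call that renders every page to <prefix>-N.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def ocr_command(image_path, lang="eng"):
    """Build the tesseract call that writes recognised text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_pdf(pdf_path, lang="eng", dpi=300):
    """OCR a whole PDF without opening it on screen; returns the joined text."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(rasterize_command(pdf_path, prefix, dpi), check=True)
        texts = []
        # pdftoppm zero-pads page numbers, so a name sort keeps page order.
        for image in sorted(Path(tmp).glob("page-*.png")):
            result = subprocess.run(
                ocr_command(image, lang),
                check=True,
                capture_output=True,
                text=True,
            )
            texts.append(result.stdout)
    return "\n".join(texts)
```

Lowering the DPI trades accuracy for speed, which may matter given the concern about OCR being slow.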

Anonymous
Not applicable
Dave, I have used Tesseract, and that library is not fast.

------------------------------
Cohen
RPA Developer

Romania
------------------------------

Anonymous
Not applicable
Thank you, Gerald! I heard ABBYY is wicked! I will see if they want to pay for it.

Thanks for your support!

------------------------------
Cohen
RPA Developer

Romania
------------------------------

You're welcome, Cohen. You can try other vendors if ABBYY has problems.




------------------------------
Gerald J
Automation Engineer
10xds
Kerala/Kochi
+91-9159842805
------------------------------