Reading PDFs without Ctrl+A & Ctrl+C

FredrikAdland
Level 4
On my current project, we have set up robots to read PDFs daily from Outlook through MAPIEx using Ctrl + A and Ctrl + C, then storing the clipboard contents in Blue Prism. However, there's a problem with this technique: Ctrl + A & Ctrl + C doesn't preserve vital formatting information from the PDF.

For example, if a table in the PDF is set up like this:

A    B    C
1    23   4

Ctrl + A & Ctrl + C often stores the information as:

ABC
1234

This can make it difficult, if not impossible, to retrieve all the information from larger or more complex PDF tables, because in some cases there is no way to tell which column a value belongs to.

I've tested iTextSharp, a .NET PDF library, and it works perfectly at keeping the formatting information (tabs, spaces etc.)! However, it's not free for closed source projects (at least for v5 and above, and v4 doesn't have the right functionality).

So I'm wondering if anyone here knows of another .NET PDF library that's free and works well for reading PDFs? Alternatively, is there a way to keep the formatting using the .NET framework in a code stage (preferably in C#)?

Thank you.
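To illustrate why position information matters for the table problem described above, here is a minimal C# sketch of how column layout could be rebuilt once each piece of text comes with its coordinates. It does not read a PDF itself: the `TextFragment` type, the `LayoutRebuilder` class, and the `charWidth` and `lineTolerance` values are all hypothetical, and the coordinates would have to be supplied by whichever PDF library is used (assuming it exposes text positions at all).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical type: one piece of text plus the position reported for it.
// Coordinates are assumed to be in PDF points, with Y growing upward.
public class TextFragment
{
    public double X { get; set; }
    public double Y { get; set; }
    public string Text { get; set; }
}

public static class LayoutRebuilder
{
    // charWidth and lineTolerance are assumed values; tune them per document.
    public static string Rebuild(IEnumerable<TextFragment> fragments,
                                 double charWidth = 6.0,
                                 double lineTolerance = 2.0)
    {
        // Group fragments into lines: fragments whose Y values are close
        // to the first fragment of the current line share that line.
        var lines = new List<List<TextFragment>>();
        foreach (var f in fragments.OrderByDescending(x => x.Y))
        {
            var last = lines.LastOrDefault();
            if (last != null && Math.Abs(last[0].Y - f.Y) <= lineTolerance)
                last.Add(f);
            else
                lines.Add(new List<TextFragment> { f });
        }

        // Place each fragment at a character column derived from its X,
        // padding with spaces so columns stay aligned across lines.
        var output = new StringBuilder();
        foreach (var line in lines)
        {
            var row = new StringBuilder();
            foreach (var f in line.OrderBy(x => x.X))
            {
                int column = Math.Max(0, (int)Math.Round(f.X / charWidth));
                // Always keep at least one space between neighbouring fragments.
                int target = row.Length == 0 ? column : Math.Max(column, row.Length + 1);
                row.Append(' ', target - row.Length);
                row.Append(f.Text);
            }
            output.AppendLine(row.ToString());
        }
        return output.ToString();
    }
}

public static class Program
{
    public static void Main()
    {
        // Made-up coordinates for the A/B/C table from the post.
        var fragments = new List<TextFragment>
        {
            new TextFragment { X = 0,   Y = 700, Text = "A" },
            new TextFragment { X = 60,  Y = 700, Text = "B" },
            new TextFragment { X = 120, Y = 700, Text = "C" },
            new TextFragment { X = 0,   Y = 686, Text = "1" },
            new TextFragment { X = 60,  Y = 686, Text = "23" },
            new TextFragment { X = 120, Y = 686, Text = "4" },
        };
        Console.Write(LayoutRebuilder.Rebuild(fragments));
    }
}
```

With the sample coordinates, "23" lands under "B" and "4" under "C", so the column each value belongs to can be recovered by character position. In a Blue Prism code stage, only the body of `Rebuild` would be needed, fed from a collection of positioned text.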
2 REPLIES

Hi Fredrik, have you tried opening the PDF file in Foxit Reader and saving it as text? I am not sure about your tables, but this definitely kept the right formatting for me, compared to Adobe Reader, which completely destroyed any formatting. It was even worse than you described. Z.

FredrikAdland
Level 4
Thank you! I'll check out both PDF readers and compare them.