Reading PDFs without Ctrl+A & Ctrl+C
22-03-17 08:02 PM
On my current project, we have set up robots that read PDFs daily from Outlook through MAPIEx by sending Ctrl + A and Ctrl + C, then storing the clipboard contents in Blue Prism.
However, there's a problem with this technique: Ctrl + A & Ctrl + C doesn't preserve vital formatting information from the PDF.
For example, if a table in the PDF is set up like this:
A B C
1 23 4
Ctrl + A & Ctrl + C often stores the information as:
ABC
1234
This can make it difficult, if not impossible, to retrieve all the information from larger or more complex PDF tables, because in some cases there is no way to tell which column a value belongs to.
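To make the difference concrete, here is a minimal sketch of why the spacing matters. It assumes the extracted text separates columns with a tab or at least two spaces. That is an assumption about the extractor's output, not something any particular tool guarantees. The sample strings and the `split_columns` helper are hypothetical. It is written in Python for brevity; the same split could be done in a C# code stage with `Regex.Split`.

```python
import re


def split_columns(line):
    """Split one line of text into cells on tabs or runs of 2+ spaces."""
    return [cell for cell in re.split(r"\t+| {2,}", line.strip()) if cell]


# Hypothetical extraction output that keeps the spacing between columns.
layout_text = "A    B    C\n1    23   4"

# What Ctrl + A & Ctrl + C often yields: the cell boundaries are gone.
clipboard_text = "ABC\n1234"

layout_rows = [split_columns(line) for line in layout_text.splitlines()]
clipboard_rows = [split_columns(line) for line in clipboard_text.splitlines()]

print(layout_rows)     # each cell is recovered separately
print(clipboard_rows)  # each row collapses into a single cell
```

With the spacing preserved, every row splits into three cells. With the clipboard text, "1234" could be 1/23/4, 12/3/4 or any other split, so the column of each value cannot be recovered.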
I've tested iTextSharp, a .NET PDF library, and it works perfectly at keeping formatting information (tabs, spaces, etc.)! However, it's not free for closed-source projects (at least for v5 and above, and v4 doesn't have the right functionality).
So I'm wondering if anyone here knows of another .NET PDF library that's free and works well for reading PDFs? Alternatively, is there a way to keep the formatting with the .NET library in a code stage (preferably in C#)?
Thank you.
2 REPLIES
23-03-17 12:23 AM
Hi Fredrik,
Have you tried opening the PDF file in Foxit Reader and saving it as text? I am not sure about your tables, but this definitely kept the right formatting for me, compared to Adobe Reader, which completely destroyed any formatting. It was even worse than you described.
Z.
23-03-17 01:31 PM
Thank you!
I'll check out both PDF readers and compare them.
