29-03-23 03:46 AM
Hi All,
I have encountered 2 issues lately while working in surface automation to automate PDF extraction.
Issue 1: I'm using the colour of the checkbox to identify whether the form is editable or non-editable (blue indicates an editable form, white a non-editable one). Screenshots of the checkbox are shared below.
For editable form,
For non editable form,
I have used the image location method with the position set to anywhere, so I expected the bot to find the checkbox anywhere on the screen even when its value changes to "Yes". However, the bot is currently unable to locate the checkbox when "Yes" is selected, because I spied the element while it was set to "No", as shown above. Could you please recommend a way to locate the checkbox anywhere on the screen regardless of its current value?
Issue 2: I am using Read with OCR to extract the NRIC number from the non-editable form.
In most cases it works as expected, but in some cases it reads S as $, I as ), and O as 0. I have tried various page segmentation and scale combinations, and none of them fixes it. Is there an alternative way to mitigate this issue?
30-03-23 09:49 AM
For issue 1, I would advise you to use the label "Smoking Status" as an anchor (image method) and to locate the Yes and No checkboxes by their coordinates relative to that anchor. If all you need to find out is whether the form is editable based on the colour, follow the same logic, but spy the relative region on the checkbox as an image instead, taking a 2x2 pixel area from either the upper-left or the bottom-right corner of the Yes checkbox. The goal is for Blue Prism to recognize the colour of the checkbox regardless of whether it is ticked.
For issue 2: AFAIK, the built-in Read with OCR is rather tricky. You can tweak it, but you cannot really train the model, so it is a matter of trial and error. If the PDF is not rasterized, perhaps you could try using something like this https://digitalexchange.blueprism.com/dx/entry/3439/solution/extracting-data-from-text-2 or this https://community.blueprism.com/communities/community-home/digestviewer/viewthread?MessageKey=d89ace27-3062-400b-8faf-0186dea3b35e&CommunityKey=3743dbaa-6766-4a4d-b7ed-9a98b6b1dd01#bmd89ace27-3062-400... (Reply #4)
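To make the colour decision concrete, here is a minimal Python sketch of the logic only; it is not Blue Prism code. It assumes you can obtain the RGB values of the small corner patch by some means, and the function name `classify_checkbox` and both thresholds are hypothetical values you would need to tune against your own screenshots.

```python
from statistics import mean

def classify_checkbox(pixels, margin=30, white_floor=200):
    """Classify a small patch of (R, G, B) pixels sampled from a checkbox corner.

    margin and white_floor are illustrative thresholds, not measured values.
    """
    r = mean(p[0] for p in pixels)
    g = mean(p[1] for p in pixels)
    b = mean(p[2] for p in pixels)
    # Blue patch: the blue channel clearly dominates the red channel.
    if b - r > margin:
        return "editable"
    # White patch: all three channels are bright.
    if min(r, g, b) > white_floor:
        return "non-editable"
    return "unknown"
```

Sampling a corner rather than the centre matters because the tick mark sits in the middle of the box, so the corner colour should stay the same whether "Yes" or "No" is selected.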
31-03-23 08:21 AM
Thank you @Ramón Requena López for the suggestion.
For issue 1, I tried your method with some additional tweaks and it's working. Thank you so much for the workaround.
For issue 2, I still haven't found a solution. The client restricts the use of any external third-party application or API because the extracted data is personal information. Do you have an alternative approach to solve the issue?
31-03-23 08:30 AM
No biggie, Amrutha, glad I could help 😉
Can you, by any chance, save the PDF as any kind of text document (.txt, .doc)? That way you could parse it with a RegEx relatively easily, without resorting to OCR solutions.
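If the text can be exported, the RegEx step might look roughly like the Python sketch below. The pattern assumes the NRIC is one uppercase letter, seven digits, and one uppercase letter; that layout, the function name `find_nric`, and the sample string are assumptions for illustration, so check them against your real documents.

```python
import re

# Assumed NRIC layout: letter, seven digits, letter (e.g. S1234567D).
NRIC_PATTERN = re.compile(r"\b[A-Z]\d{7}[A-Z]\b")

def find_nric(text):
    """Return every substring of text that matches the assumed NRIC layout."""
    return NRIC_PATTERN.findall(text)
```

For example, `find_nric("NRIC No: S1234567D, Smoking Status: No")` would pick out the single matching token.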
31-03-23 08:52 AM
Sadly, that doesn't work.
I have experimented with several approaches: Python libraries, the copy-and-paste method, Power Automate, Adobe Reader Pro to convert the non-editable PDF to Word and Excel, and finally the Read with OCR functionality in Blue Prism using Google's Tesseract, which was able to read the non-editable PDFs as text. I also fine-tuned the OCR engine using parameters such as page segmentation, scaling, and character whitelist. It works for some cases, but in others the OCR had difficulty reading a few characters such as S, I, and O, which were read as $, ), and 0.
31-03-23 11:27 AM
Hmmm, I would do a regression test then. Try to get a sample of at least 50-100 PDFs (the bigger the better, although it will take you longer to verify the Replace logic you will need to implement) and then see which errors Tesseract makes most often during the OCR parsing. If you can really narrow it down to those three, then the solution is pretty straightforward. For example, for 0 read in place of O at the first or last position (compare against the text "0", not the number 0, and test each end separately so both can be fixed):
IF Left([ocr_output], 1) = "0" Then
[ocr_output] = "O" & Right([ocr_output], Len([ocr_output])-1)
End IF
IF Right([ocr_output], 1) = "0" Then
[ocr_output] = Left([ocr_output], Len([ocr_output])-1) & "O"
End IF
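The same idea, extended to all three misreads reported in this thread, can be sketched in Python as follows. This is only an illustration of the logic, which you would reproduce with Calculation and Decision stages in Blue Prism. It assumes the NRIC is a letter, seven digits, and a letter, and that the misreads only need correcting in the two letter positions; `fix_nric`, `LETTER_FIXES`, and `NRIC_SHAPE` are hypothetical names.

```python
import re

# Misreads reported in this thread, applied only where a letter is expected.
LETTER_FIXES = {"$": "S", ")": "I", "0": "O"}
# Assumed layout: letter, seven digits, letter.
NRIC_SHAPE = re.compile(r"[A-Z]\d{7}[A-Z]")

def fix_nric(ocr_output):
    """Correct the first and last characters, then validate the whole value.

    Returns the corrected string, or None if it still does not fit the layout.
    """
    text = ocr_output.strip()
    if len(text) < 2:
        return None
    first = LETTER_FIXES.get(text[0], text[0])
    last = LETTER_FIXES.get(text[-1], text[-1])
    candidate = first + text[1:-1] + last
    return candidate if NRIC_SHAPE.fullmatch(candidate) else None
```

Restricting the substitutions to the letter positions avoids turning a genuine 0 in the digit block into an O, and returning None for anything that still fails the layout check lets you route those cases to manual review instead of passing on a wrong value.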