22-06-24 03:49 PM
I have to compare two sheets containing around 1 lakh (100,000) rows of data. There are two sheets (A & B): A has two columns, A1 and A2, and B has two columns, B1 and B2. Some of the values in A1 and B1 are the same, and for those rows I want to update B2 = A2. In other words, B2 should take the value from A2 whenever A1 = B1.
I tried this with collection utilities but it isn't working, and even with 20K rows it takes a lot of time. Can we achieve this with OLEDB or some other way? Can anyone help with this?
23-06-24 10:49 AM
Dear @Karthik.choppadandi
Collections don't work for this amount of data in Excel; you need to create a custom code object and call its page in the process to get the result in Excel.
Please try the code below in the code stage:
Sub UpdateSheetB()
    ' Declare variables
    Dim wsA As Worksheet, wsB As Worksheet
    Dim lastRowA As Long, lastRowB As Long
    Dim dict As Object
    Dim i As Long

    ' Set worksheets
    Set wsA = ThisWorkbook.Sheets("SheetA") ' Replace with the name of your first sheet
    Set wsB = ThisWorkbook.Sheets("SheetB") ' Replace with the name of your second sheet

    ' Get the last row in each sheet
    lastRowA = wsA.Cells(wsA.Rows.Count, "A").End(xlUp).Row
    lastRowB = wsB.Cells(wsB.Rows.Count, "A").End(xlUp).Row

    ' Create a dictionary to store the values from SheetA
    Set dict = CreateObject("Scripting.Dictionary")

    ' Loop through SheetA and add values to the dictionary
    For i = 1 To lastRowA
        dict(wsA.Cells(i, 1).Value) = wsA.Cells(i, 2).Value
    Next i

    ' Loop through SheetB and update values
    For i = 1 To lastRowB
        If dict.Exists(wsB.Cells(i, 1).Value) Then
            wsB.Cells(i, 2).Value = dict(wsB.Cells(i, 1).Value)
        End If
    Next i

    ' Clean up
    Set dict = Nothing
End Sub
Input parameters:
SheetA - the name of your first worksheet
SheetB - the name of your second worksheet
lastRowA - the last used row number of SheetA
lastRowB - the last used row number of SheetB
Please apply this code and see the result.
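If you would rather keep the logic inside a Blue Prism code stage instead of running it as an Excel macro, here is a rough VB.NET sketch of the same dictionary approach. It is only a sketch, assuming the action is added to the MS Excel VBO (so the GetWorkBook helper is available) with hypothetical inputs handle (Number), workbookname, sheetA and sheetB (Text) and outputs message (Text) and result (Flag). It reads each sheet into an array in a single call, which is much faster than reading cell by cell for around 1 lakh rows:

Try
    message = ""
    Dim wb As Object = GetWorkBook(handle, workbookname)
    Dim wsA As Object = wb.Worksheets(sheetA)
    Dim wsB As Object = wb.Worksheets(sheetB)

    ' Find the last used row in column 1 of each sheet (-4162 = xlUp)
    Dim lastRowA As Integer = wsA.Cells(wsA.Rows.Count, 1).End(-4162).Row
    Dim lastRowB As Integer = wsB.Cells(wsB.Rows.Count, 1).End(-4162).Row

    ' Read both sheets into arrays in one call each instead of cell by cell
    Dim dataA As Object(,) = wsA.Range(wsA.Cells(1, 1), wsA.Cells(lastRowA, 2)).Value2
    Dim dataB As Object(,) = wsB.Range(wsB.Cells(1, 1), wsB.Cells(lastRowB, 2)).Value2

    ' Build a lookup of A1 -> A2 (row 1 is assumed to be the header and simply will not match)
    Dim lookup As New System.Collections.Generic.Dictionary(Of String, Object)
    For i As Integer = 1 To lastRowA
        Dim key As String = Convert.ToString(dataA(i, 1))
        If Not lookup.ContainsKey(key) Then lookup.Add(key, dataA(i, 2))
    Next

    ' Update B2 wherever B1 matches an A1 value, then write the whole block back in one call
    For i As Integer = 1 To lastRowB
        Dim key As String = Convert.ToString(dataB(i, 1))
        If lookup.ContainsKey(key) Then dataB(i, 2) = lookup(key)
    Next
    wsB.Range(wsB.Cells(1, 1), wsB.Cells(lastRowB, 2)).Value2 = dataB

    result = True
Catch ex As Exception
    result = False
    message = ex.Message
End Try

The bulk array read and write is what keeps the run time reasonable at this volume; the dictionary itself behaves the same way as in the macro above.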
24-06-24 03:20 AM
Hi @Karthik.choppadandi ,
Whenever you are working with a very large dataset, on the order of lakhs of rows, you cannot rely directly on collections. Blue Prism is a 32-bit application (a 64-bit application is only available from v7.2 onwards), so if you try to store all of this data in a collection, memory can easily run out and crash the Blue Prism service. The same concept applies even if you use OLEDB connections: OLEDB is faster than regular Excel operations, but the problem remains that you cannot hold more than the available memory allows in a collection at any one time.
Hence, the main problem with automating large-dataset use cases through Blue Prism's collection data item is that you need a chunking mechanism: while reading the data from any external source (Excel, database, Access, web table, etc.), you need some logic that breaks the data into buckets, stores one bucket in a collection, and processes it before moving on to the next. So technically, even with OLEDB you would need some sort of chunking logic around your select queries. This can get complicated, and the processing time will still be long. For this reason, your best bet is to work within the external application itself, which in our case is Excel, and perform the operations there without relying on collections at all.
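Just to illustrate the chunking idea, here is a rough VB.NET sketch (purely illustrative, not something I would recommend over the Excel-side approaches; the provider, file path, sheet name, chunk size and the ProcessChunk placeholder are all assumptions) that reads an Excel sheet through the ACE OLEDB provider in buckets of 10,000 rows, so that only one bucket at a time would ever need to sit in a collection:

Imports System.Data
Imports System.Data.OleDb

Module ChunkedExcelRead
    Sub Main()
        ' Placeholder connection string - requires the ACE OLEDB provider to be installed
        Dim connStr As String =
            "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\Data\Input.xlsx;" &
            "Extended Properties=""Excel 12.0 Xml;HDR=YES"""
        Dim chunkSize As Integer = 10000

        Using conn As New OleDbConnection(connStr)
            conn.Open()
            Using cmd As New OleDbCommand("SELECT * FROM [SheetA$]", conn)
                Using reader As OleDbDataReader = cmd.ExecuteReader()
                    ' Build an empty table with the same columns as the sheet
                    Dim chunk As New DataTable()
                    For i As Integer = 0 To reader.FieldCount - 1
                        chunk.Columns.Add(reader.GetName(i), GetType(Object))
                    Next

                    While reader.Read()
                        Dim values(reader.FieldCount - 1) As Object
                        reader.GetValues(values)
                        chunk.Rows.Add(values)

                        If chunk.Rows.Count = chunkSize Then
                            ProcessChunk(chunk) ' hand this bucket to a collection / work queue
                            chunk.Clear()
                        End If
                    End While

                    If chunk.Rows.Count > 0 Then ProcessChunk(chunk)
                End Using
            End Using
        End Using
    End Sub

    ' Hypothetical placeholder for whatever per-bucket processing you need
    Sub ProcessChunk(ByVal chunk As DataTable)
        Console.WriteLine("Processing " & chunk.Rows.Count & " rows")
    End Sub
End Module

Even then, each bucket still has to be processed one by one, which is why I prefer pushing the whole comparison into Excel itself, as described below.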
@faheemsd has provided an excellent approach that works directly within the Excel file itself, and since it uses a dictionary the lookups stay fast even for a very large number of rows.
On the same note, I would also like to share another approach that often stays hidden but is, performance-wise, the best one available out there; it has helped me a lot in the past on use cases similar to yours: Power Query. I have a detailed guide on how it can be used with Blue Prism, but for your use case I can share the refined steps so you can implement it faster:
Creating Power Query Template File:
1) Create a blank Excel workbook and add a sheet called 'Parameters'. This sheet will hold the placeholders for our input values, which in our case are the file paths of the two Excel data sources you want to compare and the sheet names belonging to each of them.
2) For this step, make sure that each data cell where a value will be entered is defined as a named range, otherwise the Power Query file will not work. To create a named range, select the data cell, type a name into Excel's Name Box (to the left of the formula bar, as I have highlighted below) and press Enter (very important, otherwise it does not get saved). You can think of these as variable names that we will use in further steps:
Similarly, I did the same for the other two cells, which I named InputFilePath2 and InputSheetName2.
3) Next, we need to create a Power Query connection. To do this, navigate to Data -> Get Data -> Launch Power Query Editor:
4) In the Power Query Editor, to start with the use case we need to add our data sources. For now we add them manually; later on I will show how to make them dynamic from the Parameters sheet. Add both sources, for Sheet A and Sheet B; here I have only shown it for one sheet:
5) Once you have added your data sources, select the second data source and click Merge Queries -> Merge Queries as New, since we want the comparison results as another table in a separate view/sheet:
6) In the Merge window, select table A and table B, select columns A1 and B1 as the columns to merge on, choose Full Outer Join (we want all rows from both tables whether they match or not) and click OK:
7) This gives us a new table called Merged1, which we can use for further transformation. At this step, also click the expand button to expand all the columns from table B:
8) Now, simply right-click the columns and the table and rename them to make the transformation simpler in the next steps:
New Column Names: B1, B2
New Table Name: Final
9) We will now add a new column to our merged dataset by selecting the last column and then navigating to Add Column -> Conditional Column:
10) In this step we add the condition for a column called 'A': if column A1 is null, pick the value from column B1; otherwise take the A1 value. Enter the condition values as shown below for each highlighted box and click OK:
11) Similarly, add another conditional column as in step 9. This one is called 'B', with the condition: if column A1 equals column B1, take the A2 value; otherwise, if column B1 is null, take the A2 value; otherwise take the B2 value. Then click the OK button:
12) Now we can remove the other columns, as we do not need them. Select them, right-click and choose the Remove Columns option:
13) Click the Close & Load option to save our transformation results:
14) You can now see all the transformed tables presented in different sheets. Since we do not want the A and B sheets, we can delete them:
15) We now have the desired output from the Power Query transformation: one sheet called 'Parameters' and another called 'Final', which holds the comparison results:
Making Power Query Dynamic:
1) Now we need to make the Power Query file dynamic, so that whenever values are entered in the 'Parameters' sheet, the queries pick up the file paths and sheet names from there. To do this, first open the Power Query Editor by navigating to Data -> Get Data -> Launch Power Query Editor:
2) Select the first table and go to the Advanced Editor option. This opens the editor window containing all the logic behind the data transformation steps, which in our case includes reading the data sources, comparing the columns and producing the desired comparison results:
3) In this window we can see the M language code behind all the transformation rules. What we need to do is add the code below at the very beginning of the displayed code:
Code To Add At Beginning:
let
InputFilePath1 = Excel.CurrentWorkbook(){[Name="InputFilePath1"]}[Content]{0}[Column1],
InputSheetName1 = Excel.CurrentWorkbook(){[Name="InputSheetName1"]}[Content]{0}[Column1],
We are simply declaring variables here that refer to the named ranges defined in step 2 of the previous section. When you add these lines, the code will look like this:
Now we need to use the same variables, InputFilePath1 and InputSheetName1, in the following lines where the hardcoded file path and sheet name currently appear, and then click OK:
4) Now we make the same changes in the B table, this time with InputFilePath2 and InputSheetName2. Select table 2 and click Advanced Editor to add the code below:
Code To Add At Beginning:
let
InputFilePath2 = Excel.CurrentWorkbook(){[Name="InputFilePath2"]}[Content]{0}[Column1],
InputSheetName2 = Excel.CurrentWorkbook(){[Name="InputSheetName2"]}[Content]{0}[Column1],
Again, we are simply declaring variables that refer to the named ranges defined in step 2 of the previous section. When you add these lines, the code will look like this:
Now we need to use the same variables, InputFilePath2 and InputSheetName2, in the next line where the hardcoded file path and sheet name currently appear, and then click OK:
5) You will see errors at this point, but do not worry; that is expected, since we have not yet entered any values in the 'Parameters' sheet. Now simply click Close & Load:
Testing Power Query File:
1) To test that everything is working with our file, delete the contents of the Final sheet and then go to Data -> Refresh All -> Connection Properties.
2) Disable the option 'Enable background refresh' and click the OK button:
This step is mandatory, because in the next steps we will invoke the Power Query refresh from Blue Prism and we need this option turned off for that to work.
3) Go back to the Parameters sheet, fill in the values accordingly and click the Refresh All button. You should see the data appear automatically in the Final sheet:
Running Power Query File From Blue Prism:
1) Now go to Blue Prism. You can extend either the 'MS Excel VBO' or the 'MS Excel - Extended' VBO, whichever you prefer, by creating another action called 'Refresh Power Query Connections' with input parameters 'handle' (Number) and 'workbookname' (Text), and output parameters 'message' (Text) and 'result' (Flag), matching the variable names used in the code stage, as shown below:
2) In the code stage you need to set up those parameters and insert the code shown below:
Try
    message = ""
    Dim wb As Object = GetWorkBook(handle, workbookname)
    ' Refresh workbook to execute background queries
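    ' 'Enable background refresh' must be unchecked in the connection properties
    ' (see step 2 of 'Testing Power Query File' above), otherwise RefreshAll can
    ' return before the queries have finished populating the Final sheet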
    wb.RefreshAll
    result = True
Catch ex As Exception
    result = False
    message = ex.Message
End Try
3) To run it, simply create a workflow that copies the Power Query file, enters the parameters using Write Cell actions and then invokes the newly created action shown above to get the results in the Final sheet of the Power Query file. We create a copy of the file because we want a fresh file for every run to ensure the data is populated correctly, although this is optional.
I mainly prefer Power Query because it is very fast for large datasets and is also easier to maintain, as most of the steps are UI based. Let me know if this helps, and I can also share the file if that would be useful!
25-06-24 08:46 AM
Thank you @faheemsd ..
25-06-24 08:49 AM
Thank you @devneetmohanty07
26-07-24 11:10 AM - edited 26-07-24 11:11 AM
I just wanted to pop in and say WOW - @devneetmohanty07 and @faheemsd amazing job! 2 very sound approaches with very clear steps on how to employ them.
And such a great question @Karthik.choppadandi - thank you for taking the time to ask. I hope you got what you needed!
On behalf of the community, thank you so much, great question and incredible answers.