cancel
Showing results for 
Search instead for 
Did you mean: 

Remove duplicates from collection with key field

jgregor4
Level 6
I am trying to remove rows from a collection where there are duplicates in the collection but based on a specific column.
I have an action that will remove duplicates but this only works if everything in the row is duplicated.

InputCollection.AcceptChanges()
OutputCollection = InputCollection.Defaultview.ToTable(True)

InputCollection = Nothing
​

Is there a way to modify this so that i can specify a column to be used to check for the duplicate entry and then remove the row if it is leaving just unique entries?

------------------------------
James Gregory
RPA Developer
Europe/London
------------------------------
5 REPLIES 5

John__Carter
Staff
Staff
Hi James - you can do it with a DataView if you specify the columns that form a distinct row. It's not ideal because here you're not saying 'give me all rows where column X is unique', instead you're saying 'give me unique rows made of columns A,B,C...Z'. Basically you have to provide a list of all the column names you need in the output.

Dim v As New System.Data.DataView(Collection_In)
Collection_Out = v.ToTable(True, Column_Names_CSV.Split(","))

On the other hand, this technique is also a handy way to get a subset of columns from a collection.

------------------------------
John Carter
Professional Services
Blue Prism
------------------------------

I don't think I explained properly what the issue is.
There is a form whereby the submit button doesn't disable when pressed and so duplicate requests are being received.
Each request has a unique ID but the reference number and name will be the same.
I cant use the standard "Remove Duplicates" as the ID is different for each request:

ID | Ref | Name
1  | abc | John
2  | abc | John
3  | xyz  | Mike

So I want to have something where I can same, remove duplicates where Ref is not unique which would output the collections as:

ID | Ref | Name
1  | abc | John
3  | xyz  | Mike



------------------------------
James Gregory
RPA Developer
Europe/London
------------------------------

Hi James,

I've had a quick play around with the existing Utility - Collection Manipulation and have modified the Remove Null Rows action to do what I think you are after. Here's the code:

'Set local variables
dim iRow as integer
dim count as integer
dim rows as integer

Try
	'Initialise row counts and prepopulate output table
	rows = Collection_In.Rows.Count - 1
	Collection_Out = Collection_In
	iRow = 0
	
	'Loop through each row in the input collection
	Do While iRow < rows
		'Reset counter
		count = 0
		
		'Select all rows in the output collection which match the specified value in the input collection
		For Each row2 As DataRow in Collection_Out.Select( String.Format("[{0}] ='{1}' or trim([{0}]) = '{1}'", Key_Field,Collection_In.Rows(iRow).Item(Key_Field)))
			'If this row is the first found, move on to the next one, otherwise delete it
			If count > 0 then
				'Delete duplicated row
				row2.Delete()
				'Reduce the total number of rows in the output collection
				rows = rows - 1
			End If
			'Increase the count of duplicates
			count = count + 1
		Next
	'Increase the count of rows being checked
	iRow = iRow + 1
	Loop
	
	Success = True

Catch ex as Exception
	Error_Message = ex.ToString()
	Success = False

End Try

​


The basic premise is to copy the input collection to the output collection, then loop through the input collection and search the output for duplicates. Then, delete any duplicates from the output and adjust the count values to suit. I've tested this against a small dataset and it seems to work as expected. I've also attached the full object (including my test dataset in the action) to set you on your way.

Hopefully this helps, but feel free to ask for clarification if anything isn't clear or quite right!

Kind regards,



------------------------------
Sam Stone
CoE Build Lead
Blue Prism
------------------------------
Sam Stone CoE Build Lead [CompanyName]

Hi Sam, I tried to used this code today but am getting an error. The action worked for only 1 set of duplicates, and deleted the newest rather than the oldest row. It didn't work for the other duplicate rows. Also in my case, another field determines which should be kept. 
Thanks for helping and taking this on in the first place 🙂

System.Data.SyntaxErrorException: Syntax error: Missing operand after 's' operator.
at System.Data.ExpressionParser.Parse()
at System.Data.DataExpression..ctor(DataTable table, String expression, Type type)
at System.Data.DataTable.Select(String filterExpression)

------------------------------
Bernadette Stroeder
S&P Global
------------------------------

Hi @BernadetteStroe

Are you using the ​code above?

It's looks like a problem with this line:
For Each row2 As DataRow in Collection_Out.Select( String.Format("[{0}] ='{1}' or trim([{0}]) = '{1}'", Key_Field,Collection_In.Rows(iRow).Item(Key_Field)))​


Do you have any " ' " character in your name field?

You are concatenating, so, if any of the values that you are going to replace with the String.Format() function contains " ' " you will break the SQL Syntax, for example:

"[field] ='value's' or trim([field]) = 'value's'"


Let us know if this solves the problem.


Hope this helps you!



See you in the community,
Bye 🙂



------------------------------
Pablo Sarabia
Architect
Altamira Assets Management
Madrid
634726270
------------------------------