06-08-21 09:59 AM
Data Gateways Tech Tip
When using Data Gateways to output data to a database, we need a table in a database other than the Blue Prism database with the below definition. The table can be called anything; in this example I've called it DataGatewayDestination.
CREATE TABLE [dbo].[DataGatewayDestination](
[EventType] [int] NULL,
[EventData] [nvarchar](MAX) NULL
) ON [PRIMARY]
GO
The scripts used in the remainder of this post contain database and object names that are specific to my local "test" instance of SQL Server. They are meant as examples of how to improve both querying and retaining data in a Data Gateways database destination. As such:
*** The scripts should be tested in lower route-to-live environments (Development / Test / UAT / SIT) and the results verified before being run against a Production environment ***
*** Always make sure that you have backups in place that can be used to recover any data that may be removed ***
THIS CODE AND INFORMATION ARE PROVIDED "AS IS" WITHOUT WARRANTY OF
ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED
TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR FITNESS FOR A
PARTICULAR PURPOSE.
On my local instance of SQL Server, I have an MIRepository database in which I store my Data Gateways database destination data. We can view the raw data using the below query.
SELECT [EventType],[EventData]
FROM MIRepository.dbo.DataGatewayDestination;
The value of the EventType column is defined in the documentation as below.
1 = Session Log
2 = Published Dashboard
3 = Custom
4 = Work Queue Analysis
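As a quick way of seeing which of these event types are actually being exported, and in what volume, the below query (a simple example against my MIRepository database) counts the rows per EventType.
SELECT [EventType], COUNT(*) AS [RecordCount]
FROM MIRepository.dbo.DataGatewayDestination
GROUP BY [EventType]
ORDER BY [EventType];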
The EventData column data is stored as a JSON string of key-value pairs, so to return just the session log records we can use the below query.
SELECT [EventType],[EventData]
FROM MIRepository.dbo.DataGatewayDestination
WHERE EventType = 1;
But what if there are millions of records? How can we efficiently filter the EventData column? The answer is "it depends". If the table resides in a database on a SQL Server instance running SQL Server 2016 or later, Azure SQL Database or Azure SQL Managed Instance, then it is quite simple: we can use the JSON_VALUE() function (prior to SQL Server 2016 this is much more difficult and possible methods are omitted from this post). The below query uses the JSON_VALUE() function to check the StartDate key and return all records with a StartDate later than 01/01/2021.
SELECT [EventType],[EventData]
FROM MIRepository.dbo.DataGatewayDestination
WHERE EventType = 1
AND JSON_VALUE([EventData], '$.StartDate') > '2021-01-01T00:00:00+00:00';
We can even use date manipulation; the below query converts the value of the StartDate key to a DATE data type and returns all records older than 183 days.
SELECT [EventType],[EventData]
FROM MIRepository.dbo.DataGatewayDestination
WHERE EventType = 1
AND CONVERT(DATE, JSON_VALUE([EventData], '$.StartDate')) < GETDATE() - 183;
We can even use wildcard searches; the below query returns records where the value of the ResourceName key contains "BPEU5".
SELECT [EventType],[EventData]
FROM MIRepository.dbo.DataGatewayDestination
WHERE EventType = 1
AND JSON_VALUE([EventData], '$.ResourceName') LIKE '%BPEU5%';
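If these JSON_VALUE() filters are run frequently against a large table, one further option on SQL Server 2016 and later is to index the JSON value via a computed column. The below is a hedged sketch rather than something Data Gateways requires; the SessionStartDate column name is my own, and as always it should be tested in a non-production environment first. Because the computed column uses the same expression as the WHERE clauses above, queries filtering on JSON_VALUE([EventData], '$.StartDate') can then make use of the index.
-- Add a computed column over the StartDate key and index it
ALTER TABLE [dbo].[DataGatewayDestination]
ADD [SessionStartDate] AS JSON_VALUE([EventData], '$.StartDate');
GO
CREATE NONCLUSTERED INDEX [IDX_DataGatewayDestination:SessionStartDate]
ON [dbo].[DataGatewayDestination] ([SessionStartDate]);
GO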
Out of the box, Data Gateways allows you to export operational Blue Prism data, but there is no functionality to help maintain it once it has been exported. As this data can contain session log information, the storage requirements can often be in the terabytes (1 terabyte = 1024 gigabytes) depending on how long we retain the data for. I have seen customers with a short 1-month retention period and several hundred gigabytes of data in their session log table. Imagine then using Data Gateways, exporting the data and leaving it unchecked; it can quite easily become a storage and administrative burden. Retaining 200GB of session log data per month in a Data Gateways database destination for 12 months would require around 2.4TB of disk space; keep the same data for 7 years and that is a whopping 16.8TB!
The JSON_VALUE() function is also an excellent option when it comes to maintaining this data: we can write DELETE queries to remove historical data that is no longer needed, which will help keep the size of the data at an acceptable level depending on requirements. The below query deletes all session log data from the Data Gateways database destination table older than 183 days.
DELETE FROM MIRepository.dbo.DataGatewayDestination
WHERE EventType = 1
AND CONVERT(DATE, JSON_VALUE(EventData, '$.StartDate')) < GETDATE() - 183;
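Before running a DELETE like this against real data, it is worth previewing how many records would be affected. A simple check, sketched below, is to run the same filter as a SELECT COUNT(*) and verify the number matches expectations, which ties in with the earlier advice about testing in non-production environments and having backups in place.
SELECT COUNT(*) AS [RecordsToDelete]
FROM MIRepository.dbo.DataGatewayDestination
WHERE EventType = 1
AND CONVERT(DATE, JSON_VALUE(EventData, '$.StartDate')) < GETDATE() - 183;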
Another option, which is much more efficient, is to add an additional InsertedDate column with a default of GETDATE(). This will allow for easy clean-up of all EventTypes and works on all versions of SQL Server, as we no longer need to use the JSON_VALUE() function to identify the records to delete. The Data Gateways database destination table schema would then look like the below.
CREATE TABLE [dbo].[DataGatewayDestination]
(
[EventType] [INT] NULL,
[EventData] [NVARCHAR](MAX) NULL,
[InsertedDate] [DATETIME] NOT NULL
CONSTRAINT [DF_DataGatewayDestination:InsertedDate]
DEFAULT (GETDATE())
) ON [PRIMARY];
GO
If we already have a Data Gateways database destination, we can still make use of the additional column. Instead of creating the table, we simply extend the existing table definition and add the new InsertedDate column, making the column nullable as below.
ALTER TABLE [dbo].[DataGatewayDestination]
ADD [InsertedDate] [datetime] NULL CONSTRAINT [DF_DataGatewayDestination:InsertedDate] DEFAULT (GETDATE());
GO
This will allow the InsertedDate to be populated for all new records whilst existing rows are left NULL. Now, to delete all records older than 183 days, the query looks like the below.
DELETE FROM MIRepository.dbo.DataGatewayDestination
WHERE InsertedDate < GETDATE() - 183;
Before we do so, however, we can benefit from adding a NONCLUSTERED INDEX on the InsertedDate column to improve the efficiency with which SQL Server identifies the records to be deleted; for this we can use the below.
CREATE NONCLUSTERED INDEX [IDX_DataGatewayDestination:InsertedDate] ON DataGatewayDestination (InsertedDate);
GO
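On very large tables a single DELETE of this kind can hold locks and grow the transaction log considerably. A common pattern, sketched below as a suggestion rather than anything Data Gateways provides, is to delete in batches so each transaction stays small; the batch size of 10,000 is arbitrary and should be tuned and verified in a non-production environment first.
-- Delete qualifying rows in batches of 10,000 until none remain
WHILE 1 = 1
BEGIN
    DELETE TOP (10000)
    FROM MIRepository.dbo.DataGatewayDestination
    WHERE InsertedDate < GETDATE() - 183;

    IF @@ROWCOUNT = 0
        BREAK;
END;
GO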
Once the InsertedDate column has been in place for longer than the retention period (so any row still showing NULL must be older than 183 days), we can then clean up the NULL records too using the below query.
DELETE FROM MIRepository.dbo.DataGatewayDestination
WHERE InsertedDate IS NULL;
At this point we have a Data Gateways database destination table with an additional InsertedDate column to store the date the record was inserted, and a non-clustered index on the InsertedDate column to help SQL Server find our records, but we still have room for improvement. We can add a second additional column to use as a Primary Key. A Primary Key is used to uniquely identify a record in a table within SQL Server. With this in place, the Data Gateways database destination table schema, complete with non-clustered index, would look like the below.
CREATE TABLE [dbo].[DataGatewayDestination]
(
[DataGatewayDestinationID] BIGINT IDENTITY(1, 1)
CONSTRAINT [PK_DataGatewayDestination:DataGatewayDestinationID] PRIMARY KEY,
[EventType] [INT] NULL,
[EventData] [NVARCHAR](MAX) NULL,
[InsertedDate] [DATETIME] NOT NULL
CONSTRAINT [DF_DataGatewayDestination:InsertedDate]
DEFAULT (GETDATE())
) ON [PRIMARY];
GO
CREATE NONCLUSTERED INDEX [IDX_DataGatewayDestination:InsertedDate] ON DataGatewayDestination (InsertedDate);
GO
An added benefit is that, by default, a Primary Key also creates a clustered index on the primary key column(s), meaning SQL Server will physically sort and store the records based on the key column(s) to further improve the efficiency with which it can access the data. These two additional columns are completely transparent to the Blue Prism Data Gateways engine, yet they give us additional performance gains and, more importantly, make managing the data much easier.
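If you want to confirm what has been created, a quick check (run while connected to the destination database, MIRepository in my case) is to list the indexes on the table; you should see the clustered index backing the Primary Key alongside the non-clustered index on InsertedDate.
USE MIRepository;
GO
SELECT [name] AS [IndexName], [type_desc] AS [IndexType]
FROM sys.indexes
WHERE [object_id] = OBJECT_ID('dbo.DataGatewayDestination');
GO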
I hope you find this Tech Tip useful and that it has given you some thoughts about working with Data Gateways data once it has been exported to a database destination.
#BPTechTips