Digital worker-driven Disaster Recovery Automation

sastharpa · ‎01-08-24

Automation Use Case: Making sure our organization is always ready to recover from a disaster

In this Blueprint

SS&C Blue Prism MVP Ganesh describes how he used digital workers to ensure continuity of banking operations during unpredictable disasters.

Products Used: SS&C Blue Prism Enterprise, Capture, Interact
Target department: Information Technology
Sector: Financial Services

Disaster Recovery (DR) ensures the smooth continuity of banking operations even in the face of disruptions such as natural disasters, cyber-attacks, or system failures. This continuity is crucial for maintaining financial stability, as any significant downtime or data loss could impact not only the bank’s services but also the broader financial system.

In addition, banks are required to adhere to strict regulatory standards set by the Central Bank for disaster recovery and business continuity. Effective disaster recovery plans help banks meet these legal requirements, avoiding potential fines and penalties.

Two practical challenges led to the requirement for this automation:

Assembling all the people needed for the activity (the application owners) is often a big challenge and requires significant coordination effort.
Employee satisfaction is another significant challenge. These activities usually take place at odd hours (typically 2 AM to 5 AM), when the bank's network traffic is low, and employees find them tedious and tiring when they recur at regular intervals.

The discovery phase

We started by gathering and analyzing the AS-IS manual process documents from the application teams to define the scope of the TO-BE automation. We then conducted a comprehensive study of the risk controls and governance aspects, focusing on the level of security controls needed to prevent accidental executions, to store credentials, and so on. Additionally, we used Blue Prism Capture on a couple of processes to record the steps, screenshots, and exceptions, and included these screenshots in the official approval document.

Designing the automation

Based on the analysis, the RPA team proposed a TO-BE solution design to meet the requirements.

We decided to have a Master process with multiple work queues to handle the different DR execution types, such as Full DR, Partial DR, and Core Banking only DR. Additionally, to meet regulatory requirements, we built a Human-in-the-Loop approval process (Maker & Checker) that triggers the automation via an Interact form.
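The design above can be illustrated with a minimal sketch, written in Python rather than in Blue Prism itself. It assumes one work queue per DR execution type and a Maker & Checker rule that the approver must be a different person from the requester. The queue names, the `DRRequest` class, and its fields are hypothetical and are not taken from the actual implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DRType(Enum):
    FULL = "Full DR"
    PARTIAL = "Partial DR"
    CORE_BANKING_ONLY = "Core Banking only DR"


# Hypothetical queue names; the post does not name the real work queues.
QUEUE_FOR_TYPE = {
    DRType.FULL: "DR.Full",
    DRType.PARTIAL: "DR.Partial",
    DRType.CORE_BANKING_ONLY: "DR.CoreBanking",
}


@dataclass
class DRRequest:
    """A DR execution request submitted by a maker and approved by a checker."""

    dr_type: DRType
    change_number: str  # reference to the change request (assumed field)
    maker: str
    checker: Optional[str] = None

    def approve(self, checker: str) -> None:
        # Maker & Checker: the same person may not both request and approve.
        if checker == self.maker:
            raise PermissionError("maker and checker must be different people")
        self.checker = checker

    def target_queue(self) -> str:
        # Only approved requests are routed to a work queue.
        if self.checker is None:
            raise RuntimeError("request has not been approved")
        return QUEUE_FOR_TYPE[self.dr_type]
```

In this sketch an unapproved request cannot reach any queue, which mirrors the intent of preventing accidental executions.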

Building the automation

The team used Agile methodology to build the system. One group of developers created the Master process (the main template) to manage work queues, set priorities, trigger automation through web service API calls to the relevant bot, write logs, send emails, and so on. A second group of developers created reusable, resilient objects for Linux, database, and Windows automation tasks, and developed the actual DR processes.
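As a rough illustration of the Master process responsibilities described above (ordering work by priority, triggering the relevant bot, and writing logs), here is a small Python sketch. The `MasterDispatcher` class is hypothetical, the `trigger` callable stands in for the web service API call, and stopping at the first failure is an assumption of this sketch rather than a documented behavior of the real process.

```python
import heapq
from typing import Callable, List, Tuple


class MasterDispatcher:
    """Orders DR work items by priority and hands each to a trigger callable."""

    def __init__(self, trigger: Callable[[str], bool]) -> None:
        # `trigger` stands in for the web service call to the relevant bot.
        self._trigger = trigger
        self._heap: List[Tuple[int, int, str]] = []
        self._counter = 0  # tie-breaker: keeps insertion order within a priority
        self.log: List[str] = []

    def add(self, item: str, priority: int) -> None:
        """Queue a work item; a lower number means it runs earlier."""
        heapq.heappush(self._heap, (priority, self._counter, item))
        self._counter += 1

    def run(self) -> bool:
        """Process items in priority order; stop at the first failure (assumed)."""
        while self._heap:
            _, _, item = heapq.heappop(self._heap)
            ok = self._trigger(item)
            self.log.append(f"{item}: {'completed' if ok else 'failed'}")
            if not ok:
                return False
        return True
```

Passing the trigger in as a callable keeps the ordering and logging logic testable without a live bot endpoint.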

Testing and improving

Each release was thoroughly tested on its own on the relevant UAT and staging servers/nodes, then integrated with the Interact form and tested through the Master process.

Moving to production

We followed our organization's delivery and change management practices to ensure a smooth release and deployment. The first few execution cycles involved active participation from the application owners and the RPA team; eventually, every release was handed over to the IT Operations team to manage future executions.

The Outcome

Below are the intangible benefits that were observed:

Increased employee satisfaction: the last few activities were executed with only 5 critical personnel instead of the usual 20+ employees.
Always ready: the failover and failback steps and procedures for business-critical applications are well documented and kept in a DR-ready state in case of a real disaster.

Lessons Learned

Here are the main challenges we observed:

Difficulty in setting up new environments to simulate the Production and DR nodes on UAT and staging.
Gaps between the production and UAT environments: some steps could not be reproduced because of licensing and technical issues.
Some processes that were developed and tested outside production were never executed in production because of business criticality and downtime constraints.
Despite a proper change management policy, a few application changes at the server level went unnoticed and caused failures in subsequent automation executions.

Building the Master process took us considerable time and effort. We had to change it four to five times to accommodate the different DR execution types, improve handling, and optimize it. We are now considering using Blue Prism Desktop for single-application DR failover/failback activities instead of the Master process.

Using Interact helped us achieve Human-in-the-Loop (HITL) automation: it provides a human approval checkpoint for really critical steps and applications and for complex decisions.


Michael_S · ‎05-08-24

An incredible project @sastharpa - thank you so much for sharing with us! You mentioned that you're using Interact for human-in-the-loop, is there anything you can tell us about what exactly the humans are inputting / checking in the process?

sastharpa · ‎14-08-24

Hi @Michael_S

The automation is fully based on HITL, both because it involves complex decision-making and because governance requires a secondary approval. The journey starts when a process controller fills in the form, selecting the appropriate DR type, the CAB number (from our internal change request system), or the application for which to trigger the automation. The form then has to be approved by an approver. Humans are also involved after every critical system stop or start: because of dependencies, the technical owners or IT Operations have to confirm that the previous system has completely stopped or started before the automation moves on to the next critical system. For instance, the core DB can be switched to DR only if all the production systems are stopped and there are no active sessions or threads on the DB.
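To make the core DB example concrete, here is a minimal Python sketch of such a gate. It assumes the human confirmations are collected as a mapping from system name to a stopped/not-stopped flag and that the number of active DB sessions is supplied by a separate check. The function name and parameters are hypothetical.

```python
from typing import Dict, Iterable


def can_switch_core_db(
    confirmed_stopped: Dict[str, bool],
    production_systems: Iterable[str],
    active_db_sessions: int,
) -> bool:
    """Return True only when the core DB may be switched to DR.

    confirmed_stopped holds the human confirmations gathered so far;
    a system with no confirmation is treated as still running.
    """
    all_stopped = all(
        confirmed_stopped.get(name, False) for name in production_systems
    )
    return all_stopped and active_db_sessions == 0
```

Treating a missing confirmation as "still running" keeps the gate closed by default, so the switchover proceeds only after every dependency has been explicitly confirmed.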

SS&C Blue Prism Community