on 01-08-24 03:10 PM - edited 4 weeks ago by Michael_S
Disaster Recovery (DR) ensures the smooth continuity of banking operations even in the face of disruptions such as natural disasters, cyber-attacks, or system failures. This continuity is crucial for maintaining financial stability, as any significant downtime or data loss could impact not only the bank’s services but also the broader financial system.
In addition, banks are required to adhere to strict regulatory standards set by the Central Bank for disaster recovery and business continuity. Effective disaster recovery plans help banks meet these legal requirements, avoiding potential fines and penalties.
The practical challenges that led to the requirement of this automation were:
We started by gathering and analyzing the AS-IS manual process documents from the application teams to define the scope of the TO-BE automation. We then conducted a comprehensive study of the risk controls and governance aspects, focusing on the security controls needed to prevent accidental executions, secure credential storage, and so on. Additionally, we used Blue Prism Capture on a couple of processes to record the steps, screenshots, and exceptions, and reused these screenshots in the official approval document.
Based on the analysis, the RPA team proposed a TO-BE solution design to meet the requirements.
We decided to use a Master process with multiple Work queues to handle the different DR execution types, such as Full DR, Partial DR, and Core Banking only DR. Additionally, to meet regulatory requirements, we built a Human In The Loop approval process (Maker & Checker) that triggers the automation via an Interact form.
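As a rough illustration of this design (not the actual Blue Prism implementation), the routing by DR type and the Maker & Checker rule could be sketched in Python as below. The queue names, the `DRRequest` class, and the `route_request` function are all hypothetical; in the real solution these roles are played by Blue Prism Work queues and the Interact form.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical queue names, one per DR execution type mentioned in the post.
QUEUE_BY_DR_TYPE = {
    "Full DR": "DR_FULL",
    "Partial DR": "DR_PARTIAL",
    "Core Banking only DR": "DR_CORE_BANKING",
}


@dataclass
class DRRequest:
    dr_type: str
    maker: str                     # person who submitted the form
    checker: Optional[str] = None  # person who approved it, if anyone has


def route_request(request: DRRequest) -> str:
    """Return the work queue for an approved request, or raise if it is invalid."""
    if request.dr_type not in QUEUE_BY_DR_TYPE:
        raise ValueError(f"Unknown DR type: {request.dr_type}")
    if request.checker is None:
        raise PermissionError("Request has not been approved by a checker")
    if request.checker == request.maker:
        # Assumption: Maker & Checker means two different people.
        raise PermissionError("Maker and checker must be different people")
    return QUEUE_BY_DR_TYPE[request.dr_type]
```

For example, `route_request(DRRequest("Partial DR", maker="alice", checker="bob"))` returns the hypothetical queue name `"DR_PARTIAL"`, while an unapproved or self-approved request is rejected before anything is queued.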
The team used Agile methodology to build the system. One group of developers created the Master Process (the main template) to manage work queues, set priorities, trigger automation through web service API calls to the relevant BOT, write logs, send emails, and so on. Another team of developers worked on creating reusable, resilient objects for Linux automation, database automation, Windows automation tasks, and the actual development of the DR process.
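To make the Master process responsibilities a little more concrete, here is a minimal Python sketch of priority-ordered dispatch with logging. It is only an assumption about how such a loop might look: the `dispatch_in_priority_order` function is hypothetical, the `trigger` callable stands in for the web service API call to the relevant BOT, and stopping at the first failed step is our illustrative choice rather than documented behavior.

```python
import heapq
from typing import Callable, List, Tuple


def dispatch_in_priority_order(
    items: List[Tuple[int, str]],
    trigger: Callable[[str], bool],
) -> List[str]:
    """Trigger each step in ascending priority order and return the log lines.

    `items` holds (priority, step_name) pairs; a lower number runs first.
    `trigger` is a stand-in for the call that starts the relevant BOT and
    returns True on success.
    """
    heap = list(items)
    heapq.heapify(heap)
    log: List[str] = []
    while heap:
        priority, step = heapq.heappop(heap)
        ok = trigger(step)
        log.append(f"{priority}:{step}:{'OK' if ok else 'FAILED'}")
        if not ok:
            # Assumption: a failed step halts the remaining steps.
            break
    return log
```

Passing the trigger in as a parameter keeps the sketch independent of any particular web service client, so it can be exercised with a simple stub function.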
Each release was thoroughly tested independently on the relevant UAT and staging servers/nodes, then integrated with the Interact form and tested through the Master process.
We followed our organization's delivery and change management practices to ensure smooth and seamless release/deployment. The initial few execution cycles involved active participation from application owners and the RPA team, and eventually, every release was handed over to the IT Operation team to manage future executions.
Below are the intangible benefits that were observed:
Here are a few main challenges observed:
Building the Master process took a significant amount of time and effort. We had to rework the process four to five times to accommodate the different DR execution types and to improve handling and optimization. We are now considering using Blue Prism Desktop for single-app DR failover/failback activities instead of the Master process.
Using Interact helps achieve Human In The Loop (HITL) automation: it provides a human approval checkpoint for critical steps and applications and for complex decisions.
An incredible project @sastharpa - thank you so much for sharing with us! You mentioned that you're using Interact for human-in-the-loop, is there anything you can tell us about what exactly the humans are inputting / checking in the process?
Hi @Michael_S
The automation is fully based on HITL, both because it involves complex decision making and because governance requires a secondary approval. The journey starts when a process controller fills in the form (selecting the appropriate DR type, the CAB number from the internal change request system, or the application for which to trigger automation); the form then has to be approved by an approver. Humans are also involved after every critical system stop or start: the technical owners/IT Operation have to confirm that the previous system has completely stopped or started before the next critical system is handled, because of dependencies. For instance, the core DB can be switched to DR only if all the production systems are stopped and there are no active sessions/threads on the DB.
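The core DB example can be expressed as a simple gate. The Python sketch below is illustrative only: the `can_switch_core_db` function, its parameters, and the `"stopped"` status value are hypothetical, and in the real process the confirmation comes from a human through Interact rather than from a boolean flag.

```python
from typing import Dict


def can_switch_core_db(
    system_status: Dict[str, str],
    active_db_sessions: int,
    owner_confirmed: bool,
) -> bool:
    """Allow the core DB switch only when every precondition holds.

    system_status maps each production system name to its current state.
    owner_confirmed represents the human confirmation from the technical
    owner or IT Operation.
    """
    all_stopped = all(state == "stopped" for state in system_status.values())
    return all_stopped and active_db_sessions == 0 and owner_confirmed
```

With one system still running, any active session on the DB, or no human confirmation, the function returns `False` and the switch is held back.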