In the era of digital transformation, the stability and availability of IT services are becoming the most important prerequisites for the operation of any large business. When you support tens of thousands of users, you cannot keep them working without a systematic approach and round-the-clock response. Ivan Kuznetsov, head of Supporting Services Discipline, explains how the CSC-IT Situation Centre, which monitors and responds to incidents 24/7, is organized. Ivan shares his experience of building a structure that ensures problems are responded to and eliminated promptly.
Ivan, how did the CSC-IT Situation Centre come about, and why was it needed?
First of all, the Situation Centre was created to provide round-the-clock support for IT infrastructure and information security. Serving more than 20,000 users, we faced an obvious problem: standard response processes could no longer guarantee prompt resolution of incidents. The business expected not just support, but continuous readiness and instant response in any situation.
We quickly realized that responding effectively to emergency incidents requires a centralized structure that monitors the state of the infrastructure day to day and minimizes the impact of those incidents on business processes.
What are the key functions of the Situation Centre?
The Situation Centre performs a wide range of tasks. First and foremost, it provides round-the-clock monitoring of the infrastructure and prompt response to emergency incidents. We don't just fix problems; we proactively identify them. For example, the analytics tools we've implemented allow us to predict potential failures and fix them before they affect users.
First-line support is another important function of the Centre. We close most tickets at this level, which has significantly reduced resolution time: 25% of user tickets are resolved within the first hour, and 50% are resolved on the day they are submitted. This became possible thanks to diagnostic checklists and manuals for specialists.
In addition, the Centre acts as a single entry point for all tickets. Users know that all they have to do is contact us, and their tickets will be handled professionally and escalated automatically if necessary.
How do you ensure responsiveness and proactive incident management?
The effectiveness of the Situation Centre depends on established processes and the technologies we have implemented. We actively use monitoring systems that operate in real time and immediately alert the team to any deviations. However, monitoring by itself is just a tool; analytics and pre-prepared solutions are more important.
We have diagnostic checklists and response scenarios for each type of ticket, which allows us to quickly determine the cause and promptly notify everyone concerned. This not only shortens response time but also reduces the workload on employees, as they can quickly make the right decision.
For proactive management, we use predictive models that help identify potential problems. For example, analysing historical data allows us to identify areas where failures are more likely to occur, so that we can act before they happen.
How does the Situation Centre support the business under high load conditions?
Our main objective is to ensure the continuity of key business processes. To do this, we actively apply three principles: redundancy, standardization, and automation.
First, we have disaster recovery plans (DRPs) that are regularly updated and tested. We test not only the recovery process itself, but also its speed to ensure that downtime is minimized.
Second, we have standardized all processes. This applies not only to incident and emergency management, but also to related tasks such as procurement, change requests, and problem management. This approach eliminates errors and reduces the time required to complete typical operations.
Finally, automation. We have implemented tools that not only track incidents but also help us manage them. For example, responsible employees are notified about the beginning and end of an emergency within 15 minutes, which allows us to quickly inform the business about the current status.
What skills does your team need to provide this high level of service?
Working in the Situation Centre requires high stress tolerance, excellent analytical skills and a deep understanding of infrastructure. However, even these qualities are not enough if there is no well-developed system for training new employees.
We have developed detailed instructions for all processes and implemented a mentoring system. This has allowed us to significantly reduce the time it takes to train newcomers and to bring them into the work quickly.
In addition, teamwork is important. Every specialist knows that when a complex incident occurs, they can count on the support of colleagues. This mutual support is one of the key factors in the successful operation of the Centre.
What is the role of automation in IT process management?
Automation is not just a tool; it is the foundation of the Situation Centre's operations. We have automated the majority of routine processes, which lets our team focus on complex tasks.
For example, we use diagnostic scripts that analyse a problem and propose ways to handle it. This helps minimise both diagnosis time and incident resolution time.
Automated systems for change, problem, and procurement management are also integrated into our infrastructure. This not only improves accuracy but also speeds up task execution.
Which achievements do you consider the most important?
The main achievement for us is the stable operation of the Centre in 24/7 mode. We launched first-line support, organized monitoring, and achieved impressive results: user satisfaction with our services reached 98%.
We have also reduced incident resolution time: 25% of tickets are resolved within the first hour, and 50% are resolved on the day they are submitted. We have sped up recovery after incidents, keeping downtime for the business to a minimum: a complete system recovery takes us less than 45 minutes. We have also reduced the number of emergencies through proactive work: over the past year, this number decreased by 30%.
Importantly, we now have a clear understanding of how to anticipate and address issues before they affect the business. We have also improved our alerting, which keeps users aware of all changes and increases their confidence in what we are doing.
Which directions of development do you see for the Centre in the coming years?
In the near future, our priorities are further automation of processes and integration of new analytical tools. We are striving to make the Centre more proactive, so that the business can be sure its support means not just a quick reaction, but the elimination of risks before they arise.
A separate direction is the launch of a project based on artificial intelligence, which will help predict emergencies and reduce the number of tickets. This tool will allow us to analyse data even more accurately, identify latent dependencies, and take preventive measures.
The Situation Centre continues to develop, but it has already become an important tool for supporting the business in a high-load and constantly changing environment. Ambitious plans require a lot of effort, and we are sure that they will bring even more impressive results!