Managing human risk to the data centre
Unfortunately, human beings are the largest single point of failure in the modern data centre. Human error still accounts for a large proportion of outages, and not enough is being done to address the issue.
Findings from a recent Uptime Institute survey show that outages are becoming more frequent and more expensive. Now is the time for the industry to start paying attention and look for ways to reduce highly preventable downtime.
When humans are involved, mistakes are going to happen. It’s inevitable. However, simply accepting the risk that humans make mistakes is a mistake in itself, as the frequency and cost of human error-related outages continue to grow. Processes and rigorous professional development plans must be the norm.
We all know that outages are hugely costly for organisations. It’s not just the financial impact of downtime; the detrimental effect an outage can have on brand reputation, customer confidence and perceived compliance can be even more devastating.
Organisations need to learn from their previous outage experiences and mitigate the inevitable risk of human error by continually assessing, reviewing and enhancing the skill sets of their teams. Investing in staff education, training and personal development can pay significant dividends. Educated, officially certified individuals could potentially save their employer millions by doing things right the first time, reducing the possibility of an outage, or by recognising and resolving an issue early, before it becomes more serious.
The Uptime Institute’s global data center survey 2020 reveals that 78% of organisations have had an IT-related outage in the last three years, with 75% saying that their most recent outage could have been prevented with better management, making a large proportion of outages the result of human error. This figure has increased by 15% since 2019.
It’s unwise for organisations to treat preventable outages as an acceptable fact of life, especially when they are also growing more costly. The Uptime Institute survey also reveals that in 2020 nearly one in six outages cost more than $1mn, as opposed to one in ten in 2019. The percentage of outages costing between $100,000 and $1mn also increased.
If organisations get better at spotting the knowledge, competency and skills gaps in their teams and invest to fill these gaps, whilst ensuring the processes and procedures are kept up to date, the picture could be very different.
There are numerous opportunities for organisations to take important steps towards mitigating human risk. Industry-supported education programs award official certifications and qualifications, and advances in individual and team analytical tools, backed by science and psychological methodology, can identify exactly where knowledge, competency and even confidence levels are lacking.
It’s industry best practice to regularly test and monitor mission-critical equipment throughout its lifecycle. As an industry, we service our technical equipment to check that it is still functioning as expected, plan its future lifespan, and renew or restore it to prevent outages. The same thinking needs to be applied to the teams working in data centres.
The individuals responsible for outages are not looking to sabotage. They are usually experienced members of the technical team who, for one reason or another, are not following processes or have gaps in knowledge, competence or confidence. When people have been doing the same job for an extended period, their confidence can take over, causing them to overlook details and specific processes, which in turn can cause catastrophic failures. They could be confidently doing things wrong.
One of the big challenges organisations face is that continual professional development budgets are usually limited, or cut to boost other areas of the organisation. There is also a common misconception about how education and training should be allocated: these activities are often used as a reward for the most loyal or highest-performing people, rather than given to those who actually need them the most. As a result, employees gain very little from the development activities, which provides little or no benefit or ROI to the organisation itself. The risk data centre operators are taking by not investing in their people is massive, and could cost them thousands per minute during an outage.
The Uptime Institute survey also states that, with more investment in management, process and training, outage frequency would almost certainly fall significantly. Hopefully, this will sound alarm bells and prompt the rest of the industry to turn its attention to these areas. The pandemic has highlighted the critical importance of the digital infrastructure industry, and demand is only going to increase.
Alongside an increasing skills shortage and an ageing workforce, this is a stark warning that if organisations don’t develop, train and invest in teams effectively throughout the entire workforce, outages are likely to become bigger and more expensive.
With an ageing workforce, many experienced industry professionals will soon be looking to retire, taking decades of industry and on-the-job experience with them. Are the team members who will be taking their place really sufficiently trained, experienced and ready to handle any future issues that might arise? Organisations need to address the problem head on instead of waiting for things to start going wrong.