When Jeff Bezos wrote his first letter to Amazon shareholders in 1997, he said, “But this is Day 1 for the Internet…” He has signed off every subsequent letter with “It remains Day 1!”
While this is mainly a statement of the philosophy of the leadership and strategy of Amazon, it’s also a literal take on where we are with adoption of the internet. The early days (for those who were there) of 56K dial-up via AOL (who?) have been superseded by massive disruption and innovation impacting our everyday lives.
And data centres are at the centre of the known universe.
The subscription economy that spawned Netflix, Zuora, Spotify and many others has already led to 89% of British people using a subscription service (source: YouGov, 2017). This new subscription or consumption model is not restricted to online services; it extends to more traditional services as well (Uber, Airbnb etc.), which are underpinned by booking access via the internet.
Humming away in the background of our 21st-century lives are the data centres that feed our consumption through a myriad of broadband connections. We notice neither until they are not there or decide to run slowly.
Earlier this month, Data Centre World 2019 was held at ExCeL London with over 20,000 industry professionals in attendance.
With so much riding on data centre availability (or uptime), it’s no surprise that much of the event focused on this and, in particular, on the associated operational risks (e.g. configuration of failover clusters). The average cost of a single data centre outage was estimated by PG&E to be more than $700,000, which is bad enough, but the impact on SLAs and overall uptime statistics is much greater.
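To put uptime targets in perspective, here is a minimal sketch of how an availability percentage translates into a yearly downtime budget. The function name and the list of targets are our own illustrative assumptions, not figures from any particular SLA.

```python
# Illustrative only: converts an availability target into allowed downtime.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


# Hypothetical targets, chosen only to show how quickly the budget shrinks.
for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% uptime allows {downtime_budget_minutes(target):.1f} min/year")
```

At “five nines” the budget is a little over five minutes a year, so a single incident of any length can consume it.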
Not covered as much at the event was the impact of safety-related incidents on data centre availability. Perhaps that was due to the industry’s good overall safety performance in recent times. With no fatalities in the UK data centre industry over the past four years and a non-fatal injury record that is better than the industry average, it would be easy to dissociate operational risk from safety risk when looking at uptime/availability.
However, we believe the two are inextricably linked. Not only can safety incidents (e.g. arc flashes) directly lead to downtime, but the drive for 100% availability can introduce and increase safety risks.
Firstly, let’s look at the inherent safety risks. Some, such as work at height, lifting and handling, and hot works, are similar to the risks experienced in other industries. The first two are unlikely to cause an outage, but hot works might. If work can’t be done ‘cold’ or in an area designated for hot works, then extreme care needs to be taken to prevent fire, and a hot works permit must be requested and approved before work starts.
What is unique about data centres (excluding the fact they’re forecast to consume one-fifth of the world’s electricity by 2025) is the amount of electricity and stored energy that needs to be dealt with in any maintenance and repair activities. Isolations, de-energising, and lock-out/tag-out (LOTO) all need to be carefully planned and permitted to ensure safe working.
Incorrect de-energising and a failure to isolate correctly have been the cause of several accidents. Arc flashes can occur and the intense heat and pressure caused are a constant threat that can lead to outages and worse. The real danger with arc flashes is that they can be initiated accidentally anywhere there is electricity. A risk assessment, or potential hazard analysis, won’t necessarily detect where/when an arc flash may happen.
The obvious solution to the risk of arc flashes is not to work on energised equipment. This is where the drive for 100% availability and a zero-risk approach to safety become mutually exclusive. De-energising equipment means availability can only be maintained through redundancy. The Electricity at Work Regulations 1989 cover the use of risk assessments and formal authorisations to work on live electrical systems, and the broad guidance is naturally to work on de-energised systems. However, the duty holder(s) can decide which exceptions justify working on live systems. Sensible exceptions include:
- where interrupting the electricity supply would endanger human life (e.g. safety lighting would be impacted)
- where testing can only occur if the equipment is energised
An unacceptable exception would be working live simply to maintain high levels of uptime for commercial reasons, e.g. where the appropriate levels of redundancy are not in place.
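The link between de-energising and redundancy can be made concrete with a rough sketch. It assumes identical units that fail independently and a service that stays up as long as at least one unit is running. Real installations also suffer common-cause failures, so treat the numbers as optimistic; the function names and the 99.9% figure are hypothetical.

```python
# Simplified model: identical, independent units in parallel.
# Common-cause failures (shared feeds, human error) are NOT modelled.


def parallel_availability(unit_availability: float, units: int) -> float:
    """Availability when at least one of `units` identical units must be up."""
    if units < 1:
        return 0.0  # nothing left to carry the load
    return 1 - (1 - unit_availability) ** units


def availability_during_maintenance(unit_availability: float, units: int) -> float:
    """Availability while one unit is de-energised for safe working."""
    return parallel_availability(unit_availability, units - 1)


unit = 0.999  # hypothetical availability of a single power path
for n in (1, 2, 3):
    normal = parallel_availability(unit, n)
    isolated = availability_during_maintenance(unit, n)
    print(f"{n} unit(s): normal {normal:.6f}, one isolated {isolated:.6f}")
```

Under these assumptions, two units at 99.9% each give about 99.9999% together but fall back to 99.9% while one is isolated, and a single unit gives nothing at all. That gap is the commercial pressure to work live, and it is closed by adding redundancy rather than by accepting the safety risk.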
This is obviously a very difficult area with undoubted variance by geography. However, what is less difficult is the assertion that operational risk and safety risk are so intertwined as to be one and the same.