Vincent Renaud, PE
Principal, Blackdog CFS
The idea of a High-Reliability Organization (HRO) is nothing new. Its principles have long embodied the methods and outcomes of submarine crews, firefighting crews, and Special Operations teams. What is less common is applying those principles to the development and continued operation of a data center, and that application forms the basis of the Blackdog CFS approach.
A High-Reliability Organization is defined as an organization that has succeeded in avoiding catastrophes in an environment where normal accidents (or failures) can be expected due to risk factors and complexity. You may say, “Hey, I don’t have high risk or complexity in my data center!” Well, you most certainly do. What is the price of a downtime event in your data center, to you and to your customers? Be assured that the social or political ramifications can outweigh the immediate financial impact.
HRO is a key ingredient of your data center’s governance. Governance is simply how the site is managed, how the overall data center investment is protected, and how continuous availability goals are achieved.
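Exactly what is an HRO? Three characteristics best define it: a fierce commitment to a common objective, a preoccupation with failure, and unparalleled attention to detail.
Let’s review these in detail as they apply to the Data Center Operations team: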
A Fierce Commitment to a Common Objective: The common objective is the availability of the entire facility and IT platform. This means that every operator, technician, supervisor, manager, and senior manager depends on the others to do their jobs. Thorough, tested, and understandable procedures are the foundation. Following them to the letter, and holding each other accountable for doing so, is the follow-through required to ensure success.
A Preoccupation with Failure: This is proactive rather than reactive thinking, and it is critical when dealing with a data center. The operations team should regularly anticipate the actions or conditions that could lead to an outage. This characteristic demands predictive maintenance. Even the seemingly innocent act of technicians entering the computer room could be the catalyst for a chain of occurrences that leads to an outage. Constantly assessing what is lurking in the infrastructure, the operations scheme, or upcoming maintenance actions embodies this characteristic.
Unparalleled Attention to Detail: Ever heard the phrase “don’t sweat the small stuff”? Well, when dealing with data centers, you MUST concentrate on the minutiae that make up the larger picture of a successful data center portfolio. It is the very small details that will make or break your operation. For example, what position is that valve in? What is the normal or standard configuration of the site infrastructure? Is the labeling of distribution paths and capacity components truly unique and easily understood by the operators? (Who cares what the design engineers think!)
How to Apply HRO Principles to Your Operation
Applying the principles and characteristics of a High-Reliability Organization to your data center portfolio takes commitment. It could be considered “cultish,” as it represents a way of life. It should consume every minute that the team is on-site (and much of the time they are off-site as well). It starts with a disciplined Governance Program. Such an overarching program details the operational objectives, policies, and protocols aimed at reducing operational risk, and it gets your management on board. Reviewing some of the obvious elements of a successful operation, let’s see what can be done:
a. Vendor Training – Training is conducted on specific subsystems throughout the site. It covers how each piece of equipment is expected to perform on its own.
b. Tabletop – Sitting at a round table and posing situations to the staff. This gives good insight into whether they really know the infrastructure.
c. Mock Drills – Posing situations to staff and having them actually walk through how they would respond. This typically works well for Emergency Response Procedures.
d. System Awareness – Many operators are trained well by the vendors to understand individual subsystems. It is important, though, that each operator understand how the entire data center infrastructure works. Understanding the interrelationships between electrical and mechanical systems is paramount. This is one of the most common shortfalls of critical facility operations staff.
There are other important parameters across the entire data center life cycle to which HRO principles can be applied. Most of all, those principles should be the basis for how the entire team, both facilities and IT, approaches every aspect of the data center. Sitting back and letting events or other organizations dictate what takes precedence, and how, is self-defeating. Instead, develop a plan and execute! Take charge and ensure that the entire team understands the importance of a deliberate and well-planned operations and management scheme guided by a rigorous Governance architecture.
About Us: Blackdog CFS (www.blackdogcfs.com) is a Veteran Owned Business that provides industry-leading data center infrastructure and management consulting services. Our goal is to utilize our time-proven approach to critical facility operations to help your company realize your critical facility infrastructure operational continuity objectives.