While IT managers dread system downtime, the harsh reality is that even the best plans and preparation cannot cover every circumstance. Pieter van der Merwe, Availability Solutions Architect, Africa & Middle East at Stratus Technologies, says it is often the simplest oversights that escalate into serious events that are difficult, and often costly, to remedy.
“In many instances downtime is the result of some form of human error. It is often the environment that is overlooked, rather than underinvestment in software or hardware availability solutions,” he says.
Van der Merwe cites an example: a back-up for something as simple as the server room’s air conditioning unit often does not feature in business continuity plans.
“While protecting your organisation’s hardware and software in every way possible is of utmost importance and cannot be overemphasised, it is often the unrelated or seemingly insignificant environmental factors that come back to bite you.
“Should the air conditioning unit in a server room malfunction or shut down, your servers will inevitably overheat and shut down, resulting in critical downtime that could have a material or reputational impact on the organisation,” he comments.
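One simple mitigation is to monitor temperature and shut servers down gracefully before heat damage occurs. The following is a minimal illustrative sketch in Python, not a production tool: it assumes a Linux host exposing temperatures through /sys/class/thermal, and the threshold and check interval are hypothetical values that would need tuning per site.

    # Minimal sketch: watch CPU/zone temperature and trigger a graceful
    # shutdown before heat damage occurs. Assumes Linux's
    # /sys/class/thermal interface; threshold and interval are illustrative.
    import glob
    import subprocess
    import time

    THRESHOLD_C = 75.0   # hypothetical safe limit; tune per hardware
    INTERVAL_S = 30

    def max_temp_c():
        temps = []
        for path in glob.glob("/sys/class/thermal/thermal_zone*/temp"):
            with open(path) as f:
                temps.append(int(f.read().strip()) / 1000.0)  # millidegrees -> C
        return max(temps) if temps else 0.0

    while True:
        if max_temp_c() > THRESHOLD_C:
            # An alerting hook (email, SNMP trap, pager) would go here.
            subprocess.run(["shutdown", "-h", "+5", "Server room overheating"])
            break
        time.sleep(INTERVAL_S)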
In addition to environmental factors, Van der Merwe says complexity and a lack of understanding of the setup as a whole can also contribute to downtime. “For instance, substantial investment may have been made in a fault-tolerant server, but a lack of planning or oversight, combined with an illogical data centre layout, sees even the best-trained people making mistakes,” he adds.
“Almost all servers today are ‘dual-corded’, connected to two separate power feeds. In this scenario it is possible for an electrician to mix up the feeds, leaving the power supplies connected to a server and its communication devices unsynchronised and your applications vulnerable to downtime,” he explains.
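Mis-cabled feeds of this kind can often be caught by routinely querying each server’s power supply status, for example over IPMI. Below is a minimal illustrative sketch in Python; it assumes a host with ipmitool installed and suitable privileges, and the sensor output format shown in the comment varies by vendor, so the parsing is an assumption rather than a guarantee.

    # Minimal sketch: check that every power supply sensor reports "ok"
    # via ipmitool. Output format differs by vendor; parsing is illustrative.
    import subprocess

    result = subprocess.run(
        ["ipmitool", "sdr", "type", "Power Supply"],
        capture_output=True, text=True, check=True,
    )
    for line in result.stdout.splitlines():
        # Typical vendor output: "PS1 Status | 6Ch | ok | 10.1 | Presence detected"
        fields = [f.strip() for f in line.split("|")]
        if len(fields) >= 3 and fields[2] != "ok":
            print(f"WARNING: {fields[0]} reports '{fields[2]}' - check feeds and cabling")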
In addition to complexity, Van der Merwe says poor planning is another contributing factor to downtime. “A good example is the case of an organisation in Nigeria that was conducting routine maintenance and mistakenly switched the diesel generator to manual mode instead of automatic.
“The problem was compounded by the fact that the next day was voting day and movement was restricted. As a result, staff were unable to get to the site to rectify the situation, and the UPS battery eventually ran flat, causing a power failure and an outage,” he comments.
In terms of implementing measures to prevent downtime, Van der Merwe stresses the importance of simplifying the environment. “The more complex the environment, the longer it will take to rectify or recover from downtime. It is advisable to assess the environment, determine core functionality, and identify where fault-tolerant solutions may be necessary. Ideally, organisations should implement active/active functionality, ensuring a high-availability cluster is in place.
“In addition, organisations should ensure that their operating systems are up to date, that applications are simplified, and that multiple versions are not being run. Patches should also be applied as required to fix errors or vulnerabilities in software. Consistent labelling and use of colour codes may seem obvious, but very often this is lacking, and confusion and mistakes start to creep in,” he adds.
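Active/active, as Van der Merwe describes it, means every node serves live traffic simultaneously, so the loss of one node does not interrupt service. A minimal illustrative sketch in Python of client-side failover across two active nodes; the endpoint URLs are hypothetical placeholders, and a real deployment would typically sit behind a load balancer rather than hard-coded endpoints.

    # Minimal sketch of client-side active/active failover: try each live
    # node in turn, so losing one node does not mean downtime.
    import urllib.error
    import urllib.request

    # Hypothetical endpoints; in an active/active cluster both serve
    # live traffic, so either can answer any request.
    ENDPOINTS = [
        "https://node-a.example.com",
        "https://node-b.example.com",
    ]

    def fetch_with_failover(path):
        last_error = None
        for base in ENDPOINTS:
            try:
                with urllib.request.urlopen(base + path, timeout=3) as resp:
                    return resp.read()
            except OSError as exc:  # covers URLError, timeouts, refused connections
                last_error = exc    # this node is unreachable; try the next one
        raise RuntimeError(f"all nodes failed, last error: {last_error}")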
Van der Merwe also stresses that people and process work hand in hand. “It is critical to test any new system in a test environment before deploying it into production. This ensures that any instability which may impact performance is picked up and can be rectified. Failing to invest in rectifying the weakest link will inevitably cause the system to fail, resulting in downtime. In today’s always-on world this is a risk most organisations will not be prepared to take,” he concludes.
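In practice, that discipline can be as simple as an automated smoke test that must pass against the test environment before a release is promoted. A minimal sketch using pytest, with a hypothetical staging URL standing in for your own environment:

    # Minimal pytest smoke-test sketch: run against the test/staging
    # environment before promoting a release to production.
    import urllib.request

    STAGING_URL = "https://staging.example.com/health"  # hypothetical placeholder

    def test_staging_service_responds():
        # Promote only if the staging health check answers with HTTP 200.
        with urllib.request.urlopen(STAGING_URL, timeout=5) as resp:
            assert resp.status == 200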