Are ITEE engineers getting a bad rap?

Tuesday, 16 February 2016

News article written by Corbett Communications. The statements made or opinions expressed do not necessarily reflect the views of Engineers Australia.

Whether Australia’s mobile, broadband or banking networks go down due to “an embarrassing human error”, as Telstra recently declared, a botched upgrade or some other technology failure, the word “outage” has entered the vernacular with startling frequency, as has the corporate apology that follows. But are the engineers behind the networks getting a bad rap when the blame may lie more with corporate decision-making?

Just over a week ago, Telstra mobile customers experienced a mass outage across a number of states, leaving them without any service due to a “node malfunction”. Ten nodes across Telstra’s network enable the company to manage traffic and connections for voice and data right across Australia, chief operations officer Kate McKenzie revealed.

“Normally we could take down three or four of those nodes but on this occasion the correct procedure was not followed,” she said. “The outage was triggered when one of these nodes experienced a technical fault and was taken offline to fix.”

The Telstra outage resulted in disconnections and heavy congestion on the nodes that remained, causing potentially life-threatening situations in some areas, such as Western Australia’s Myalup fire zone, where the SMS and online warning system went down.

ITEE College chair Geoff Sizer said blaming the recent Telstra outage simply on human error on the part of an individual undertaking maintenance activities overlooked key issues.

“Having a system design which is apparently vulnerable to a single point of failure is inconsistent with sound mission-critical system design and risk management practices,” he said. 

“Assuming that this was a known vulnerability, the maintenance action should have been supervised and cross-checked. It is also of concern that when it occurred, the problem seems to have taken excessive time to recognise and rectify.”

In December, also in WA, almost 46,000 iiNet customers lost internet access due to “a major fault”, leaving them without broadband service for up to 36 hours.

"Engineers are continuing to work with our vendor to identify the root cause of the issue," iiNet stated on its website at the time. "Works are also currently underway to redirect affected customers to alternate networking hardware to reduce impact."

Telecommunications engineer and Immediate Past Chair of the ITEE College Peter Hitchiner told Monitor that attention to risk management is an essential part of corporate governance, and that because there is no endeavour without risk, engineers design to reduce risk to a level that is affordable.

“The community has grown accustomed to relying on telecommunications and data networks so the impact of failures is that much more pronounced. There is a cost associated to lowering risk whether it be in making the processes and procedures more robust or in making the networks increasingly resilient to failures,” Hitchiner said.

“If the community wants the very minimum of service unavailability designed to mitigate to the maximum extent the risk of failure it may cost more than the user is prepared to pay. There's a balancing act here and engineers are best placed to manage the risk issues.”

Optus mobile services also suffered a “major” network outage in July 2015 that lasted at least three hours and affected thousands of customers across NSW, Victoria and Tasmania. An earlier outage in June affected customers in NSW for several hours after a reportedly botched upgrade to its network.

"Overnight, Optus carried out some upgrade works to our network which may be causing intermittent interruptions for customers. Our technicians are working to restore services as quickly as possible. We apologise to customers and appreciate their patience."

Major outages have been a regular occurrence for the Commonwealth Bank and, to a lesser degree, the other major banks, and the impact felt by customers and businesses can be enormous, as in the St George outage that affected ATMs and credit cards throughout the October long weekend.

“The problem has arisen following a regular upgrade to the bank’s computer systems over the weekend … and apologise to customers for the inconvenience that this incident has caused.”

ANZ Bank’s post to its Facebook page in November summed it up for all banks and telcos in Australia: "Our sincere apologies for the inconvenience … thank you for your patience."