Increased Risk of IT Failure

Risk management is the process of identifying risks and deciding what to do about them. It is increasingly important to IT in general, and to operations groups in particular, because organizations today are more susceptible to service disruptions caused by problems in their IT environments.

Both the number and the severity of potential IT failures (specifically the ones related to IT operations) are increasing over time because:

Business transactions and processes are increasingly dependent on IT, so failures in IT are more likely to affect the business, and that impact is more likely to be severe.
The IT environment is increasingly complex, so even if the environment stays the same size, the number of potential failure points is rising.
IT directly controls less of the infrastructure and so has less ability to react after a failure occurs, which makes managing the possibility of failure in advance more important.
When an IT failure occurs, there is less time between the occurrence of the failure and its impact on the business.
IT failures are increasingly visible outside the data center, so more people are negatively affected when a failure occurs.

In short, IT has more potential to support and enhance business processes than ever before; but, in turn, failures in IT have more potential to disrupt business operations and directly affect an organization's profitability and success.

Business Is More Dependent on IT

Today, more of the systems that IT manages are critical to successful business operations. For example, 10 years ago, communication in many companies was based on such non-IT services as paper memos, an internal mailroom service, an external postal service, and the telephone. Today, IT is responsible for e-mail service, intranets, and Internet sites: communication systems that were not considered business-critical a decade ago. Because of this increasing reliance on IT services, the potential failure of these services presents an increasing source of risk to the business.

IT Environment Is More Complex

A typical IT environment contains more components today than in the past. There are more desktops, servers, and connections, more end-to-end services, and more integration of systems. This is partly due to the move from centralized computing to client/server computing, and more recently toward the vision of Microsoft .NET, in which all objects are logically distributed. As this progression takes place, the number of items in the infrastructure increases, even if the scope of the infrastructure stays the same.

The diversity of the infrastructure has also increased. For example, IT groups that formerly maintained the links between several terminals and a handful of hosts now must keep track of local area networks (LANs), wide area networks (WANs), land lines, dial-up access, wireless links, and internal networks, as well as connections to the Internet. Client systems are another example: in the past, IT dealt with terminals, but today client hardware can range from desktops and laptops to handheld computers, wireless information appliances, and Internet-enabled phones and pagers.

The number of users is also increasing. In the early days of computers, a few operators interacted with a mainframe; later the number of users grew to include a few dozen clerks, then a few hundred knowledge workers on the mainframe and on personal computers. Today, even more customers reach e-commerce sites from their home systems. In addition to their numbers, the autonomy of users is increasing as well. Previously, mainframe users did not upgrade software on their own, but home users now do this all the time.

IT Directly Manages Less of the Infrastructure

Many of the systems that are part of IT services are now managed outside the organization. For example, a retailer that receives orders on its Web site might rely on other companies' systems for credit verification, warehousing, Web services, and shipping.

This "virtual IT environment" does not necessarily increase the potential for failure. It can, in fact, decrease risk, because a service is outsourced to specialists who are best able to operate it and prevent it from failing. However, the trend is important for risk management because the business still expects the IT department to be responsible for the IT infrastructure and end-to-end services, regardless of their source.

Less Time Between Failure and Impact

If a service fails, there is a window of time during which the IT group can attempt to recover the service before the failure directly affects the business. For example, if an organization uses a billing system that prints and mails monthly statements to its customers showing their outstanding balances, and if that system fails, the window of opportunity to fix the problem might be hours or even days, so long as the statements are received in time to allow customers to pay them before they become overdue.

If IT can recover the service within that time, then the organization's customers will receive their payment reminders on time, and the revenue stream won't be interrupted. The customers of an e-commerce site, however, may expect transactions to complete within 10 seconds and to receive e-mail confirmation of each transaction within another 5 minutes. In this scenario, any failure would immediately affect the business and might result in customers giving up and going elsewhere.
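The contrast between the two scenarios comes down to comparing how long recovery takes with how long the business can tolerate the outage. The following is a minimal sketch of that comparison, not a prescribed method. The function name, the two-hour recovery time, and the two-day billing window are hypothetical values chosen for illustration; only the 10-second e-commerce expectation comes from the text above.

```python
from datetime import timedelta


def failure_reaches_business(recovery_time: timedelta,
                             tolerance_window: timedelta) -> bool:
    """Return True if the outage outlasts the window in which
    the business is not yet directly affected."""
    return recovery_time > tolerance_window


# Assumption: the same two-hour repair in both scenarios.
recovery = timedelta(hours=2)

# Assumption: statements can be delayed two days and still arrive in time.
billing_window = timedelta(days=2)

# From the text: customers may expect a transaction to complete in 10 seconds.
ecommerce_window = timedelta(seconds=10)

print("billing affected:", failure_reaches_business(recovery, billing_window))
print("e-commerce affected:", failure_reaches_business(recovery, ecommerce_window))
```

With these assumed figures, the same repair effort leaves the billing service unaffected from the customer's point of view, while the e-commerce service is affected almost as soon as the failure occurs.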

Failure Is More Visible

Years ago, IT managers might have wondered, "If a service fails in the data center and no one notices, is it a crisis?" That question has become irrelevant to many IT groups because IT service failures are immediately noticeable throughout the organization. Five years ago, if your company's Web site was unavailable for an hour, the only people who noticed were your own IT staff. Today, the list of people who would notice that failure might include hundreds of customers, a dozen competitors, and every analyst who tracks your company's stock.

Visibility is important because people not only notice failures, they also react. A case in point is a well-publicized, day-long service outage suffered by an online auction site. Customers noticed it, so to satisfy them, the site's parent company refunded all the fees it collected for every auction in progress, a sum reportedly equal to one-third of the company's quarterly profits. Analysts and investors noticed the problem, too: the company lost 25 percent of its market capitalization in two days.

The first four trends described previously make failure more likely and more severe; the visibility of mission-critical systems outside the company amplifies the severity of failure.