Failure and Restore

Even with the best maintenance practices, data might become corrupted, interrupting Operations Manager 2007 functionality and causing data loss. Failures have various causes. The most common causes fall into the following categories:

Hardware failure, such as a storage system failure that affects data availability or integrity
Security breach or virus infection
Accidental deletion or corruption of Active Directory Domain Services (AD DS) security information, such as accounts or groups
Physical disaster

This section describes various Operations Manager failure scenarios and how to restore Operations Manager components to resume service.

Impact of Failure in Operations Manager 2007

Various Operations Manager servers and components can fail, affecting Operations Manager functionality. The amount of data and functionality lost differs in each failure scenario. It depends on the role of the failing component, on the Operations Manager deployment, on the length of time it takes to restore the failing component, and on the availability of backups.

Reduce the Impact of Failure

The effects of some server failures can be reduced significantly by adding redundancy or implementing a failover solution, such as clustering. With redundancy or failover in place, restoring the failed server also becomes much less urgent.

The following list includes configuration options that add redundancy and clustering to the Operations Manager deployment. Implementing any of these options reduces the impact of failure and contributes to high availability of Operations Manager in your organization:

Add Management Servers
Install the Root Management Server into a Microsoft Cluster service failover cluster
Store databases in a Microsoft Cluster service failover cluster
Configure gateway servers for failover
Configure log shipping
Configure cross Management Group failover

Each option is further described in the Operations Manager 2007 Operations Guide. For further information about deployment options that help ensure high availability and reduce the impact of failure, see the Operations Manager 2007 Deployment Guide.

Failure Recovery Scenarios for the Root Management Server

This topic describes three failure recovery scenarios for Root Management Servers:

If the Root Management Server fails, promote another management server to be the Root Management Server.
If the Root Management Server fails, promote another management server to be the Root Management Server. After the original Root Management Server is available again, you have the option of promoting it back to being the Root Management Server.
If the Root Management Server fails, promote another management server to be the Root Management Server. In the future, you can set up a new management server and promote it to be the Root Management Server.

Each scenario is further described in the Operations Manager 2007 Operations Guide.
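
The three scenarios differ only in which server holds the Root Management Server role after the initial promotion. As a rough illustration, the following Python sketch models them with a toy ManagementGroup class. The class, its methods, and the server names MS1, MS2, and MS3 are hypothetical and are not part of Operations Manager; the sketch only tracks role changes and does not perform the actual promotion procedure.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class ManagementGroup:
    """Toy model of a management group (hypothetical, not an Operations Manager API)."""
    servers: Set[str] = field(default_factory=set)  # available management servers
    root: Optional[str] = None                      # current Root Management Server

    def fail(self, name: str) -> None:
        # A failed server is no longer available; if it held the root role,
        # the group is left without a Root Management Server.
        self.servers.discard(name)
        if self.root == name:
            self.root = None

    def add(self, name: str) -> None:
        # A restored or newly installed management server becomes available.
        self.servers.add(name)

    def promote(self, name: str) -> None:
        # Only an available management server can take the root role.
        if name not in self.servers:
            raise ValueError(f"{name} is not an available management server")
        self.root = name


def scenario_1() -> ManagementGroup:
    # The root fails; another management server is promoted.
    group = ManagementGroup(servers={"MS1", "MS2"}, root="MS1")
    group.fail("MS1")
    group.promote("MS2")
    return group


def scenario_2() -> ManagementGroup:
    # As scenario 1, then the original root returns and is promoted back.
    group = scenario_1()
    group.add("MS1")
    group.promote("MS1")
    return group


def scenario_3() -> ManagementGroup:
    # As scenario 1, then a new management server is set up and promoted.
    group = scenario_1()
    group.add("MS3")
    group.promote("MS3")
    return group
```

In every scenario the first step is the same, so the model makes clear that the second and third scenarios are optional follow-ups to the first rather than alternatives to it.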

High Level Restore Guidelines in Operations Manager 2007

See Table: General Restore Steps in the Operations Manager 2007 Operations Guide for information about possible failure scenarios and the general steps required to restore operations and data in each scenario.