Troubleshooting the ESI Service and ESI SCOM Management Packs

This section lists known problems and limitations of the ESI Service and ESI SCOM Management Packs, with steps to prevent, resolve, or work around each one.

Each entry below lists a symptom, followed by its prevention, resolution, or workaround.

When upgrading to version 3.5 (or later versions) of the ESI SCOM Management Packs, reimporting the management packs in SCOM fails.

SCOM requires that you delete the existing ESI version 2.1 SCOM Management Packs from SCOM before you can install and import the latest version of the SCOM Management Packs. Installing the ESI SCOM Management Packs provides instructions.

After upgrading ESI and importing the latest version of the ESI SCOM Management Packs, your override settings for the ESI SCOM management packs no longer exist in SCOM.

The EMC.SI.Customization.xml management pack file contains your SCOM overrides and customizations. When importing the management packs into SCOM, you might have overwritten this file and lost your settings.

You can reimport the latest backup copy of this file to get your customizations back.

Notice: This file is installed with version number 1.0.0.0. You can increment the version number when you make changes.

The Microsoft TechNet articles How to Import a Management Pack in Operations Manager 2007 and How to Import a Management Pack in Operations Manager 2012 provide instructions for importing the management packs.

SCOM degrades the health state of the snapshot pool for reserved LUN pools of VNX or CLARiiON block systems, regardless of the true health state of the reserved LUN pool in Unisphere, which causes a warning error and generates an alert.

Unisphere does not provide an operational status for reserved LUN pools, so SCOM defaults to "unknown" for the health state of the snapshot pool. This unknown state in SCOM degrades the health of the system, which generates an incorrect warning error and alert in SCOM.

You can disable the health monitor for the snapshot pools in SCOM to avoid this incorrect warning error and alert. How to Enable or Disable a Rule or Monitor on Microsoft TechNet provides instructions.

  • Some or all system components are not discovered or monitored by SCOM.

  • Event 104 appears in the SCOM agent event log, which includes basic connection information and an HTTPS link to EMC SI Service that cannot be completed.

  • EMC SI Service Discovery in SCOM cannot connect to the ESI Service, or the connection time is unacceptable.

  • Workflow processes are timing out.

  • Confirm that the firewall settings are correct for both the SCOM agent and ESI Service.

  • Enable and use EMC SI Windows Service Monitoring to confirm that the ESI Service is running on the ESI host.

  • Open this link on the SCOM Agent machine: https://<ESI Service IP>:<https port>/esi/console/graph/Entities?class=StorageSystem, replacing <ESI Service IP> and <https port> with the applicable values. Then confirm that the load time displayed at the bottom of the page is less than one second. Repeat this step a few times for consistent results.

  • Confirm that the SCOM agent connects successfully with the ESI Service. The connection information and link are provided in Event 104. Use the credentials specified in Setting up the EMC SI Monitoring Profiles for the SCOM agent. Open the link provided in the Event 104 description and confirm that the request completes in less than two minutes. If the connection fails, investigate the cause and update the EMC SI discovery overrides accordingly. Changing the discovery interval overrides provides more details for changing the overrides.
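
The load-time check described above can also be repeated from a script on the SCOM agent machine. The following is a minimal sketch, not part of ESI: the function names are hypothetical, the host and port are placeholders, and the script does not handle authentication, so a service that requires credentials will raise an HTTP error. It measures client-side elapsed time, which includes network latency and can therefore be higher than the load time shown at the bottom of the page.

```python
import ssl
import time
import urllib.request


def build_entities_url(host: str, https_port: int) -> str:
    """Build the StorageSystem entities URL described in the step above."""
    return f"https://{host}:{https_port}/esi/console/graph/Entities?class=StorageSystem"


def time_request(url: str, timeout: float = 30.0, verify_tls: bool = True) -> float:
    """Fetch url once and return the elapsed wall-clock time in seconds."""
    context = ssl.create_default_context()
    if not verify_tls:
        # Assumption: some ESI hosts use a self-signed certificate.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout, context=context) as response:
        response.read()
    return time.perf_counter() - start


def summarize(samples, limit_seconds: float = 1.0) -> dict:
    """Summarize repeated timings against the one-second target."""
    return {
        "max": max(samples),
        "mean": sum(samples) / len(samples),
        "all_under_limit": all(s < limit_seconds for s in samples),
    }


if __name__ == "__main__":
    # Hypothetical host and port; substitute the values for your ESI Service.
    url = build_entities_url("192.0.2.10", 54501)
    samples = [time_request(url, verify_tls=False) for _ in range(5)]
    print(summarize(samples))
```

Taking several samples mirrors the advice to repeat the step a few times for consistent results.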

  • The list of physical or logical components displayed in the views is not current.

  • New system changes are not being discovered in SCOM.

If the components do not appear after the configured refresh interval has passed, try toggling the Enabled override setting for the EMC SI Service Discovery in SCOM:

  1. Open the EMC SI Service Discovery Overrides Properties window.

  2. Clear the checkbox for the Enabled override setting and click Apply to apply the change.

  3. Before closing the window, select the Enabled checkbox again.

This override change triggers the discovery process. After toggling the override setting, check the Operations Manager event log on the SCOM agent for two events in sequence: Event 1201 followed by Event 1210. If both events occurred in that order, the discoveries should be current.
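
If the event IDs have been exported from the Operations Manager event log in chronological order, the sequence check can be sketched as follows. The function name and the export step are assumptions for illustration; the sketch only checks that Event 1201 appears and is later followed by Event 1210.

```python
from typing import Iterable


def discovery_completed(
    event_ids: Iterable[int], start_id: int = 1201, end_id: int = 1210
) -> bool:
    """Return True if start_id occurs and is later followed by end_id."""
    seen_start = False
    for event_id in event_ids:
        if event_id == start_id:
            seen_start = True
        elif event_id == end_id and seen_start:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical export: other events may appear between the two.
    print(discovery_completed([6022, 1201, 6022, 1210]))
```

Other events between the two do not affect the result, because only the relative order of 1201 and 1210 matters.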

The Subscribed Capacity Presentation view in SCOM does not display the system serial numbers.

To work around this SCOM limitation, you can create a group for each system and a favorite view for the group:

  1. In Operations Manager, go to Authoring > Groups > Create a new Group, enter a name, and select EMC Storage Integrator Customizations management pack to save changes.

  2. Select Dynamic Members > Create/Edit rules, and select EMC SI Storage System > Add.

  3. Select Serial Number Equals <serial number> and Create.

  4. To create a view for the group, select My Workspace > Favorite Views > New > Performance View, type a name for the view, select collected by specific rules, choose Storage Pool Available Capacity Performance Collection, and click Create.

Only the pools of the specified systems are displayed.

  • Updates to the component health status take more than 40 minutes to appear in the SCOM views.

  • Long delays exist between changes in health of components and the changes being updated in SCOM views.

  • Confirm that the related SCOM agent is running without performance problems. If errors occur, troubleshoot those as described in the previous resolution.

  • Reduce the ESI Service System Refresh Interval, which is set to 30 minutes by default. Changing the system refresh interval has more details.

  • Reduce the Interval override for monitors that experience latency, which is set to six minutes by default. Changing discovery interval overrides has more details.

One or more systems do not appear in the SCOM view and are not discovered by SCOM.

  • Confirm that the related SCOM Agent successfully connects to the ESI Service.

  • Confirm that the system is registered with the ESI Service.

  • Confirm that the System Filter file exists on the related SCOM agent and has the correct list of ESI Service Registered System Friendly Names.

  • SCOM agent is experiencing performance problems due to a large number of monitored component instances.

  • The list of discovered components is not complete. The health state of components is not current, and other suggestions do not work.

  • Event 6022 from the Health Service Script does not appear in the Operations Manager event log on the SCOM agent machine for more than 15 minutes.

  • Performance counters related to the CPU or memory usage are typically hitting the maximum limits.

  • Event 21411 from the Health Service modules appears and includes the "process will be dropped because it has been waiting in the queue for more than 10 minutes" message.

  • Event 1101 from the Health Service appears multiple times in the Operations Manager Event log on the SCOM agent computer.

  1. Increase the local data queue on the SCOM agent machine by updating the following registry key. Replace <MG> with the SCOM management group name, and set the size to a value between the default 15360 (15 MB) and 102400 (100 MB):

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\HealthService\Parameters\Management Groups\<MG>\MaximumQueueSizeKb

  2. Restart the HealthService.
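
The registry change in this procedure could be scripted roughly as follows. This is a sketch under stated assumptions rather than a supported tool: the function names are hypothetical, MaximumQueueSizeKb is assumed to be a DWORD value under the management group key, and the write step runs only on Windows with administrator rights. The size limits are the ones given in the procedure.

```python
MIN_QUEUE_KB = 15360   # default size, 15 MB
MAX_QUEUE_KB = 102400  # upper bound, 100 MB

# Key path relative to HKEY_LOCAL_MACHINE, without the management group name.
BASE_KEY = r"SYSTEM\CurrentControlSet\Services\HealthService\Parameters\Management Groups"


def queue_key_path(management_group: str) -> str:
    """Return the registry key path for the given management group."""
    return BASE_KEY + "\\" + management_group


def validate_queue_size_kb(size_kb: int) -> int:
    """Reject sizes outside the documented 15360-102400 KB range."""
    if not MIN_QUEUE_KB <= size_kb <= MAX_QUEUE_KB:
        raise ValueError(
            f"size must be between {MIN_QUEUE_KB} and {MAX_QUEUE_KB} KB, got {size_kb}"
        )
    return size_kb


def set_queue_size(management_group: str, size_kb: int) -> None:
    """Write MaximumQueueSizeKb (Windows only, needs administrator rights)."""
    import winreg  # imported here so the helpers above also load on other platforms

    size_kb = validate_queue_size_kb(size_kb)
    with winreg.OpenKey(
        winreg.HKEY_LOCAL_MACHINE,
        queue_key_path(management_group),
        0,
        winreg.KEY_SET_VALUE,
    ) as key:
        # Assumption: the setting is stored as a DWORD value under this key.
        winreg.SetValueEx(key, "MaximumQueueSizeKb", 0, winreg.REG_DWORD, size_kb)


if __name__ == "__main__":
    # Hypothetical management group name; replace with your own.
    print(queue_key_path("MyManagementGroup"))
    print(validate_queue_size_kb(30720))
```

The HealthService still has to be restarted afterward for the new queue size to take effect, as the procedure states.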

Try the Flush Health Service and Cache task. To do this:

  1. In Operations Manager, go to Monitoring > Operations Manager > Agent-Details > Agent Health State and, in the Agent State view, click the SCOM agent machine.

  2. In the Health Service Tasks section of the Actions pane, run the Flush Health Service and Cache task.

Adding resources to share the monitoring can improve data and I/O performance for large storage environments. Consider adding more SCOM agents or more ESI Services to share the monitoring of multiple systems with heavy traffic. With more than one SCOM agent to monitor one or more ESI Services, you can assign fewer systems to each SCOM agent or each ESI Service. Use the System Filter file to assign systems to different SCOM agents.

Sample event logs can also provide assistance with diagnosing issues.

  • SCOM agent changes to a gray state or the discovery is not complete within an acceptable time.

  • Too many LUN masking views are being discovered.

  • SCOM does not discover all of the storage groups.

  • The system has more storage groups than SCOM discovers.

The discovery override for the EMC SI Storage Group limits the number of discovered instances in SCOM. Confirm that the override has the correct limit.

The maximum limit for this override is 5000. To improve performance, change the discovery override to a smaller number.

  • The ESI Service or the SCOM agents have connection problems.

  • Time-out error message Event 104 or Event 21402 occurs.

Check the Operations Manager event log for any events and also try to connect to ESI Service from a web browser on the SCOM agent computer. Changing HTTP connection defaults has more details.

The Sample event logs include an Event 21402 log example.

Proxy Monitoring is not available (grayed out) in Operations Console and Event 623 occurs in Operations Manager.

If the Proxy Monitoring agent is monitoring systems with a large number of components, distributing the monitoring of systems to more proxy agents may solve the problem. Or you can make the registry change as described in One or more management servers and their managed devices are dimmed in the Operations Manager Console of Operations Manager at http://support.microsoft.com/kb/975057.

SCOM does not discover a VPLEX system.

  • Confirm the SCOM Management Pack is set up with the correct ESI Service host and SSL port 54501.

  • Confirm the ESI Service is running.

  • Confirm the user is in an administrator group (if UAC is enabled, web browser must be launched with Run As Administrator).

  • Check event logs for ESI Service errors.

  • Check that the remote connection to ESI Service uses one of the following:

      • http://<host>:54500/esi/console

      • https://<host>:54501/esi/console

  • Confirm that the firewall settings are correct for both the SCOM agent and ESI Service.

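  • Check that the SSL certificate on the ESI Service host is set up correctly. To list the certificates in the local machine store, run this PowerShell command on the host: Get-ChildItem cert:\LocalMachine\My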

  • Confirm the latest service packs and cumulative updates are deployed on the SCOM agents and clients.

  • Confirm the SCOM server and agent systems meet the minimum system requirements.
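
As a quick first check for the port and firewall items in this list, the following sketch tests whether the two ESI Service ports accept TCP connections from the SCOM agent machine. The function names are hypothetical and the host is a placeholder. A successful connection only shows that the port is reachable; it does not confirm that the ESI Service, its credentials, or its certificate are set up correctly.

```python
import socket

ESI_HTTP_PORT = 54500   # port used in the http console URL above
ESI_HTTPS_PORT = 54501  # SSL port used in the https console URL above


def console_urls(host: str) -> list:
    """Return the two console URLs listed above for the given host."""
    return [
        f"http://{host}:{ESI_HTTP_PORT}/esi/console",
        f"https://{host}:{ESI_HTTPS_PORT}/esi/console",
    ]


def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, time-outs, and name resolution failures.
        return False


if __name__ == "__main__":
    host = "esi-host.example.com"  # hypothetical ESI Service host name
    for port in (ESI_HTTP_PORT, ESI_HTTPS_PORT):
        state = "reachable" if port_reachable(host, port) else "not reachable"
        print(f"{host}:{port} is {state}")
```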

Event 21114 occurs in the Operations Manager event log.

Confirm that the Persistence Version Store Maximum value under HKLM\System\CurrentControlSet\Services\HealthService\Parameters has been changed to 5120 (decimal).

VPLEX system does not appear in SCOM for the SCOM monitoring agent.

Check the event log for Event 104 and confirm that the ESI Service connection information is set up correctly in SCOM.

VPLEX discovery times out.

  • The SCOM agent might be monitoring too many systems. Check the event log for Events 6024, 2114, and 21402.

  • Consider using the system filter file to assign systems to specific SCOM agents.

Sample event logs

The key information is highlighted as bold text in the following event log examples. Events and alerts from the Operations Manager event log on the monitoring agent are also added to the Monitoring Delays, Errors and Timeouts view in the Diagnostics folder.

The following is an example of an event log entry for a connection problem. In this example, Event 21402 occurred because of a disk drive component problem. By locating the problem component class, you can decide which component monitor to troubleshoot and, if needed, change that specific time-out interval override while resolving the problem:

Log Name:      Operations Manager
Source:        Health Service Modules
Date:          9/19/2012 1:24:59 PM
Event ID:      21402
Task Category: None
Level:         Warning
Keywords:      Classic
User:          N/A
Computer:      PATHENDGSCOM.PATHENDG.emc.com
Description:
Forced to terminate the following process started at 1:24:43 PM because it ran past the configured time-out 600 seconds.

Command executed:    "C:\Windows\system32\cscript.exe" /nologo "GetEntityStatus.js" 10.5.222.40 7001 c85793d1108ee9f4c30a970941593d7c966a2748 DiskDrive none none True True false 0
Working Directory:   C:\Program Files\System Center Operations Manager 2007\Health Service State\Monitoring Host Temporary Files 1\126109\

One or more workflows were affected by this.

Workflow name: many
Instance name: many
Instance ID: many
Management group: JerryAir
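
To locate the problem component class in an Event 21402 description, the "Command executed" line can be parsed as sketched below. The function name is hypothetical, and the position of the class (the fourth argument after the script name) is an assumption taken from this one GetEntityStatus.js example. Other scripts, such as the discovery script in the next example, use a different argument layout.

```python
import re

# Matches the quoted .js script name and captures everything after it.
SCRIPT_PATTERN = re.compile(r'"([^"]+\.js)"\s+(.*)$')

SAMPLE = (
    r'Command executed:    "C:\Windows\system32\cscript.exe" /nologo '
    r'"GetEntityStatus.js" 10.5.222.40 7001 '
    r'c85793d1108ee9f4c30a970941593d7c966a2748 DiskDrive none none True True false 0'
)


def parse_command_line(line: str) -> dict:
    """Extract the script name, its arguments, and the component class."""
    match = SCRIPT_PATTERN.search(line)
    if match is None:
        raise ValueError("no quoted .js script found in line")
    args = match.group(2).split()
    return {
        "script": match.group(1),
        "args": args,
        # Assumption: for GetEntityStatus.js the fourth argument is the class.
        "component_class": args[3] if len(args) > 3 else None,
    }


if __name__ == "__main__":
    parsed = parse_command_line(SAMPLE)
    print(parsed["script"], parsed["component_class"])
```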

 

The following is an example of an event logged when resources are overloaded:

Event ID:      21411
Level:         Warning
Source:        Health Service Modules
Description:
The process will be dropped because it has been waiting in the queue for more than 10 minutes.

Command executed:    "%windir%\system32\cscript.exe" /nologo "DiscoverLunStorageServiceNode.js" {934DBB77-5CDA-4EF8-E2D5-37DE605B11A9} {A86B6475-C74D-7AF0-1B69-AEA88050B9EF} ZBSCOM2007.ZBEMC.dev 10.5.222.40 7001 3f08b7000dc65c9f29417af195d75cac12f5ea3e 6bec6ca7f35635f45d6d5f54c6e4d7996f3e37b8 none none True False 0
Working Directory:

One or more workflows were affected by this.

Workflow name: EMC.ESI.LunStorageServiceNodeDiscoveryRule
Instance name: Bus 1 Enclosure 1
Instance ID: {A86B6475-C74D-7AF0-1B69-AEA88050B9EF}
Management group: ZBDEV