Thursday, July 17, 2014

SCOM - Causes of a gray state

Causes of a gray state

An agent, a management server, or a gateway may become unavailable for any of the following reasons:
  • Heartbeat failure
  • Invalid configuration
  • System workflow failures
  • OpsMgr database or data warehouse performance issues
  • Root management server (RMS), primary management server, or gateway performance issues
  • Network or authentication issues
  • Health service issues (service is not running)

Issue scope

Before you begin to troubleshoot the agent "grayed out" issue, you should first understand the Operations Manager topology, and then define the scope of the issue. The following questions may help you to define the scope of the issue:
  • How many agents are affected?
  • Are the affected agents in the same network segment?
  • Do the agents report to the same management server?
  • How often do the agents enter and remain in a gray state?
  • How do you typically recover from this situation (for example, restart the agent health service, clear the cache, rely upon automatic recovery)?
  • Are Heartbeat failure alerts generated for these agents?
  • Does this issue occur during a specific time of the day?
  • Does this issue persist if you fail over these agents to another management server or gateway?
  • When did this problem start?
  • Were any changes made to the agents, the management servers, the gateways, or the management group?
  • Are the affected agents Windows clustered systems?
  • Is the Health Service State folder excluded from antivirus scanning?
  • In which environment does this occur: OpsMgr SP1, R2, or 2012?
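
Several of the scoping questions above come down to finding what the affected agents have in common. The following is a minimal sketch of that idea in Python. It assumes you have already exported an agent inventory by some means; the record fields (`management_server`, `segment`, `gray`) and the sample data are hypothetical and are not an Operations Manager schema or API.

```python
from collections import Counter

# Hypothetical inventory records; the field names are illustrative only.
AGENTS = [
    {"name": "web01", "management_server": "ms1", "segment": "10.0.1.0/24", "gray": True},
    {"name": "web02", "management_server": "ms1", "segment": "10.0.2.0/24", "gray": True},
    {"name": "db01", "management_server": "ms2", "segment": "10.0.1.0/24", "gray": False},
    {"name": "app01", "management_server": "ms1", "segment": "10.0.3.0/24", "gray": True},
]


def scope_summary(agents):
    """Count gray agents overall, per management server, and per segment."""
    gray = [a for a in agents if a["gray"]]
    return {
        "affected": len(gray),
        "total": len(agents),
        "by_management_server": Counter(a["management_server"] for a in gray),
        "by_segment": Counter(a["segment"] for a in gray),
    }


def common_factors(summary):
    """Return the attributes that every affected agent shares, if any."""
    shared = {}
    if summary["affected"] == 0:
        return shared
    for key in ("by_management_server", "by_segment"):
        counts = summary[key]
        if len(counts) == 1:
            shared[key] = next(iter(counts))
    return shared


summary = scope_summary(AGENTS)
print(summary["affected"], "of", summary["total"], "agents are gray")
print(common_factors(summary))
```

With this sample data, all three gray agents report to `ms1` but sit in different segments, which would point toward the management server rather than the network.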

Troubleshooting strategy

Your troubleshooting strategy will be dictated by which component is inactive, where that component falls within the topology, and how widespread the problem is. Consider the following conditions:
  • If the agents that report to a particular management server or gateway are unavailable, troubleshooting should start at the management server or gateway level.
  • If the gateways that report to a particular management server are unavailable, troubleshooting should start at the management server level.
  • For agentless systems, network devices, and UNIX/Linux servers, troubleshooting should start at the agent, management server, or gateway that is monitoring these objects.
  • If all the systems are unavailable, troubleshooting should start at the root management server.
  • Troubleshooting typically starts at the level immediately above the unavailable component.
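
The rules above can be sketched as a small lookup over the topology. This is an illustration only: the `REPORTS_TO` map and its component names are hypothetical, and the choice to walk past a parent that is itself unavailable is an assumption about how to apply "the level immediately above" when several levels are gray at once.

```python
# Hypothetical topology: each component maps to the component it reports to
# (or, for agentless systems and network devices, the component monitoring it).
REPORTS_TO = {
    "rms": None,
    "ms1": "rms",
    "gw1": "ms1",
    "agent-a": "ms1",
    "agent-b": "gw1",
    "switch-1": "agent-a",
}


def starting_points(unavailable, reports_to=REPORTS_TO):
    """Return where to start troubleshooting for a set of gray components."""
    unavailable = set(unavailable)
    root = next(name for name, parent in reports_to.items() if parent is None)

    # If every system below the root is unavailable, start at the root.
    if unavailable >= set(reports_to) - {root}:
        return [root]

    starts = set()
    for name in unavailable:
        current = name
        parent = reports_to[current]
        # Assumption: skip parents that are themselves unavailable.
        while parent is not None and parent in unavailable:
            current = parent
            parent = reports_to[current]
        starts.add(parent if parent is not None else current)
    return sorted(starts)


print(starting_points(["agent-a"]))          # ['ms1']
print(starting_points(["gw1", "agent-b"]))   # ['ms1']
print(starting_points(["switch-1"]))         # ['agent-a']
```

Keeping the topology as a plain parent map makes the "one level up" rule a single dictionary lookup, and the same function covers agents, gateways, and agentless objects.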
