.. uof1579701912856
.. _200-series-alarm-messages:

=========================
200 Series Alarm Messages
=========================

The system inventory and maintenance service reports system changes with
different degrees of severity. Use the reported alarms to monitor the overall
health of the system.

.. include:: ../_includes/x00-series-alarm-messages.rest
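
The alarms in this series, like all platform alarms, can be reviewed from the
command line. The following is a minimal sketch, assuming an authenticated
Keystone admin session on the active controller; the exact output columns
depend on your release:

.. code-block:: none

   # List all currently active alarms, including 200-series host alarms
   ~(keystone_admin)]$ fm alarm-list

   # Show the full details of a specific alarm instance by its UUID
   ~(keystone_admin)]$ fm alarm-show <uuid>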

.. _200-series-alarm-messages-table-zrd-tg5-v5:

.. list-table::
   :widths: 6 15
   :header-rows: 0

   * - **Alarm ID: 200.001**
     - <hostname> was administratively locked to take it out-of-service.
   * - Entity Instance
     - host=<hostname>
   * - Degrade Affecting Severity:
     - None
   * - Severity:
     - W\*
   * - Proposed Repair Action
     - Administratively unlock Host to bring it back in-service.
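
As a sketch of the proposed repair action above, the locked host can be
returned to service with the ``system`` CLI; ``controller-1`` is an example
hostname only:

.. code-block:: none

   # Administratively unlock the host to bring it back in-service
   ~(keystone_admin)]$ system host-unlock controller-1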

-----

.. list-table::
   :widths: 6 15
   :header-rows: 0

   * - **Alarm ID: 200.004**
     - <hostname> experienced a service-affecting failure.
       Host is being auto recovered by Reboot.
   * - Entity Instance
     - host=<hostname>
   * - Degrade Affecting Severity:
     - None
   * - Severity:
     - C\*
   * - Proposed Repair Action
     - If auto-recovery is consistently unable to recover host to the
       unlocked-enabled state contact next level of support or lock and
       replace failing host.

-----

.. list-table::
   :widths: 6 15
   :header-rows: 0

   * - **Alarm ID: 200.005**
     - Degrade:

       <hostname> is experiencing intermittent 'Management Network'
       communication failures that have exceeded its lower alarming threshold.

       Failure:

       <hostname> is experiencing a persistent Critical 'Management Network'
       communication failure.
   * - Entity Instance
     - host=<hostname>
   * - Degrade Affecting Severity:
     - None
   * - Severity:
     - M\* (Degrade) or C\* (Failure)
   * - Proposed Repair Action
     - Check 'Management Network' connectivity and support for multicast
       messaging. If problem consistently occurs after that and Host is reset,
       then contact next level of support or lock and replace failing host.

-----

.. list-table::
   :widths: 6 15
   :header-rows: 0

   * - **Alarm ID: 200.006**
     - Main Process Monitor Daemon Failure \(Major\):

       <hostname> 'Process Monitor' \(pmond\) process is not running or
       functioning properly. The system is trying to recover this process.

       Monitored Process Failure \(Critical/Major/Minor\):

       Critical: <hostname> Critical '<processname>' process has failed and
       could not be auto-recovered gracefully. Auto-recovery progression by
       host reboot is required and in progress.

       Major: <hostname> is degraded due to the failure of its '<processname>'
       process. Auto recovery of this Major process is in progress.

       Minor:

       <hostname> '<processname>' process has failed. Auto recovery of this
       Minor process is in progress.

       <hostname> '<processname>' process has failed. Manual recovery is
       required.

       ptp4l/phc2sys process failure. Manual recovery is required.
   * - Entity Instance
     - host=<hostname>.process=<processname>
   * - Degrade Affecting Severity:
     - Major
   * - Severity:
     - C/M/m\*
   * - Proposed Repair Action
     - If this alarm does not automatically clear after some time and continues
       to be asserted after Host is locked and unlocked then contact next level
       of support for root cause analysis and recovery.

       If problem consistently occurs after Host is locked and unlocked then
       contact next level of support for root cause analysis and recovery.

-----

.. list-table::
   :widths: 6 15
   :header-rows: 0

   * - **Alarm ID: 200.007**
     - Critical \(with host degrade\):

       Host is degraded due to a 'Critical' out-of-tolerance reading from the
       '<sensorname>' sensor.

       Major \(with host degrade\):

       Host is degraded due to a 'Major' out-of-tolerance reading from the
       '<sensorname>' sensor.

       Minor:

       Host is reporting a 'Minor' out-of-tolerance reading from the
       '<sensorname>' sensor.
   * - Entity Instance
     - host=<hostname>.sensor=<sensorname>
   * - Degrade Affecting Severity:
     - Critical
   * - Severity:
     - C/M/m
   * - Proposed Repair Action
     - If problem consistently occurs after Host is power cycled and/or reset,
       contact next level of support or lock and replace failing host.

-----

.. list-table::
   :widths: 6 15
   :header-rows: 0

   * - **Alarm ID: 200.009**
     - Degrade:

       <hostname> is experiencing intermittent 'Cluster-host Network'
       communication failures that have exceeded its lower alarming threshold.

       Failure:

       <hostname> is experiencing a persistent Critical 'Cluster-host Network'
       communication failure.
   * - Entity Instance
     - host=<hostname>
   * - Degrade Affecting Severity:
     - None
   * - Severity:
     - M\* (Degrade) or C\* (Failure)
   * - Proposed Repair Action
     - Check 'Cluster-host Network' connectivity and support for multicast
       messaging. If problem consistently occurs after that and Host is reset,
       then contact next level of support or lock and replace failing host.

-----

.. list-table::
   :widths: 6 15
   :header-rows: 0

   * - **Alarm ID: 200.010**
     - <hostname> access to board management module has failed.
   * - Entity Instance
     - host=<hostname>
   * - Degrade Affecting Severity:
     - None
   * - Severity:
     - W
   * - Proposed Repair Action
     - Check Host's board management configuration and connectivity.

-----

.. list-table::
   :widths: 6 15
   :header-rows: 0

   * - **Alarm ID: 200.011**
     - <hostname> experienced a configuration failure during initialization.
       Host is being re-configured by Reboot.
   * - Entity Instance
     - host=<hostname>
   * - Degrade Affecting Severity:
     - None
   * - Severity:
     - C\*
   * - Proposed Repair Action
     - If auto-recovery is consistently unable to recover host to the
       unlocked-enabled state contact next level of support or lock and
       replace failing host.

-----

.. list-table::
   :widths: 6 15
   :header-rows: 0

   * - **Alarm ID: 200.012**
     - <hostname> controller function has an in-service failure while compute
       services remain healthy.
   * - Entity Instance
     - host=<hostname>
   * - Degrade Affecting Severity:
     - Major
   * - Severity:
     - C\*
   * - Proposed Repair Action
     - Lock and then Unlock host to recover. Avoid using 'Force Lock' action
       as that will impact compute services running on this host. If lock
       action fails then contact next level of support to investigate and
       recover.
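
A minimal sketch of the lock and unlock recovery described above, using an
example hostname; a graceful lock is used because a forced lock would impact
compute services on the host:

.. code-block:: none

   # Gracefully lock the affected host (do not force the lock)
   ~(keystone_admin)]$ system host-lock controller-0

   # Unlock it again once the lock operation completes
   ~(keystone_admin)]$ system host-unlock controller-0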

-----

.. list-table::
   :widths: 6 15
   :header-rows: 0

   * - **Alarm ID: 200.013**
     - <hostname> compute service of the only available controller is not
       operational. Auto-recovery is disabled. Degrading host instead.
   * - Entity Instance
     - host=<hostname>
   * - Degrade Affecting Severity:
     - Major
   * - Severity:
     - M\*
   * - Proposed Repair Action
     - Enable second controller and Switch Activity \(Swact\) over to it as
       soon as possible. Then Lock and Unlock host to recover its local compute
       service.
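
As a sketch of the repair action above, once the second controller is
available you can switch activity away from the affected controller and then
recover its compute service; hostnames are examples only:

.. code-block:: none

   # Switch activity (Swact) away from the affected active controller
   ~(keystone_admin)]$ system host-swact controller-0

   # Then lock and unlock the affected host to recover its compute service
   ~(keystone_admin)]$ system host-lock controller-0
   ~(keystone_admin)]$ system host-unlock controller-0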

-----

.. list-table::
   :widths: 6 15
   :header-rows: 0

   * - **Alarm ID: 200.014**
     - The Hardware Monitor was unable to load, configure and monitor one
       or more hardware sensors.
   * - Entity Instance
     - host=<hostname>
   * - Degrade Affecting Severity:
     - None
   * - Severity:
     - m
   * - Proposed Repair Action
     - Check Board Management Controller provisioning. Try reprovisioning the
       BMC. If problem persists try power cycling the host and then the entire
       server including the BMC power. If problem persists then contact next
       level of support.
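
Board management provisioning can be reviewed and updated from the command
line. This is a sketch only; the ``bm_`` parameter names and supported
``bm_type`` values vary by release and BMC type:

.. code-block:: none

   # Review the current board management settings for the host
   ~(keystone_admin)]$ system host-show compute-0 | grep bm_

   # Reprovision the BMC connection details (example values shown)
   ~(keystone_admin)]$ system host-update compute-0 bm_type=ipmi \
     bm_ip=10.10.10.100 bm_username=admin bm_password=<password>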

-----

.. list-table::
   :widths: 6 15
   :header-rows: 0

   * - **Alarm ID: 200.015**
     - Unable to read one or more sensor groups from this host's board
       management controller.
   * - Entity Instance
     - host=<hostname>
   * - Degrade Affecting Severity:
     - None
   * - Severity:
     - M
   * - Proposed Repair Action
     - Check board management connectivity and try rebooting the board
       management controller. If problem persists contact next level of
       support or lock and replace failing host.

-----

.. list-table::
   :widths: 6 15
   :header-rows: 0

   * - **Alarm ID: 210.001**
     - System Backup in progress.
   * - Entity Instance
     - host=controller
   * - Degrade Affecting Severity:
     - None
   * - Severity:
     - m\*
   * - Proposed Repair Action
     - No action required.

-----

.. list-table::
   :widths: 6 15
   :header-rows: 0

   * - **Alarm ID: 250.001**
     - <hostname> Configuration is out-of-date.
   * - Entity Instance
     - host=<hostname>
   * - Degrade Affecting Severity:
     - None
   * - Severity:
     - M\*
   * - Proposed Repair Action
     - Administratively lock and unlock <hostname> to update config.

-----

.. list-table::
   :widths: 6 15
   :header-rows: 0

   * - **Alarm ID: 250.003**
     - Kubernetes certificates rotation failed on host <hostname>.
   * - Entity Instance
     - host=<hostname>
   * - Degrade Affecting Severity:
     - None
   * - Severity:
     - M/w
   * - Proposed Repair Action
     - Rotate kubernetes certificates manually.
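
The manual rotation procedure is release specific; consult the platform
documentation for Kubernetes certificate management. As a generic sketch only,
assuming the cluster certificates are managed by ``kubeadm``:

.. code-block:: none

   # Check how long the current control-plane certificates remain valid
   $ sudo kubeadm certs check-expiration

   # Renew all kubeadm-managed certificates; control-plane components must
   # then be restarted to pick up the renewed certificates
   $ sudo kubeadm certs renew all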

-----

.. list-table::
   :widths: 6 15
   :header-rows: 0

   * - **Alarm ID: 270.001**
     - Host <host\_name> compute services failure\[, reason = <reason\_text>\]
   * - Entity Instance
     - host=<host\_name>.services=compute
   * - Degrade Affecting Severity:
     - None
   * - Severity:
     - C\*
   * - Proposed Repair Action
     - Wait for host services recovery to complete; if problem persists contact
       next level of support.

-----

.. list-table::
   :widths: 6 15
   :header-rows: 0

   * - **Alarm ID: 280.001**
     - <subcloud> is offline.
   * - Entity Instance
     - subcloud=<subcloud>
   * - Degrade Affecting Severity:
     - None
   * - Severity:
     - C\*
   * - Proposed Repair Action
     - Wait for subcloud to become online; if problem persists contact next
       level of support.
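
While waiting for the subcloud to recover, its availability can be checked
from the System Controller with the distributed cloud manager CLI;
``subcloud1`` is an example name:

.. code-block:: none

   # List all subclouds with their availability and sync status
   ~(keystone_admin)]$ dcmanager subcloud list

   # Show details for the affected subcloud
   ~(keystone_admin)]$ dcmanager subcloud show subcloud1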

-----

.. list-table::
   :widths: 6 15
   :header-rows: 0

   * - **Alarm ID: 280.002**
     - <subcloud> <resource> sync status is out-of-sync.
   * - Entity Instance
     - \[subcloud=<subcloud>.resource=<compute> \| <network> \| <platform>
       \| <volumev2>\]
   * - Degrade Affecting Severity:
     - None
   * - Severity:
     - M\*
   * - Proposed Repair Action
     - If problem persists contact next level of support.