System Health Check

Proactively monitoring the systems in a network helps you detect potential issues early and take preventive action. This chapter describes how to configure and monitor the system health check.

System Health Check

Proactive monitoring of network systems plays a pivotal role in averting issues. The NCS 1014 health check service lets you monitor physical characteristics, current processing status, and currently utilized resources so that you can quickly assess the condition of the device at any time. The service analyzes system health by monitoring and tracking metrics that are critical to the functioning of the NCS 1014. Thresholds are set on the device for these metrics in order to monitor the usage of CPU and other system resources. The health check service is installed with the NCS 1014 RPM.

You can evaluate the system's health by examining the metric values. If these values cross or approach the set thresholds, it suggests potential problems. By default, metrics for system resources are configured with preset threshold values. You can customize the metrics to be monitored by disabling or enabling metrics of interest based on your requirement.

Each metric is tracked and compared with its configured thresholds, and the state of the resource is classified accordingly.

The system resources metrics can be in one of these states:

  • Normal: The resource usage is less than the minor threshold value.

  • Minor: The resource usage is more than the minor threshold, but less than the severe threshold value.

  • Severe: The resource usage is more than the severe threshold, but less than the critical threshold value.

  • Critical: The resource usage is more than the critical threshold value.
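
The four states above can be sketched as a small classification function. This is only an illustration in Python, not the device's implementation: the names Thresholds and classify_usage are hypothetical, and treating a value exactly equal to a threshold as the lower state is an assumption based on the "more than" wording. The threshold values are the cpu values from the sample show healthcheck status output in this chapter. The free-memory thresholds in that output decrease from minor to critical, so that metric presumably compares in the opposite direction and is not covered here.

```python
from typing import NamedTuple


class Thresholds(NamedTuple):
    """Hypothetical container for the three usage thresholds, in percent."""
    minor: float
    severe: float
    critical: float


def classify_usage(usage_pct: float, t: Thresholds) -> str:
    """Map a usage percentage to one of the four states.

    Assumption: a value exactly equal to a threshold stays in the lower
    state, because each state is described as "more than" its threshold.
    """
    if usage_pct > t.critical:
        return "Critical"
    if usage_pct > t.severe:
        return "Severe"
    if usage_pct > t.minor:
        return "Minor"
    return "Normal"


# Illustrative thresholds (same values as the cpu entry in the sample output).
cpu = Thresholds(minor=20, severe=50, critical=75)
for sample in (12, 35, 60, 90):
    print(sample, classify_usage(sample, cpu))
```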

The infrastructure services metrics can be in one of these states:

  • Normal: The resource operation is as expected.

  • Warning: The resource needs attention. For example, a warning is displayed when the FPD needs an upgrade.

Supported System Health Check Metrics

NCS 1014 supports the following system health check metrics:

  • communication-timeout

  • cpu

  • filesystem

  • fpd

  • free-mem

  • hw-monitoring

  • lc-monitoring

  • pci-monitoring

  • platform

  • process-resource

  • process-status

  • shared-mem

  • wd-monitoring

Enable Health Check

To enable health check, perform the following steps:

Before you begin

Before enabling health check, ensure that:

  • An IP address and subnet mask are assigned to the management interface.

  • The IP address of the default gateway is configured with a static route.

For more details, see the Configure Management Interface section of the Cisco NCS 1014 System Setup and Software Installation Guide.

Procedure


Step 1

Enter configuration mode using the configure command.

Step 2

Enable health check using the healthcheck enable command.

Example:

RP/0/RP0/CPU0:ios(config)# healthcheck enable

Step 3

Run the netconf-yang agent ssh command.

Example:

RP/0/RP0/CPU0:ios(config)# netconf-yang agent ssh

Step 4

Enable Google Remote Procedure Call (gRPC) using the grpc local-connection command.

Example:

RP/0/RP0/CPU0:ios(config)# grpc local-connection

Step 5

Commit the changes using the commit command.
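
The five steps above can be collected into one ordered command list, for example to paste into a session or to feed into an automation tool. The following Python sketch only assembles the text of the commands named in the procedure. It does not connect to a device, and the names ENABLE_STEPS and render_session are hypothetical.

```python
from typing import List, Tuple

# (command, purpose) pairs taken from Steps 1 to 5 of the procedure.
ENABLE_STEPS: List[Tuple[str, str]] = [
    ("configure", "enter configuration mode"),
    ("healthcheck enable", "enable the health check service"),
    ("netconf-yang agent ssh", "run the NETCONF agent command"),
    ("grpc local-connection", "enable gRPC"),
    ("commit", "commit the changes"),
]


def render_session(steps: List[Tuple[str, str]]) -> str:
    """Join the commands into a block that can be pasted into a CLI session."""
    return "\n".join(command for command, _purpose in steps)


print(render_session(ENABLE_STEPS))
```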


Change Health Check Refresh Time

Cadence is the time interval, in seconds, at which the health check status is refreshed. By default, the cadence is 60 seconds, which means that the health check status is updated every 60 seconds. You can change the cadence using the healthcheck cadence cadence-value command.

The following example shows how to change the health check cadence value to 50 seconds so that the health check status is updated every 50 seconds.
RP/0/RP0/CPU0:ios(config)#healthcheck cadence 50

View Status of All Metrics

You can view the status of all the supported metrics with the associated threshold and configured parameters in the system. To check the status of all the metrics, perform these steps:

Procedure


Step 1

Run the show healthcheck status command.

Example:

RP/0/RP0/CPU0:ios#show healthcheck status
Sat Jun 12 02:00:25.204 UTC

Healthcheck status: Enabled
Time started: 12 Jun 02:00:22.392972

Collector Cadence: 30 seconds

METRICS STATS 

System Resource metrics
   cpu
       Thresholds: Minor: 20%
                  Severe: 50%
                Critical: 75%

       Tracked CPU utilization: 15 min avg utilization

   free-memory
       Thresholds: Minor: 10%
                  Severe: 8%
                Critical: 5%

   filesystem
       Thresholds: Minor: 80%
                  Severe: 95%
                Critical: 99%
          
   shared-memory
       Thresholds: Minor: 80%
                  Severe: 95%
                Critical: 99%
          
Infra Services metrics
   platform
          
   fpd    
          
Install Custom Metrics
   process-status
          
   process-resource
          
   communication-timeout
          
   pci-monitoring
          
   hw-monitoring
          
   wd-monitoring
          
   lc-monitoring
          
Use case  
 Use cases are disabled

Step 2

To view the health state of the health check manager, use the show healthcheck internal states command.

Example:

RP/0/RP0/CPU0:ios#show healthcheck internal states 
Sat Jun 12 02:00:55.425 UTC

 Internal Structure INFO 

 Current state: Enabled 

 Reason: Success 

 Netconf Config State: Enabled 

 Grpc Config State: Enabled 

 Nosi state: Initialized 

 Appmgr conn state: Connected 

 Nosi lib state: Not ready 

 Nosi client: Valid client  

Step 3

To view the health state for each enabled metric, use the show healthcheck report command.

Example:

RP/0/RP0/CPU0:ios#show healthcheck report
Sat Jun 12 02:02:54.417 UTC

Healthcheck report
Last Update Time: 12 Jun 02:02:46.955241 
METRICS REPORT 

cpu
  State: Normal

free-memory
  State: Normal

filesystem
  State: Normal

shared-memory
  State: Normal

platform
  State: Warning
  Reason: One or more devices are not in operational state

fpd
  State: Warning
  Reason: One or more FPDs are not in CURRENT state
          
process-status
  State: Normal
          
process-resource
  State: Normal
          
communication-timeout
  State: Normal
          
pci-monitoring
  State: Normal
          
hw-monitoring
  State: Normal
          
wd-monitoring
  State: Normal
          
lc-monitoring
  State: Normal

In the above output, the fpd metric is in the Warning state, which indicates that an FPD upgrade is required.
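
When the report is collected as text, the metrics that need attention can be picked out with a few lines of parsing. The following Python sketch assumes the report layout shown above (a metric name on its own line, followed by an indented State line and an optional Reason line). The function name parse_report and the sample string are hypothetical, and the layout may differ between software releases.

```python
from typing import Dict, Optional


def parse_report(text: str) -> Dict[str, Dict[str, str]]:
    """Parse metric blocks of the form: name, 'State: ...', optional 'Reason: ...'."""
    metrics: Dict[str, Dict[str, str]] = {}
    current: Optional[str] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("State:") and current:
            metrics[current] = {"state": line.split(":", 1)[1].strip()}
        elif line.startswith("Reason:") and current in metrics:
            metrics[current]["reason"] = line.split(":", 1)[1].strip()
        else:
            # Any other non-empty line is a candidate metric name; it is
            # only recorded if a State line follows it.
            current = line
    return metrics


# Shortened copy of the sample report above.
REPORT_SAMPLE = """\
Healthcheck report
Last Update Time: 12 Jun 02:02:46.955241
METRICS REPORT

cpu
  State: Normal

platform
  State: Warning
  Reason: One or more devices are not in operational state

fpd
  State: Warning
  Reason: One or more FPDs are not in CURRENT state
"""

report = parse_report(REPORT_SAMPLE)
needs_attention = {n: i for n, i in report.items() if i["state"] != "Normal"}
for name, info in needs_attention.items():
    print(f"{name}: {info['state']} ({info.get('reason', 'no reason given')})")
```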

Change Threshold Value for a Metric

You can customize the health check threshold value for a metric using the following command:

healthcheck metric metric-name threshold threshold-value

Example to Change Preset Metric Value

The following example shows how to change the minor threshold value of the cpu metric to 25%.


RP/0/RP0/CPU0:ios(config)#healthcheck metric cpu minor threshold 25%

View Health Status of Individual Metric

You can view the health status of a system resource or infrastructure service metric in the system.

Procedure


Run the show healthcheck metric metric-name command.

Example:

The following example shows how to obtain the health-check status for the filesystem metric:
RP/0/RP0/CPU0:ios#show healthcheck metric filesystem
Sat Jun 12 02:01:32.432 UTC
Filesystem Metric State: Normal
Last Update Time: 12 Jun 02:01:04.446619
Filesystem Service State: Enabled
Number of Active Nodes: 1
Configured Thresholds:
   Minor: 80%
   Severe: 95%
   Critical: 99%

Node Name: 0/RP0/CPU0
    Partition Count: 5

    Partition Name: tftp:
        Partition Access Attribute: rw
        Partition Type: network
        Partition Size: 0
        Partition Free Bytes: 0
        Partition Free Space in %: 0

    Partition Name: disk0:
        Partition Access Attribute: rw
        Partition Type: flash-disk
        Partition Size: 20024897536
        Partition Free Bytes: 19978481664
        Partition Free Space in %: 99
          
    Partition Name: /misc/config
        Partition Access Attribute: rw
        Partition Type: flash
        Partition Size: 151314698240
        Partition Free Bytes: 146903269376
        Partition Free Space in %: 97
          
    Partition Name: harddisk:
        Partition Access Attribute: rw
        Partition Type: harddisk
        Partition Size: 150114078720
        Partition Free Bytes: 144962641920
        Partition Free Space in %: 96
          
    Partition Name: ftp:
        Partition Access Attribute: rw
        Partition Type: network
        Partition Size: 0
        Partition Free Bytes: 0
        Partition Free Space in %: 0
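
The Partition Free Space in % values in this output are consistent with dividing free bytes by partition size and rounding down. The following Python sketch reproduces that arithmetic for the partitions shown above. The rounding rule and the handling of zero-size network partitions are inferred from this one sample, not from a documented formula, and the function name free_space_pct is hypothetical.

```python
def free_space_pct(size_bytes: int, free_bytes: int) -> int:
    """Whole-number free space percentage, rounded down; 0 for empty partitions."""
    if size_bytes == 0:
        return 0
    return free_bytes * 100 // size_bytes


# (size, free) byte counts copied from the sample output above.
PARTITIONS = {
    "tftp:": (0, 0),
    "disk0:": (20024897536, 19978481664),
    "/misc/config": (151314698240, 146903269376),
    "harddisk:": (150114078720, 144962641920),
}

for name, (size, free) in PARTITIONS.items():
    print(f"{name:<14} free {free_space_pct(size, free)}%")
```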

Example:

The following example shows how to obtain the health-check status for the platform metric:
RP/0/RP0/CPU0:ios#show healthcheck metric platform
Sat Jun 12 02:01:51.922 UTC
Platform Metric State: Warning
Last Update Time: 12 Jun 02:01:38.650003
Platform Service State: Enabled
Number of Racks: 1

Rack Name: 0
    Number of Slots: 5

    Slot Name: RP0
        Number of Instances: 1

    Instance Name: CPU0
        Node Name 0/RP0/CPU0
        Card Type NCS1K14-CNTLR-K9
        Card Redundancy State Active
        Admin State NSHUT,NMON
        Oper State IOS XR RUN

    Slot Name: PM1
        Number of Instances: 0

        Node Name 0/PM1
        Card Type NCS1K4-AC-PSU-2
        Card Redundancy State None
        Admin State NSHUT,NMON
        Oper State OPERATIONAL
          
    Slot Name: FT1
        Number of Instances: 0
          
        Node Name 0/FT1
        Card Type NCS1K14-FAN
        Card Redundancy State None
        Admin State NSHUT,NMON
        Oper State OPERATIONAL
          
    Slot Name: FT2
        Number of Instances: 0
          
        Node Name 0/FT2
        Card Type NCS1K14-FAN
        Card Redundancy State None
        Admin State NSHUT,NMON
        Oper State OPERATIONAL
          
    Slot Name: 2
        Number of Instances: 1
          
    Instance Name: NXR0
        Node Name 0/2/NXR0
        Card Type NCS1K4-1.2T-K9
        Card Redundancy State None
        Admin State NSHUT,NMON
        Oper State CARD FAILED
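
In this output, the Warning state of the platform metric can be traced to the node whose Oper State is CARD FAILED. The following Python sketch shows one way to list such nodes from the text output. It assumes that each Node Name line is followed by an Oper State line, as in the sample, and it treats only OPERATIONAL and IOS XR RUN as healthy because those are the only non-failing states that appear here. A real device may report other states. The names HEALTHY_STATES and nodes_needing_attention are hypothetical.

```python
from typing import List, Optional, Tuple

# Assumption: only the non-failing states seen in the sample output.
HEALTHY_STATES = {"OPERATIONAL", "IOS XR RUN"}


def nodes_needing_attention(text: str) -> List[Tuple[str, str]]:
    """Return (node, oper_state) pairs whose state is not in HEALTHY_STATES."""
    result: List[Tuple[str, str]] = []
    node: Optional[str] = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Node Name"):
            node = line[len("Node Name"):].strip()
        elif line.startswith("Oper State") and node is not None:
            state = line[len("Oper State"):].strip()
            if state not in HEALTHY_STATES:
                result.append((node, state))
            node = None
    return result


# Shortened copy of the sample output above.
PLATFORM_SAMPLE = """\
    Instance Name: CPU0
        Node Name 0/RP0/CPU0
        Card Type NCS1K14-CNTLR-K9
        Oper State IOS XR RUN

        Node Name 0/PM1
        Card Type NCS1K4-AC-PSU-2
        Oper State OPERATIONAL

    Instance Name: NXR0
        Node Name 0/2/NXR0
        Card Type NCS1K4-1.2T-K9
        Oper State CARD FAILED
"""

for node, state in nodes_needing_attention(PLATFORM_SAMPLE):
    print(f"{node}: {state}")
```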

Disable Health Check

You can disable the health check service or disable health check for an individual metric. By default, health check is enabled for all metrics.

Disable Health Check Service

To disable health check service, use the following command:

no healthcheck enable


Note


When the health check service is enabled, other configuration changes are not permitted. Disable the service before committing configuration changes.


The following example shows how to disable the health check service.

RP/0/RP0/CPU0:ios#configure
RP/0/RP0/CPU0:ios(config)#no healthcheck enable
RP/0/RP0/CPU0:ios(config)#commit

Disable Health Check for a Metric

To disable health check for an individual metric, use the following command:

healthcheck metric metric-name disable

Example to Disable Health Check of a Metric

The following example shows how to disable the free memory (free-mem) metric.


RP/0/RP0/CPU0:ios(config)#healthcheck metric free-mem disable