Introduction
This document describes a workaround to recover a Grafana pod that restarts continuously.
Prerequisites
Requirements
Cisco recommends that you have knowledge of these topics:
- Cisco Subscriber Microservices Infrastructure (SMI) Ultra Cloud Core Common Execution Environment (CEE)
- 5G Cloud Native Deployment Platform (CNDP) or SMI-bare-metal (BM) architecture
- Docker and Kubernetes
Components Used
The information in this document is based on these software and hardware versions:
- SMI 2020.02.2.35
- Kubernetes v1.21.0
The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, ensure that you understand the potential impact of any command.
Background Information
What is SMI?
Cisco SMI is a layered stack of cloud technologies and standards that enable microservices-based applications from the Cisco Mobility, Cable, and Broadband Network Gateway (BNG) business units. These applications have similar subscriber management functions and similar datastore requirements.
Attributes:
- The layered cloud stack (technologies and standards) provides top-to-bottom deployments and accommodates current cloud infrastructures.
- All applications share the CEE for non-application functions (data storage, deployment, configuration, telemetry, alarm), which provides a consistent interaction and experience for all customer touchpoints and integration points.
- Applications and the CEE are deployed in microservice containers and are connected with an Intelligent Service Mesh.
- An exposed API for deployment, configuration, and management enables automation.
What is SMI CEE?
- The CEE is a software solution that was developed to monitor mobile and cable applications that are deployed on the SMI. The CEE captures information (key metrics) from the applications in a centralized way for engineers to debug and troubleshoot.
- The CEE is the common set of tools that are installed for all the applications. It comes equipped with a dedicated Ops Center, which provides the Command Line Interface (CLI) and APIs to manage the monitor tools. Only one CEE is available for each cluster.
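As an illustration, these commands show one way to locate the CEE Ops Center service from the master node and connect to its CLI. The service name filter and the SSH port are assumptions that can differ per deployment, so verify them against your own setup:
# Find the CEE Ops Center service in the CEE namespace (namespace is a placeholder)
cloud-user@pod-name-smf-master-1:~$ kubectl get svc -n <cee_namespace> | grep ops-center
# Connect to the Ops Center CLI over SSH (port 2024 is an assumption; use the port exposed in your deployment)
cloud-user@pod-name-smf-master-1:~$ ssh admin@<ops_center_service_ip> -p 2024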
What are CEE Pods?
- A pod is a process that runs on your Kubernetes cluster. A pod encapsulates a granular unit that is known as a container. A pod contains one or multiple containers.
- Kubernetes deploys one or multiple pods on a single node, which can be a physical or virtual machine. Each pod has a discrete identity with an internal IP address and port space. However, the containers within a pod can share storage and network resources. CEE has a number of pods, each with a unique function; Grafana and Postgres are among them.
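For example, you can list the CEE pods and the nodes they run on from the master node. The namespace shown is a placeholder:
# List all CEE pods, their status, and the nodes they are scheduled on
cloud-user@pod-name-smf-master-1:~$ kubectl get pods -n <cee_namespace> -o wide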
What is a Grafana Pod?
The Grafana pod communicates with the Prometheus pod through the Prometheus service, which is named Prometheus.
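As a quick check, you can list the Grafana and Prometheus services that the CEE exposes in its namespace. The namespace and the grep filter are placeholders:
# Show the Grafana and Prometheus services in the CEE namespace
cloud-user@pod-name-smf-master-1:~$ kubectl get svc -n <cee_namespace> | grep -E "grafana|prometheus"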
What is a Postgres Pod?
Postgres supports SQL databases with redundancy to store alerts and Grafana dashboards.
Problem
The Grafana pod restarts regularly, while the Postgres pods run with no problems.
To recover, use this command to delete the Grafana pod manually:
kubectl delete pod <grafana_pod_name> -n <cee_namespace>
Upon deletion, the Grafana pod is recreated and restarted.
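If you do not know the exact pod name, this sketch shows one way to find it and then confirm that the replacement pod comes up. The prompt and namespace are placeholders:
# Find the current Grafana pod name
cloud-user@pod-name-smf-master-1:~$ kubectl get pods -n <cee_namespace> | grep grafana
# Delete it, then confirm the ReplicaSet recreates it and the new pod reaches Running
cloud-user@pod-name-smf-master-1:~$ kubectl delete pod <grafana_pod_name> -n <cee_namespace>
cloud-user@pod-name-smf-master-1:~$ kubectl get pods -n <cee_namespace> -o wide | grep grafana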
If the problem persists, use this CLI command on CEE in order to check the active alerts and identify the issue:
[pod-name-smf/podname] cee# show alerts active summary | include "POD_Res|k8s_grafana"
Example:
[pod-name-smf/podname] cee# show alerts active summary | include "POD_Res|k8s_grafana"
Time Alert Name Description Port Access ID NEState Severity Alert Source
16:26  PCF_POD_Restarted          "Processing Error Alarm"  "Container k8s_grafana_grafana-59768df649-n6x6x_cee-dnrce301_a4ff5711-0e20-4dd4-ae7f-47296c334930_883 of pod grafana-59768df649-n6x6x in namespace cee-dnrce301 has been restarted."  InService  Major     NETX
16:23  PCF_k8s-pod-crashing-loop  "Processing Error Alarm"  "Pod cee-dnrce301/grafana-59768df649-n6x6x (grafana) is restarting 1.03 times / 5 minutes."  InService  Critical  NETX
16:20  PCF_POD_Restarted          "Processing Error Alarm"  "Container k8s_grafana_grafana-59768df649-n6x6x_cee-dnrce301_a4ff5711-0e20-4dd4-ae7f-47296c334930_882 of pod grafana-59768df649-n6x6x in namespace cee-dnrce301 has been restarted."  InService  Major     NETX
16:14  PCF_POD_Restarted          "Processing Error Alarm"  "Container k8s_grafana_grafana-59768df649-n6x6x_cee-dnrce301_a4ff5711-0e20-4dd4-ae7f-47296c334930_881 of pod grafana-59768df649-n6x6x in namespace cee-dnrce301 has been restarted."  InService  Major     NETX
16:08  PCF_POD_Restarted          "Processing Error Alarm"  "Container k8s_grafana_grafana-59768df649-n6x6x_cee-dnrce301_a4ff5711-0e20-4dd4-ae7f-47296c334930_880 of pod grafana-59768df649-n6x6x in namespace cee-dnrce301 has been restarted."  InService  Major     NETX
16:02  PCF_POD_Restarted          "Processing Error Alarm"  "Container k8s_grafana_grafana-59768df649-n6x6x_cee-dnrce301_a4ff5711-0e20-4dd4-ae7f-47296c334930_879 of pod grafana-59768df649-n6x6x in namespace cee-dnrce301 has been restarted."  InService  Major     NETX
15:56  PCF_POD_Restarted          "Processing Error Alarm"  "Container k8s_grafana_grafana-59768df649-n6x6x_cee-dnrce301_a4ff5711-0e20-4dd4-ae7f-47296c334930_878 of pod grafana-59768df649-n6x6x in namespace cee-dnrce301 has been restarted."  InService  Major     NETX
15:53  PCF_k8s-pod-crashing-loop  "Processing Error Alarm"  "Pod cee-dnrce301/grafana-59768df649-n6x6x (grafana) is restarting 1.03 times / 5 minutes."  InService  Critical  NETX
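To narrow down why the container restarts, you can also inspect the pod from the master node. These are generic Kubernetes checks rather than CEE-specific commands, and the container name grafana is an assumption; adjust it to the container names reported by describe:
# Check restart count, recent events, and the last state of the Grafana containers
cloud-user@pod-name-smf-master-1:~$ kubectl describe pod <grafana_pod_name> -n <cee_namespace>
# Review the logs of the previous (crashed) container instance
cloud-user@pod-name-smf-master-1:~$ kubectl logs <grafana_pod_name> -n <cee_namespace> -c grafana --previous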
Workaround
Shut Down CEE
Run these commands from the CEE Ops Center in order to shut down the CEE:
[pod-name-smf/podname] cee# conf
Entering configuration mode terminal
[pod-name-smf/podname] cee(config)# system mode shutdown
[pod-name-smf/podname] cee(config)# commit
Commit complete.
[pod-name-smf/podname] cee(config)# end
Wait for the system to reach 100%.
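One way to track the shutdown progress is shown here. The exact status command and its output format can vary by CEE release, so treat this as a sketch:
# From the CEE Ops Center, check the system status until it reports 100%
[pod-name-smf/podname] cee# show system status
# From the master node, confirm that the CEE pods are terminating (namespace is a placeholder)
cloud-user@pod-name-smf-master-1:~$ kubectl get pods -n <cee_namespace>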
Remove the DB Folders for Postgres Pods
Identify the nodes on which the Postgres pods are scheduled.
In this example, all Postgres pods run on "master-1":
cloud-user@dnup0300-aio-1-master-1:~$ kubectl get pods -n cee-dnrce301 -o wide | grep postgres
postgres-0 1/1 Running 0 35d 10.108.50.28 dnup0300-aio-1-master-1 <none> <none>
postgres-1 1/1 Running 0 35d 10.108.50.47 dnup0300-aio-1-master-1 <none> <none>
postgres-2 1/1 Running 0 35d 10.108.50.102 dnup0300-aio-1-master-1 <none> <none>
For each Postgres pod, a folder is created on its node to store the database, under this path:
/data/<cee-namespace>/data-postgres-<0,1,2>
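Before you remove anything, you can confirm which folders exist on the node. The wildcard pattern is an assumption based on the example that follows; adjust it if your folder names differ:
# List the Postgres data folders on the node that hosts the Postgres pods
cloud-user@dnup0300-aio-1-master-1:~$ ls -ld /data/<cee-namespace>/data-postgres-*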
Remove those folders as shown:
cloud-user@dnup0300-aio-1-master-1:/data/cee-dnrce301$ sudo rm -rf data-postgres-0
cloud-user@dnup0300-aio-1-master-1:/data/cee-dnrce301$ sudo rm -rf data-postgres-1
cloud-user@dnup0300-aio-1-master-1:/data/cee-dnrce301$ sudo rm -rf data-postgres-2
Note: There could be cases where the folders "/data/<cee-namespace>/data-postgres-<0,1,2>" are created on different nodes, such as master-1, master-2, master-3, and so on. Remove each folder on the node where it was created.
Restore CEE
Log in to the CEE Ops Center and run these CLI commands in order to restore the CEE to running mode:
[pod-name-smf/podname] cee# conf
Entering configuration mode terminal
[pod-name-smf/podname] cee(config)# system mode running
[pod-name-smf/podname] cee(config)# commit
Commit complete.
[pod-name-smf/podname] cee(config)# end
[pod-name-smf/podname] cee# exit
Wait for the system to reach 100%.
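As an additional check, you can watch the Postgres and Grafana pods get recreated while the system returns to running mode. The namespace is a placeholder:
# Confirm the Postgres and Grafana pods are recreated and reach Running
cloud-user@pod-name-smf-master-1:~$ kubectl get pods -n <cee_namespace> -o wide | grep -E "postgres|grafana"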
Post Checks
Verify Kubernetes from the Master
Run these commands from the master node in order to check the status of the Grafana pod and the other pods:
cloud-user@pod-name-smf-master-1:~$ kubectl get pods -A -o wide | grep grafana
cloud-user@pod-name-smf-master-1:~$ kubectl get pods -A -o wide
All pods must be in the Running state without any new restarts.
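As a further check, you can list any pods that are not healthy and confirm that the Grafana restart counter no longer increases. These are generic Kubernetes commands; the namespace is a placeholder:
# Show any pods that are not in the Running or Completed state across all namespaces
cloud-user@pod-name-smf-master-1:~$ kubectl get pods -A | grep -vE "Running|Completed"
# Recheck the Grafana pod restart count after several minutes
cloud-user@pod-name-smf-master-1:~$ kubectl get pods -n <cee_namespace> | grep grafana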
Verify Alerts are Cleared from CEE
Run this command to confirm that the alerts are cleared from CEE:
[pod-name-smf/podname] cee# show alerts active summary | include "POD_Res|k8s_grafana"
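If no output is returned, no matching active alerts remain. Optionally, you can also review the alert history for the same patterns; the command form shown here is an assumption and can vary by CEE release:
[pod-name-smf/podname] cee# show alerts history summary | include "POD_Res|k8s_grafana"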