Introduction
This document describes a workaround to recover a Grafana pod that restarts continuously.
Prerequisites
Requirements
Cisco recommends that you have knowledge of these topics:
- Cisco Subscriber Microservices Infrastructure (SMI) Ultra Cloud Core Common Execution Environment (CEE)
- 5G Cloud Native Deployment Platform (CNDP) or SMI-bare-metal (BM) architecture
- Docker and Kubernetes
Components Used
The information in this document is based on these software and hardware versions:
- SMI 2020.02.2.35
- Kubernetes v1.21.0
The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, ensure that you understand the potential impact of any command.
Background Information
What is SMI?
Cisco SMI is a layered stack of cloud technologies and standards that enable microservices-based applications from the Cisco Mobility, Cable, and Broadband Network Gateway (BNG) business units. These applications have similar subscriber management functions and similar datastore requirements.
Attributes:
- The layered cloud stack (technologies and standards) provides top-to-bottom deployments and accommodates current cloud infrastructures.
- All applications share the CEE for non-application functions (data storage, deployment, configuration, telemetry, alarm), which provides a consistent interaction and experience for all customer touchpoints and integration points.
- Applications and the CEE are deployed in microservice containers and are connected with an Intelligent Service Mesh.
- An exposed API for deployment, configuration, and management enables automation.
What is SMI CEE?
- The CEE is a software solution that was developed to monitor mobile and cable applications that are deployed on the SMI. The CEE captures information (key metrics) from the applications in a centralized way for engineers to debug and troubleshoot.
- The CEE is the common set of tools that are installed for all the applications. It comes equipped with a dedicated Ops Center, which provides the Command Line Interface (CLI) and APIs to manage the monitor tools. Only one CEE is available for each cluster.
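As an illustration, these commands show one way to locate the CEE Ops Center service from the master node and connect to its CLI. The service name filter and the SSH port are assumptions that can differ per deployment, so verify them against your own setup:
# Find the CEE Ops Center service in the CEE namespace (namespace is a placeholder)
cloud-user@pod-name-smf-master-1:~$ kubectl get svc -n <cee_namespace> | grep ops-center
# Connect to the Ops Center CLI over SSH (port 2024 is an assumption; use the port exposed in your deployment)
cloud-user@pod-name-smf-master-1:~$ ssh admin@<ops_center_service_ip> -p 2024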
What are CEE Pods?
- A pod is a process that runs on your Kubernetes cluster. A pod encapsulates a granular unit that is known as a container. A pod contains one or multiple containers.
- Kubernetes deploys one or multiple pods on a single node, which can be a physical or virtual machine. Each pod has a discrete identity with an internal IP address and port space. However, the containers within a pod can share storage and network resources. CEE has a number of pods, each with a unique function; Grafana and Postgres are among them.
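For example, you can list the CEE pods and the nodes they run on from the master node. The namespace shown is a placeholder:
# List all CEE pods, their status, and the nodes they are scheduled on
cloud-user@pod-name-smf-master-1:~$ kubectl get pods -n <cee_namespace> -o wide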
What is a Grafana Pod?
The Grafana pod communicates with the Prometheus pod through the Prometheus service, which is named Prometheus.
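As a quick check, you can list the Grafana and Prometheus services that the CEE exposes in its namespace. The namespace and the grep filter are placeholders:
# Show the Grafana and Prometheus services in the CEE namespace
cloud-user@pod-name-smf-master-1:~$ kubectl get svc -n <cee_namespace> | grep -E "grafana|prometheus"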
What is a Postgres Pod?
Postgres supports SQL databases with redundancy to store alerts and Grafana dashboards.
Problem
The Grafana pod restarts regularly, while the Postgres pods run with no problems.
To recover, use this command to delete the Grafana pod manually:
kubectl delete pod <grafana_pod_name> -n <cee_namespace>
Upon deletion, the Grafana pod is recreated and restarted.
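If you do not know the exact pod name, this sketch shows one way to find it and then confirm that the replacement pod comes up. The prompt and namespace are placeholders:
# Find the current Grafana pod name
cloud-user@pod-name-smf-master-1:~$ kubectl get pods -n <cee_namespace> | grep grafana
# Delete it, then confirm the ReplicaSet recreates it and the new pod reaches Running
cloud-user@pod-name-smf-master-1:~$ kubectl delete pod <grafana_pod_name> -n <cee_namespace>
cloud-user@pod-name-smf-master-1:~$ kubectl get pods -n <cee_namespace> -o wide | grep grafana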
If the problem persists, use this CLI command on CEE in order to check the active alerts and identify the issue:
[pod-name-smf/podname] cee# show alerts active summary | include "POD_Res|k8s_grafana"
Example:
[pod-name-smf/podname] cee# show alerts active summary | include "POD_Res|k8s_grafana"
Time Alert Name Description Port Access ID NEState Severity Alert Source
16:26  PCF_POD_Restarted          "Processing Error Alarm"  "Container k8s_grafana_grafana-59768df649-n6x6x_cee-dnrce301_a4ff5711-0e20-4dd4-ae7f-47296c334930_883 of pod grafana-59768df649-n6x6x in namespace cee-dnrce301 has been restarted."  InService  Major     NETX
16:23  PCF_k8s-pod-crashing-loop  "Processing Error Alarm"  "Pod cee-dnrce301/grafana-59768df649-n6x6x (grafana) is restarting 1.03 times / 5 minutes."  InService  Critical  NETX
16:20  PCF_POD_Restarted          "Processing Error Alarm"  "Container k8s_grafana_grafana-59768df649-n6x6x_cee-dnrce301_a4ff5711-0e20-4dd4-ae7f-47296c334930_882 of pod grafana-59768df649-n6x6x in namespace cee-dnrce301 has been restarted."  InService  Major     NETX
16:14  PCF_POD_Restarted          "Processing Error Alarm"  "Container k8s_grafana_grafana-59768df649-n6x6x_cee-dnrce301_a4ff5711-0e20-4dd4-ae7f-47296c334930_881 of pod grafana-59768df649-n6x6x in namespace cee-dnrce301 has been restarted."  InService  Major     NETX
16:08  PCF_POD_Restarted          "Processing Error Alarm"  "Container k8s_grafana_grafana-59768df649-n6x6x_cee-dnrce301_a4ff5711-0e20-4dd4-ae7f-47296c334930_880 of pod grafana-59768df649-n6x6x in namespace cee-dnrce301 has been restarted."  InService  Major     NETX
16:02  PCF_POD_Restarted          "Processing Error Alarm"  "Container k8s_grafana_grafana-59768df649-n6x6x_cee-dnrce301_a4ff5711-0e20-4dd4-ae7f-47296c334930_879 of pod grafana-59768df649-n6x6x in namespace cee-dnrce301 has been restarted."  InService  Major     NETX
15:56  PCF_POD_Restarted          "Processing Error Alarm"  "Container k8s_grafana_grafana-59768df649-n6x6x_cee-dnrce301_a4ff5711-0e20-4dd4-ae7f-47296c334930_878 of pod grafana-59768df649-n6x6x in namespace cee-dnrce301 has been restarted."  InService  Major     NETX
15:53  PCF_k8s-pod-crashing-loop  "Processing Error Alarm"  "Pod cee-dnrce301/grafana-59768df649-n6x6x (grafana) is restarting 1.03 times / 5 minutes."  InService  Critical  NETX
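To narrow down why the container restarts, you can also inspect the pod from the master node. These are generic Kubernetes checks rather than CEE-specific commands, and the container name grafana is an assumption; adjust it to the container names reported by describe:
# Check restart count, recent events, and the last state of the Grafana containers
cloud-user@pod-name-smf-master-1:~$ kubectl describe pod <grafana_pod_name> -n <cee_namespace>
# Review the logs of the previous (crashed) container instance
cloud-user@pod-name-smf-master-1:~$ kubectl logs <grafana_pod_name> -n <cee_namespace> -c grafana --previous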
Workaround
Shut Down CEE
Run these commands from the CEE Ops Center in order to shut down the CEE:
[pod-name-smf/podname] cee# conf
Entering configuration mode terminal
[pod-name-smf/podname] cee(config)# system mode shutdown
[pod-name-smf/podname] cee(config)# commit
Commit complete.
[pod-name-smf/podname] cee(config)# end
Wait for the system to reach 100%.
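One way to track the shutdown progress is shown here. The exact status command and its output format can vary by CEE release, so treat this as a sketch:
# From the CEE Ops Center, check the system status until it reports 100%
[pod-name-smf/podname] cee# show system status
# From the master node, confirm that the CEE pods are terminating (namespace is a placeholder)
cloud-user@pod-name-smf-master-1:~$ kubectl get pods -n <cee_namespace>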
Remove the DB Folders for Postgres Pods
Identify the nodes on which the Postgres pods are scheduled.
In this example, all Postgres pods run on "master-1":
cloud-user@dnup0300-aio-1-master-1:~$ kubectl get pods -n cee-dnrce301 -o wide | grep postgres
postgres-0 1/1 Running 0 35d 10.108.50.28 dnup0300-aio-1-master-1 <none> <none>
postgres-1 1/1 Running 0 35d 10.108.50.47 dnup0300-aio-1-master-1 <none> <none>
postgres-2 1/1 Running 0 35d 10.108.50.102 dnup0300-aio-1-master-1 <none> <none>
For each Postgres pod, a folder is created on its node to store the database, under this path:
/data/<cee-namespace>/data-postgres-<0,1,2>
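Before you remove anything, you can confirm which folders exist on the node. The wildcard pattern is an assumption based on the example that follows; adjust it if your folder names differ:
# List the Postgres data folders on the node that hosts the Postgres pods
cloud-user@dnup0300-aio-1-master-1:~$ ls -ld /data/<cee-namespace>/data-postgres-*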
Remove those folders as shown:
cloud-user@dnup0300-aio-1-master-1:/data/cee-dnrce301$ sudo rm -rf data-postgres-0
cloud-user@dnup0300-aio-1-master-1:/data/cee-dnrce301$ sudo rm -rf data-postgres-1
cloud-user@dnup0300-aio-1-master-1:/data/cee-dnrce301$ sudo rm -rf data-postgres-2
Note: There could be cases where the folders "/data/<cee-namespace>/data-postgres-<0,1,2>" are created on different nodes, such as master-1, master-2, master-3, and so on. Remove each folder on the node where it was created.
Restore CEE
Log in to the CEE Ops Center and run these CLI commands in order to restore the CEE to running mode:
[pod-name-smf/podname] cee# conf
Entering configuration mode terminal
[pod-name-smf/podname] cee(config)# system mode running
[pod-name-smf/podname] cee(config)# commit
Commit complete.
[pod-name-smf/podname] cee(config)# end
[pod-name-smf/podname] cee# exit
Wait for the system to reach 100%.
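As an additional check, you can watch the Postgres and Grafana pods get recreated while the system returns to running mode. The namespace is a placeholder:
# Confirm the Postgres and Grafana pods are recreated and reach Running
cloud-user@pod-name-smf-master-1:~$ kubectl get pods -n <cee_namespace> -o wide | grep -E "postgres|grafana"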
Post Checks
Verify Kubernetes from the Master
Run these commands from the master node in order to check the status of the Grafana pod and the other pods:
cloud-user@pod-name-smf-master-1:~$ kubectl get pods -A -o wide | grep grafana
cloud-user@pod-name-smf-master-1:~$ kubectl get pods -A -o wide
All pods must be in the Running state without any new restarts.
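As a further check, you can list any pods that are not healthy and confirm that the Grafana restart counter no longer increases. These are generic Kubernetes commands; the namespace is a placeholder:
# Show any pods that are not in the Running or Completed state across all namespaces
cloud-user@pod-name-smf-master-1:~$ kubectl get pods -A | grep -vE "Running|Completed"
# Recheck the Grafana pod restart count after several minutes
cloud-user@pod-name-smf-master-1:~$ kubectl get pods -n <cee_namespace> | grep grafana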
Verify Alerts are Cleared from CEE
Run this command to confirm that the alerts are cleared from CEE:
[pod-name-smf/podname] cee# show alerts active summary | include "POD_Res|k8s_grafana"
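If no output is returned, no matching active alerts remain. Optionally, you can also review the alert history for the same patterns; the command form shown here is an assumption and can vary by CEE release:
[pod-name-smf/podname] cee# show alerts history summary | include "POD_Res|k8s_grafana"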