DRA_MESSAGE_PROCESSING_FAILURE_TPS_EXCEEDED
|
Critical
|
Message Processing Failure TPS exceeded, current value is {{ $value }}.
|
TPS of rejected messages from DRA Director (any message with Result code != 2001)
|
Clear
|
Message Processing Failure TPS in control.
|
DRA_DIRECTOR_TPS_EXCEEDED
|
Critical
|
{{ $labels.instance }} Director TPS exceeded, current value is {{ $value }}.
|
Success TPS of Total DRA Director (ResultCode=2001)
|
Clear
|
{{ $labels.instance }} Director TPS in control.
|
DRA_WORKER_TPS_EXCEEDED
|
Critical
|
{{ $labels.instance }} Worker TPS exceeded, current value is {{ $value }}.
|
TPS of Total Worker
|
Clear
|
{{ $labels.instance }} Worker TPS in control.
|
DRA_DB_TPS_EXCEEDED
|
Critical
|
{{ $labels.instance }} Persistence DB TPS exceeded, current value is {{ $value }}.
|
DB TPS (Query and Update)
|
Clear
|
{{ $labels.instance }} Persistence DB TPS in control.
|
DIAMETER_UNABLE_TO_DELIVER_TPS_EXCEEDED
|
Critical
|
UNABLE_TO_DELIVER TPS exceeded, current value is {{ $value }}.
|
TPS of Diameter 3002
|
Clear
|
UNABLE_TO_DELIVER in control.
|
DIAMETER_TRANSIENT_FAILURE_TPS_EXCEEDED
|
Critical
|
TRANSIENT_FAILURE TPS exceeded, current value is {{ $value }}.
|
TPS of Diameter 4xxx
|
Clear
|
TRANSIENT_FAILURE in control.
|
DIAMETER_UNKNOWN_SESSIONS_TPS_EXCEEDED
|
Critical
|
UNKNOWN_SESSIONS TPS exceeded, current value is {{ $value }}.
|
TPS of Diameter 5002
|
Clear
|
UNKNOWN_SESSIONS in control.
|
MISMATCH_REQUEST_RESPONSE
|
Critical
|
{{ $labels.remote_peer }} MISMATCH_REQUEST_RESPONSE exceeded, current value is {{ $value }}.
|
Mismatch in Rate of Request and Response (Discrepancy in ingress and egress)
|
Clear
|
{{ $labels.remote_peer }} MISMATCH_REQUEST_RESPONSE in control.
|
KEEP_ALIVE_RAR_ROUTING_FAILURE_TPS_EXCEEDED
|
Critical
|
Keep Alive RAR TPS exceeded, current value is {{ $value }}.
|
TPS of Keep Alive RAR Routing (Stale RAR)
|
Clear
|
Keep Alive RAR TPS in control.
|
EGRESS_RATE_LIMITED_SESSION_ERR_RESP_TPS_EXCEEDED
|
Critical
|
{{ $labels.local_peer }} {{ $labels.remote_peer }} Egress rate limited messages with error response TPS exceeded, current value is {{ $value }}.
|
TPS of Rate Limited Response for Error
|
Clear
|
{{ $labels.local_peer }} {{ $labels.remote_peer }} Egress rate limited messages with error response TPS in control.
|
EGRESS_RATE_LIMITED_SESSION_REJECT_TPS_EXCEEDED
|
Critical
|
{{ $labels.local_peer }} {{ $labels.remote_peer }} Egress rate limited messages dropped without error TPS exceeded, current value is {{ $value }}.
|
TPS of Rate Limited Response Rejected
|
Clear
|
{{ $labels.local_peer }} {{ $labels.remote_peer }} Egress rate limited messages dropped without error TPS in control.
|
INGRESS_RATE_LIMITED_SESSION_ERR_RESP_TPS_EXCEEDED
|
Critical
|
{{ $labels.local_peer }} {{ $labels.remote_peer }} Ingress rate limited messages with error response TPS exceeded, current value is {{ $value }}.
|
TPS of Rate Limited Response Error - Ingress
|
Clear
|
{{ $labels.local_peer }} {{ $labels.remote_peer }} Ingress rate limited messages with error response TPS in control.
|
INGRESS_RATE_LIMITED_SESSION_REJECT_TPS_EXCEEDED
|
Critical
|
{{ $labels.local_peer }} {{ $labels.remote_peer }} Ingress rate limited messages dropped without error response TPS exceeded, current value is {{ $value }}.
|
TPS of Rate Limited Response Rejected - Ingress
|
Clear
|
{{ $labels.local_peer }} {{ $labels.remote_peer }} Ingress rate limited messages dropped without error response TPS in control.
|
BINDING_STORAGE_ERRORS_TPS_EXCEEDED
|
Critical
|
Binding Store Error TPS exceeded, current value is {{ $value }}.
|
TPS Binding Storage Errors (Binding storage failed because of high load/any other database error)
|
Clear
|
Binding Store Error TPS in control.
|
BINDING_LOOKUP_ERROR_TPS_EXCEEDED
|
Critical
|
Binding Lookup Error TPS exceeded, current value is {{ $value }}.
|
TPS Binding Lookup Errors (Binding retrieval failure because of internal error)
|
Clear
|
Binding Lookup Error TPS in control.
|
DB_ERR_TPS_EXCEEDED
|
Critical
|
All DB Errors TPS exceeded, current value is {{ $value }}.
|
TPS All database errors
|
Clear
|
All DB Errors TPS in control.
|
DB_RESPONSE_TIME_EXCEEDED
|
Critical
|
{{ $labels.instance }} DB Response Time exceeded, current value is {{ $value }}.
|
DB Response Time Exceeded (database query/update operation time exceeds the threshold)
|
Clear
|
{{ $labels.instance }} DB Response Time in control, current value is {{ $value }}.
|
BINDING_KEY_NOT_FOUND_IN_AAR_TPS_EXCEEDED
|
Critical
|
{{ $labels.origin_host }} Binding Key not found in AAR TPS exceeded, current value is {{ $value }}.
|
TPS Binding Key Not Found in AAR (When AAR received with no "imsi+apn/msisdn/ipv6")
|
Clear
|
{{ $labels.origin_host }} Binding Key not found in AAR TPS in control.
|
BINDING_KEY_NOT_FOUND_IN_CCR_I_TPS_EXCEEDED
|
Critical
|
{{ $labels.origin_host }} Binding Key not found in CCR(I) TPS exceeded, current value is {{ $value }}.
|
TPS Binding Key Not Found in CCR-I (When CCR-I received with no "imsi+apn/msisdn/ipv6")
|
Clear
|
{{ $labels.origin_host }} Binding Key not found in CCR(I) TPS in control.
|
BINDING_NOT_FOUND_TPS_EXCEEDED
|
Critical
|
{{ $labels.origin_host }} Binding not found TPS exceeded, current value is {{ $value }}.
|
TPS Binding Not Found
|
Clear
|
{{ $labels.origin_host }} Binding not found TPS in control.
|
BINDING_DB_INCONSISTENT_TPS_EXCEEDED
|
Critical
|
TPS AAR with Result Code 5065 exceeded, current value is {{ $value }}.
|
TPS AAR with Result Code 5065
|
Clear
|
TPS AAR with Result Code 5065 in control.
|
BINDING_SESSION_DB_SIZE_EXCEEDED
|
Critical
|
{{ $labels.db }} size exceeded, current value is {{ $value }}.
|
Total Size of Session DB Exceeded
|
Clear
|
{{ $labels.db }} size in control.
|
BINDING_IMSI_APN_DB_SIZE_EXCEEDED
|
Critical
|
{{ $labels.db }} size exceeded, current value is {{ $value }}.
|
Total Size of IMSI / APN DB Exceeded
|
Clear
|
{{ $labels.db }} size in control.
|
BINDING_MSISDN_APN_DB_SIZE_EXCEEDED
|
Critical
|
{{ $labels.db }} size exceeded, current value is {{ $value }}.
|
Total Size of MSISDN / APN DB Exceeded
|
Clear
|
{{ $labels.db }} size in control.
|
BINDING_IPV6_DB_SIZE_EXCEEDED
|
Critical
|
{{ $labels.db }} size exceeded, current value is {{ $value }}.
|
Total Size of IPv6 DB Exceeded
|
Clear
|
{{ $labels.db }} size in control.
|
PEER_TPS_EXCEEDED
|
Critical
|
{{ $labels.instance }} Peer Connection {{ $labels.local_peer}} {{ $labels.remote_peer }} TPS exceeded, current value is {{ $value }}.
|
Peer TPS Exceeded (Per peer TPS thresholds)
|
Clear
|
{{ $labels.instance }} Peer Connection {{ $labels.local_peer}} {{ $labels.remote_peer }} TPS in control.
|
NO_RESPONSE_PEER_FOR_ANSWER_TPS_EXCEEDED
|
Critical
|
{{ $labels.instance }} No Response From Peer Connection TPS exceeded for {{ $labels.message_type}}, current value is {{ $value }}.
|
TPS No Response From Peer (timeouts from PCRF/any peer)
|
Clear
|
{{ $labels.instance }} No Response From Peer Connection TPS in control for {{ $labels.message_type}}.
|
PEER_RESPONSE_TIME_EXCEEDED
|
Critical
|
message_duration_seconds {type=~"peer_.*"} [labels: type]
|
Peer Response Time Exceeded (response time of the peer exceeds the threshold)
|
Clear
|
Response time in control.
|
NO_PEER_GROUP_MEMBER_AVAILABLE
|
Critical
|
{{ $labels.peer_group }} not available.
|
Peer Group is not Available (All peers in peer_group down)
|
Clear
|
{{ $labels.peer_group }} available.
|
PCRF_NOT_CREATING_SESSIONS_TPS_EXCEEDED
|
Critical
|
Failed CCR-I TPS exceeded, current value is {{ $value }}.
|
TPS Rate of Failed CCR-I (ResultCode != 2001)
|
Clear
|
Failed CCR-I TPS in control.
|
FORWARDING_LOOP_FOUND_TPS_EXCEEDED
|
Critical
|
{{ $labels.remote_peer}} Loop Detected TPS exceeded, current value is {{ $value }}.
|
TPS Rate of Diameter Message Loop
|
Clear
|
{{ $labels.remote_peer }} Loop Detected TPS in control.
|
RELAY_LINK_TPS_GT_0
|
Critical
|
{{ $labels.remote_peer}} Relay Started, current value is {{ $value }}.
|
TPS Rate of Relay Peer > 0 (When relay peers start exchanging control plane messages)
|
Clear
|
{{ $labels.remote_peer}} Relay Stated.
|
RELAY_LINK_TPS_EXCEEDED
|
Critical
|
{{ $labels.remote_peer}} Relay Link TPS exceeded, current value is {{ $value }}.
|
TPS Rate of Relay Peer (TPS of relay messages)
|
Clear
|
{{ $labels.remote_peer}} Relay Link TPS in control.
|
RELAY_LINK_STATUS
|
Critical
|
{{ $labels.remote_peer }} Relay Link is Down.
|
Relay Link is Down (Relay link status is monitored)
|
Clear
|
{{ $labels.remote_peer}} Relay Link is UP.
|
NO_RELAY_PEER_TPS_EXCEEDED
|
Critical
|
{{ $labels.remote_peer}} Relay Peer TPS exceeded, current value is {{ $value }}.
|
TPS Rate of Relay Peer Failure
|
Clear
|
{{ $labels.remote_peer}} Relay Peer TPS in control.
|
SESSION_DB_LIMIT_EXCEEDED
|
Alert
|
Session max DB limit reached
|
This alarm is generated when session database count crosses maximum limit configured using CLI for db-max-record-limit.
|
Clear
|
Session max DB limit reached alarm cleared
|
This alarm is cleared when session database count drops below maximum limit configured using CLI for db-max-record-limit.
|
IPV6_DB_LIMIT_EXCEEDED
|
Alert
|
IPv6 max DB limit reached
|
This alarm is generated when IPv6 database count crosses maximum limit configured using CLI for db-max-record-limit.
|
Clear
|
IPv6 max DB limit reached alarm cleared
|
This alarm is cleared when IPv6 database count drops below maximum limit configured using CLI for db-max-record-limit.
|
IPV4_DB_LIMIT_EXCEEDED
|
Alert
|
IPv4 max DB limit reached
|
This alarm is generated when IPv4 database count crosses maximum limit configured using CLI for db-max-record-limit.
|
Clear
|
IPv4 max DB limit reached alarm cleared
|
This alarm is cleared when IPv4 database count drops below maximum limit configured using CLI for db-max-record-limit.
|
IMSIAPN_DB_LIMIT_EXCEEDED
|
Alert
|
ImsiApn max DB limit reached
|
This alarm is generated when ImsiApn database count crosses maximum limit configured using CLI for db-max-record-limit.
|
Clear
|
ImsiApn max DB limit reached alarm cleared
|
This alarm is cleared when ImsiApn database count drops below maximum limit configured using CLI for db-max-record-limit.
|
MSISDNAPN_DB_LIMIT_EXCEEDED
|
Alert
|
MsisdnApn max DB limit reached
|
This alarm is generated when MsisdnApn database count crosses maximum limit configured using CLI for db-max-record-limit.
|
Clear
|
MsisdnApn max DB limit reached alarm cleared
|
This alarm is cleared when MsisdnApn database count drops below maximum limit configured using CLI for db-max-record-limit.
|
CRD_CACHE_LOAD_ERROR
|
Critical
|
Error when loading CRD cache
|
This alarm is generated when the CRD is not loaded properly or is loaded with an error value of “1”.
|
Clear
|
CRD cache loaded successfully
|
This alarm is cleared when the CRD cache is updated properly with a value of “0”.
|
APP_SERVICE_HEALTH_STATUS_CRD*
|
Critical
|
{{$labels.service}} service is Unhealthy!
|
This alarm is generated when the CRD service is unhealthy (value is “1”).
|
Clear
|
{{$labels.service}} service is Healthy.
|
This alarm is cleared when the CRD service is healthy (value is “0”).
|
APP_SERVICE_HEALTH_STATUS_METADATA_DB*
|
Critical
|
{{$labels.service}} service is Unhealthy!
|
This alarm is generated when the Metadata DB service is unhealthy (value is “1”).
|
Clear
|
{{$labels.service}} service is Healthy.
|
This alarm is cleared when the Metadata DB service is healthy (value is “0”).
|
VIP_NOT_ACTIVE_ON_PREFERRED*
|
Critical
|
VIP {{ $labels.vip }} active on {{ $labels.currentHost }} and not active on preferred {{ $labels.preferredHost }}
|
This alarm is generated when the VIP is not present on the preferred director or distributor.
|
Clear
|
VIP {{ $labels.vip }} active on preferred {{ $labels.preferredHost }}
|
This alarm is cleared when the VIP is present on the preferred director or distributor.
|
PEER_DYNAMIC_RATE_LIMIT_THROTTLING*
|
Critical
|
Dynamic Rate limit is active
|
This alarm is generated when any one peer connected to a director is in throttling mode.
sum(peer_dynamic_rate_limit_throttling) != 0
|
Clear
|
Dynamic Rate limit is not active
|
This alarm is cleared when no peer connected to a Director is in throttling mode.
sum(peer_dynamic_rate_limit_throttling) == 0
|
NO_DB_CPU_THRESHOLD_STATUS*
|
Critical
|
{{$labels.instance}} is not receiving any threshold message
|
Director is not receiving any threshold status messages from Worker.
sum(rate(processed_db_cpu_control_message_total[30s])) == 0
|
Clear
|
{{$labels.instance}} is receiving throttling messages
|
Director is receiving threshold status messages from Worker.
sum(rate(processed_db_cpu_control_message_total[30s])) != 0
|
QNS_LOGGING_STOPPED*
|
Critical
|
Application logging has stopped on {{$labels.hostname}} at {{$labels.last_updated_time}} with connections closed {{$labels.tcp_closed}}
|
This alarm is generated when the application has unexpectedly stopped logging consolidated-qns logs.
Note: If there is no activity on the system, it is expected that this alert is raised. It is resolved automatically when application activity starts.
|
Clear
|
Application logging is successful on {{$labels.hostname}} at {{$labels.last_updated_time}}
|
This alarm is cleared when the application is successfully logging consolidated-qns logs.
|
DRA_PCRF_QUERY_NODE_INACTIVE*
|
Critical
|
{{$labels.url_endpoint}} is Inactive!
|
This alarm is generated when the PCRF REST endpoint URL heartbeat message fails (value is “1”).
|
Clear
|
{{$labels.url_endpoint}} is Active
|
This alarm is cleared when the PCRF REST endpoint URL heartbeat message succeeds (value is “0”).
|
DRA_PCRF_QUERY_TPS_EXCEEDED*
|
Critical
|
{{$labels.instance}} Pcrf Session Query TPS exceeded, current value is {{ $value }}
|
This alarm is generated when the PCRF REST API TPS exceeds the threshold, that is, when the value is greater than “5”.
|
Clear
|
{{ $labels.instance }} Pcrf Session Query TPS in control
|
This alarm is cleared when the PCRF REST API TPS is under control, that is, when the value is less than “5”.
|
RELAY_TRAFFIC_THRESHOLD_EXCEEDED*
|
Critical
|
Relay traffic exceeded the threshold of 20%. Current value is {{ $value }}%
|
This alarm is generated if relay traffic exceeds a certain percentage of total traffic.
|
Clear
|
Relay traffic % is under control
|
This alarm is cleared if relay traffic is under a certain percentage of total traffic.
|
LOCAL_PUBLISH_STOPPED*
|
Critical
|
Local publish stopped for {{ $labels.instance }}
|
This alarm is generated if topology is incomplete and global end point is missing.
|
Clear
|
Local publish started for {{ $labels.instance }}
|
This alarm is cleared if topology is complete and global end point exists.
|
GLOBAL_PUBLISH_STOPPED*
|
Critical
|
Global publish stopped for {{ $labels.instance }}
|
This alarm is generated if topology is incomplete and local end point is missing.
|
Clear
|
Global publish started for {{ $labels.instance }}
|
This alarm is cleared if topology is complete and local end point exists.
|
DIAMETER_ENDPOINTS_MISSING_LOST_REDIS*
|
Critical
|
Diameter Endpoints missing due to Redis connection lost
|
This alarm is generated if the Diameter endpoint is missing the Redis configuration.
|
Clear
|
Redis connection restored. Diameter Endpoints are restored
|
This alarm is cleared if the Redis configuration exists in the Diameter endpoint.
|
DIAMETER_PEER_EXPIRATIONS_EXCEEDED*
|
Critical
|
{{$labels.origin_host}} got EXPIRED in {{$labels.system}}
|
This alarm is generated if any peer has expired.
|
Clear
|
Peer expiration got reset for {{$labels.origin_host}}
|
This alarm is cleared if the peer expiration is reset.
|
ELASTICSEARCH_NOT_REACHABLE
|
Critical
|
Elasticsearch server is unreachable with status {{$labels.reachable_status}} with tcp connection status {{$labels.tcp_connected}}
|
This alarm is generated when Elasticsearch is not reachable from DRA or the TCP connections are not healthy.
|
Clear
|
Elasticsearch server is reachable now !!!
|
This alarm is cleared when Elasticsearch is reachable from DRA and the TCP connections are healthy.
|
TLS_CERT_EXPIRY
|
Critical, Major, and Minor
|
certificate will expire in {{$value}} days!
|
This alarm monitors the expiry date of the TLS certificate.
|
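
As a rough illustration, the two rows above that list a threshold expression (PEER_DYNAMIC_RATE_LIMIT_THROTTLING and NO_DB_CPU_THRESHOLD_STATUS) could be written as Prometheus alerting rules along the following lines. This is only a sketch in standard Prometheus rule-file syntax: the alert names, expressions, and message text come from the table, while the group name, the `for` hold times, and the `severity` label key are assumptions made for the example. The product may define these alerts through its own configuration interface rather than a rule file.

```yaml
# Hypothetical rule file. Group name, "for" durations, and the severity
# label key are assumed for illustration and are not taken from the table.
groups:
  - name: dra-example-alerts        # assumed group name
    rules:
      - alert: PEER_DYNAMIC_RATE_LIMIT_THROTTLING
        # Fires while the summed throttling metric is non-zero.
        expr: sum(peer_dynamic_rate_limit_throttling) != 0
        for: 1m                     # assumed hold time
        labels:
          severity: critical
        annotations:
          summary: "Dynamic Rate limit is active"

      - alert: NO_DB_CPU_THRESHOLD_STATUS
        # Fires when the 30s rate of processed control messages is zero.
        expr: sum(rate(processed_db_cpu_control_message_total[30s])) == 0
        for: 1m                     # assumed hold time
        labels:
          severity: critical
        annotations:
          summary: "Director is not receiving any threshold status messages from Worker."
```

Because both expressions aggregate with `sum(...)` and no `by` clause, per-instance labels are not available in the resulting alert; a message template that references `{{ $labels.instance }}` would need an expression that preserves that label.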