This document describes how to troubleshoot Drain of Unprocessed Events and Frequent Drain of Events health alerts on Firepower Management Center.
The Firepower Management Center (FMC) generates one of these health alerts:
Although these events are generated and shown on the FMC, they relate to a managed sensor, whether it is a Firepower Threat Defense (FTD) device or a Next-Generation Intrusion Prevention System (NGIPS) device. In the rest of this document, the term sensor refers to both FTD and NGIPS devices unless otherwise specified.
This is the health alert structure:
In this example, the SILO NAME is Unified Low Priority Events. This is one of the disk manager silos (see Background Information section for a more comprehensive explanation).
Additionally:
Additional symptoms can include:
A Frequent drain of <SILO NAME> event is caused by too much input into the silo for its size. In this case, the disk manager drains (purges) that silo at least twice within the last 5-minute interval. In an event type silo, this is typically caused by excessive logging of that event type. In the case of a Drain of unprocessed events of <SILO NAME> health alert, the cause can also be a bottleneck in the event processing path.
In the diagram there are 3 potential bottlenecks:
For a deeper understanding of the Event Processing architecture, refer to the respective Deep Dive section.
As stated in the previous section, one of the most common causes for the health alerts of this type is excessive input.
The difference between the Low Water Mark (LWM) and the High Water Mark (HWM), both gathered from the show disk-manager CLISH command, shows how much space the silo can take on to go from the LWM (freshly drained) to the HWM value. If there are frequent drains of events (with or without unprocessed events), the first thing to review is the logging configuration.
For an in-depth explanation of the Disk Manager process, refer to the respective Deep Dive section.
Whether it is double logging or just a high overall rate of events in the manager-sensor ecosystem, you must review the logging settings.
Step 1. Check for double logging.
Double logging scenarios can be identified if you look at the correlator perfstats on the FMC as shown in this output:
admin@FMC:~$ sudo perfstats -Cq < /var/sf/rna/correlator-stats/now
129 statistics lines read
host limit: 50000 0 50000
pcnt host limit in use: 0.01 0.01 0.01
rna events/second: 0.00 0.00 0.06
user cpu time: 0.48 0.21 10.09
system cpu time: 0.47 0.00 8.83
memory usage: 2547304 0 2547304
resident memory usage: 28201 0 49736
rna flows/second: 126.41 0.00 3844.16
rna dup flows/second: 69.71 0.00 2181.81
ids alerts/second: 0.00 0.00 0.00
ids packets/second: 0.00 0.00 0.00
ids comm records/second: 0.02 0.01 0.03
ids extras/second: 0.00 0.00 0.00
fw_stats/second: 0.00 0.00 0.03
user logins/second: 0.00 0.00 0.00
file events/second: 0.00 0.00 0.00
malware events/second: 0.00 0.00 0.00
fireamp events/second: 0.00 0.00 0.00
In this case, a high rate of duplicated flows can be seen in the output.
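As an optional quick check, you can compute the ratio of duplicate flows to total flows with standard shell tools (a minimal sketch that assumes the perfstats summary format shown above):

admin@FMC:~$ sudo perfstats -Cq < /var/sf/rna/correlator-stats/now | awk -F': ' '/rna flows\/second/ {flows=$2+0} /rna dup flows\/second/ {dup=$2+0} END {if (flows > 0) printf "average duplicate flow ratio: %.0f%%\n", 100*dup/flows}'

For the previous output this yields roughly 55%, which points to a double logging scenario.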
Step 2. Review the logging settings of the ACP.
You must start with a review of the logging settings of the Access Control Policy (ACP). Ensure that you use the best practices described in the document Best Practices for Connection Logging.
A review of the logging settings is advisable in all situations as the recommendations listed do not just cover double logging scenarios.
To check the rate of generated events on FTD, check this file and focus on the TotalEvents and PerSec columns:
admin@firepower:/ngfw/var/log$ sudo more EventHandlerStats.2023-08-13 | grep Total | more
{"Time": "2023-08-13T00:03:37Z", "TotalEvents": 298, "PerSec": 0, "UserCPUSec": 0.995, "SysCPUSec": 4.598, "%CPU": 1.9, "MemoryKB": 33676}
{"Time": "2023-08-13T00:08:37Z", "TotalEvents": 298, "PerSec": 0, "UserCPUSec": 1.156, "SysCPUSec": 4.280, "%CPU": 1.8, "MemoryKB": 33676}
{"Time": "2023-08-13T00:13:37Z", "TotalEvents": 320, "PerSec": 1, "UserCPUSec": 1.238, "SysCPUSec": 4.221, "%CPU": 1.8, "MemoryKB": 33676}
{"Time": "2023-08-13T00:18:37Z", "TotalEvents": 312, "PerSec": 1, "UserCPUSec": 1.008, "SysCPUSec": 4.427, "%CPU": 1.8, "MemoryKB": 33676}
{"Time": "2023-08-13T00:23:37Z", "TotalEvents": 320, "PerSec": 1, "UserCPUSec": 0.977, "SysCPUSec": 4.465, "%CPU": 1.8, "MemoryKB": 33676}
{"Time": "2023-08-13T00:28:37Z", "TotalEvents": 299, "PerSec": 0, "UserCPUSec": 1.066, "SysCPUSec": 4.361, "%CPU": 1.8, "MemoryKB": 33676}
Step 3. Check whether the excessive logging is expected or not.
Review whether the excessive logging has an expected cause. If the excessive logging is caused by a DoS/DDoS attack, a routing loop, or a specific application/host that makes a huge number of connections, check and mitigate/stop the connections from the unexpected sources.
Step 4. Check for corrupted diskmanager.log file.
Normally, an entry has 12 comma-separated values. To check for corrupted lines that have a different number of fields, run:
admin@firepower:/ngfw/var/log$ sudo cat diskmanager.log | awk -F',' 'NF != 12 {print}'
admin@firepower:/ngfw/var/log$
Corrupted lines (lines with a number of fields other than 12) are shown. In this example, the empty output indicates that the file contains no corrupted lines.
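As a complementary check, you can summarize how many fields each line contains; in a healthy file, only a count of 12 appears (a minimal sketch with standard shell tools):

admin@firepower:/ngfw/var/log$ sudo awk -F',' '{print NF}' diskmanager.log | sort -n | uniq -c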
Step 5. Upgrade model.
Upgrade the FTD hardware to a higher-performance model (for example, FPR2100 --> FPR4100); the silo sizes increase with the larger storage.
Step 6. Consider if you can disable Log to Ramdisk.
In the case of the Unified Low Priority Events silo, you can disable Log to Ramdisk to increase the silo size with the drawbacks discussed in the respective Deep Dive section.
Another common cause for this type of alert is connectivity issues and/or instability in the communication channel (sftunnel) between the sensor and the FMC. The communication issue can be due to:
For the sftunnel connectivity issue, ensure that the FMC and the sensor have reachability between their management interfaces on TCP port 8305.
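A quick way to confirm that the sftunnel connection is established on TCP port 8305 is to check the socket state from expert mode on either peer (a minimal sketch; it assumes the netstat utility is available on the device):

admin@firepower:~$ sudo netstat -an | grep 8305

An ESTABLISHED entry towards the peer management IP address indicates that the channel is up.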
On FTD you can search for sftunneld string in the [/ngfw]/var/log/messages file. Connectivity issues cause messages like these to be generated:
Sep 9 15:41:35 firepower SF-IMS[5458]: [27602] sftunneld:sf_ch_util [INFO] Delay for heartbeat reply on channel from 10.62.148.75 for 609 seconds. dropChannel...
Sep 9 15:41:35 firepower SF-IMS[5458]: [27602] sftunneld:sf_connections [INFO] Ping Event Channel for 10.62.148.75 failed
Sep 9 15:41:35 firepower SF-IMS[5458]: [27602] sftunneld:sf_channel [INFO] >> ChannelState dropChannel peer 10.62.148.75 / channelB / EVENT [ msgSock2 & ssl_context2 ] <<
Sep 9 15:41:35 firepower SF-IMS[5458]: [27602] sftunneld:sf_channel [INFO] >> ChannelState freeChannel peer 10.62.148.75 / channelB / DROPPED [ msgSock2 & ssl_context2 ] <<
Sep 9 15:41:35 firepower SF-IMS[5458]: [27602] sftunneld:sf_connections [INFO] Need to send SW version and Published Services to 10.62.148.75
Sep 9 15:41:35 firepower SF-IMS[5458]: [27602] sftunneld:sf_peers [INFO] Confirm RPC service in CONTROL channel
Sep 9 15:41:35 firepower SF-IMS[5458]: [27602] sftunneld:sf_channel [INFO] >> ChannelState do_dataio_for_heartbeat peer 10.62.148.75 / channelA / CONTROL [ msgSock & ssl_context ] <<
Sep 9 15:41:48 firepower SF-IMS[5458]: [5464] sftunneld:tunnsockets [INFO] Started listening on port 8305 IPv4(10.62.148.180) management0
Sep 9 15:41:51 firepower SF-IMS[5458]: [27602] sftunneld:control_services [INFO] Successfully Send Interfaces info to peer 10.62.148.75 over managemen
Sep 9 15:41:53 firepower SF-IMS[5458]: [5465] sftunneld:sf_connections [INFO] Start connection to : 10.62.148.75 (wait 10 seconds is up)
Sep 9 15:41:53 firepower SF-IMS[5458]: [27061] sftunneld:sf_peers [INFO] Peer 10.62.148.75 needs the second connection
Sep 9 15:41:53 firepower SF-IMS[5458]: [27061] sftunneld:sf_ssl [INFO] Interface management0 is configured for events on this Device
Sep 9 15:41:53 firepower SF-IMS[5458]: [27061] sftunneld:sf_ssl [INFO] Connect to 10.62.148.75 on port 8305 - management0
Sep 9 15:41:53 firepower SF-IMS[5458]: [27061] sftunneld:sf_ssl [INFO] Initiate IPv4 connection to 10.62.148.75 (via management0)
Sep 9 15:41:53 firepower SF-IMS[5458]: [27061] sftunneld:sf_ssl [INFO] Initiating IPv4 connection to 10.62.148.75:8305/tcp
Sep 9 15:41:53 firepower SF-IMS[5458]: [27061] sftunneld:sf_ssl [INFO] Wait to connect to 8305 (IPv6): 10.62.148.75
Oversubscription of the FMC's management interface can be caused by a spike in management traffic or by constant oversubscription. Historical data from the Health Monitor is a good indicator of this.
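As an additional spot check from the FMC expert shell, you can look at the management interface counters for drops or errors (a minimal sketch; it assumes the standard ifconfig utility is present and that eth0 is the management interface):

admin@FMC:~$ sudo ifconfig eth0 | grep -iE 'errors|dropped|overruns'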
The first thing to note is that in most cases the FMC is deployed with a single NIC for management. This interface is used for:
You can deploy a second NIC on the FMC for a dedicated event interface. Implementations can depend on the use case.
General guidelines can be found in the FMC Hardware Guide, Deploying on a Management Network.
The last scenario to be covered is when the bottleneck happens on the SFDataCorrelator side (FMC).
The first step is to look into the diskmanager.log file as there is important information to be gathered such as:
The occurrence of the drain with Unprocessed Events.
For information about the diskmanager.log file and how to interpret it, you can refer to the Disk Manager section. The information gathered from the diskmanager.log can be used to help narrow down the subsequent steps.
Additionally, you need to look at the correlator performance statistics:
admin@FMC:~$ sudo perfstats -Cq < /var/sf/rna/correlator-stats/now
129 statistics lines read
host limit: 50000 0 50000
pcnt host limit in use: 100.01 100.00 100.55
rna events/second: 1.78 0.00 48.65
user cpu time: 2.14 0.11 58.20
system cpu time: 1.74 0.00 41.13
memory usage: 5010148 0 5138904
resident memory usage: 757165 0 900792
rna flows/second: 101.90 0.00 3388.23
rna dup flows/second: 0.00 0.00 0.00
ids alerts/second: 0.00 0.00 0.00
ids packets/second: 0.00 0.00 0.00
ids comm records/second: 0.02 0.01 0.03
ids extras/second: 0.00 0.00 0.00
fw_stats/second: 0.01 0.00 0.08
user logins/second: 0.00 0.00 0.00
file events/second: 0.00 0.00 0.00
malware events/second: 0.00 0.00 0.00
fireamp events/second: 0.00 0.00 0.01
These statistics are for the FMC and they correspond to the aggregate of all the sensors managed by it. In the case of Unified low priority events you mainly look for:
Based on the output it can be concluded that:
More information about the SFDataCorrelator process can be found under the Event Processing section.
First, you need to determine when the spike occurred. To do this you need to look at the correlator statistics per each 5-minute sample interval. The information gathered from the diskmanager.log can help you to go straight to the important timeframe.
Tip: Pipe the output to the Linux pager less so that you can easily search.
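For example, on the FMC:

admin@FMC:~$ sudo perfstats -C < /var/sf/rna/correlator-stats/now | less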
admin@FMC:~$ sudo perfstats -C < /var/sf/rna/correlator-stats/now
<OUTPUT OMITTED FOR READABILITY>
Wed Sep 9 16:01:35 2020
host limit: 50000
pcnt host limit in use: 100.14
rna events/second: 24.33
user cpu time: 7.34
system cpu time: 5.66
memory usage: 5007832
resident memory usage: 797168
rna flows/second: 638.55
rna dup flows/second: 0.00
ids alerts/second: 0.00
ids pkts/second: 0.00
ids comm records/second: 0.02
ids extras/second: 0.00
fw stats/second: 0.00
user logins/second: 0.00
file events/second: 0.00
malware events/second: 0.00
fireAMP events/second: 0.00

Wed Sep 9 16:06:39 2020
host limit: 50000
pcnt host limit in use: 100.03
rna events/second: 28.69
user cpu time: 16.04
system cpu time: 11.52
memory usage: 5007832
resident memory usage: 801476
rna flows/second: 685.65
rna dup flows/second: 0.00
ids alerts/second: 0.00
ids pkts/second: 0.00
ids comm records/second: 0.01
ids extras/second: 0.00
fw stats/second: 0.00
user logins/second: 0.00
file events/second: 0.00
malware events/second: 0.00
fireAMP events/second: 0.00

Wed Sep 9 16:11:42 2020
host limit: 50000
pcnt host limit in use: 100.01
rna events/second: 47.51
user cpu time: 16.33
system cpu time: 12.64
memory usage: 5007832
resident memory usage: 809528
rna flows/second: 1488.17
rna dup flows/second: 0.00
ids alerts/second: 0.00
ids pkts/second: 0.00
ids comm records/second: 0.02
ids extras/second: 0.00
fw stats/second: 0.01
user logins/second: 0.00
file events/second: 0.00
malware events/second: 0.00
fireAMP events/second: 0.00

Wed Sep 9 16:16:42 2020
host limit: 50000
pcnt host limit in use: 100.00
rna events/second: 8.57
user cpu time: 58.20
system cpu time: 41.13
memory usage: 5007832
resident memory usage: 837732
rna flows/second: 3388.23
rna dup flows/second: 0.00
ids alerts/second: 0.00
ids pkts/second: 0.00
ids comm records/second: 0.01
ids extras/second: 0.00
fw stats/second: 0.03
user logins/second: 0.00
file events/second: 0.00
malware events/second: 0.00
fireAMP events/second: 0.00

197 statistics lines read
host limit: 50000 0 50000
pcnt host limit in use: 100.01 100.00 100.55
rna events/second: 1.78 0.00 48.65
user cpu time: 2.14 0.11 58.20
system cpu time: 1.74 0.00 41.13
memory usage: 5010148 0 5138904
resident memory usage: 757165 0 900792
rna flows/second: 101.90 0.00 3388.23
rna dup flows/second: 0.00 0.00 0.00
ids alerts/second: 0.00 0.00 0.00
ids packets/second: 0.00 0.00 0.00
ids comm records/second: 0.02 0.01 0.03
ids extras/second: 0.00 0.00 0.00
fw_stats/second: 0.01 0.00 0.08
user logins/second: 0.00 0.00 0.00
file events/second: 0.00 0.00 0.00
malware events/second: 0.00 0.00 0.00
fireamp events/second: 0.00 0.00 0.01
Use the information in the output to:
In the previous example, there is an obvious spike in the rate of events received at 16:06:39 and beyond. These are 5-minute averages, so the increase can be more abrupt than shown (a burst) but diluted within this 5-minute interval if it started towards the end of it.
Although this leads to the conclusion that this spike of events caused the Drain of unprocessed events, you can take a look at the connection events from the FMC Graphical User Interface (GUI) with the appropriate time window to understand what type of connections traversed the FTD box in this spike:
Apply this time window to get the filtered Connection Events. Do not forget to account for the timezone. In this example, the sensor uses UTC and the FMC UTC+1. Use the Table View to see the events that triggered the overload of events and take action accordingly:
Based on the timestamps (time of the first and last packet) it can be seen that these are short-lived connections. Furthermore, the Initiator and Responder Packets columns show that there was only 1 packet exchanged in each direction. This confirms that the connections were short-lived and exchanged very little data.
You can also see that all these flows target the same responder IPs and port. Also, they are all reported by the same sensor (which, alongside the Ingress and Egress interface information, can indicate the place and direction of these flows). Additional actions:
Note: The intent of this article is to provide guidelines to troubleshoot the Drain of Unprocessed Events alert. This example used hping3 to generate a TCP SYN flood to the destination server. For guidelines to harden your FTD device, check the Cisco Firepower Threat Defense Hardening Guide.
It is highly recommended to collect these items before you contact Cisco TAC:
This section covers an in-depth explanation of the various components that can take part in this type of health alerts. This includes:
To understand the Drain of Events health alerts and be able to identify potential failure points, there is a need to look into how these components work and interact with each other.
Even though the Frequent Drain type of health alerts can be triggered by silos that are not event-related, the vast majority of the cases seen by Cisco TAC are related to drain of event-related information. Additionally, to understand what constitutes a drain of unprocessed events there is a need to take a look at the event processing architecture and the components that constitute it.
When a Firepower sensor receives a packet from a new connection, the Snort process generates an event in unified2 format, a binary format that allows for quicker reads/writes as well as smaller events.
The output shows the FTD command system support trace where you can see a new connection created. The important parts are highlighted and explained:
192.168.0.2-42310 - 192.168.1.10-80 6 AS 1-1 CID 0 Packet: TCP, SYN, seq 3310981951
192.168.0.2-42310 - 192.168.1.10-80 6 AS 1-1 CID 0 Session: new snort session
192.168.0.2-42310 - 192.168.1.10-80 6 AS 1-1 CID 0 AppID: service unknown (0), application unknown (0)
192.168.0.2-42310 > 192.168.1.10-80 6 AS 1-1 I 0 new firewall session
192.168.0.2-42310 > 192.168.1.10-80 6 AS 1-1 I 0 using HW or preset rule order 4, 'Default Inspection', action Allow and prefilter rule 0
192.168.0.2-42310 > 192.168.1.10-80 6 AS 1-1 I 0 HitCount data sent for rule id: 268437505,
192.168.0.2-42310 > 192.168.1.10-80 6 AS 1-1 I 0 allow action
192.168.0.2-42310 - 192.168.1.10-80 6 AS 1-1 CID 0 Firewall: allow rule, 'Default Inspection', allow
192.168.0.2-42310 - 192.168.1.10-80 6 AS 1-1 CID 0 Snort id 0, NAP id 1, IPS id 0, Verdict PASS
Snort unified_events files are generated per instance under the path [/ngfw]/var/sf/detection_engines/*/instance-N/, where:
There can be 2 types of unified_events files in any given Snort instance folder:
A high priority event is an event that corresponds to a potentially malicious connection.
Types of events and their priority:
High Priority (1) | Low Priority (2)
---|---
Intrusion | Connection
Malware | Discovery
Security Intelligence | File
Associated Connection Events | Statistics
The next output shows an event that belongs to the new connection traced in the previous example. The format is unified2, and the event is taken from the respective unified event log located under [/ngfw]/var/sf/detection_engines/*/instance-1/, where 1 is the Snort instance ID shown in the previous trace output (Snort id 0) plus 1. The unified event log file name uses the syntax unified_events-2.log.1599654750, where 2 stands for the priority of the events as shown in the table and the last portion (1599654750) is the timestamp (Unix time) of when the file was created.
Tip: You can use the Linux date command to convert the Unix time into a readable date:
admin@FP1120-2:~$ sudo date -d@1599654750
Wed Sep 9 14:32:30 CEST 2020
Unified2 Record at offset 2190389
Type: 210(0x000000d2)
Timestamp: 0
Length: 765 bytes
Forward to DC: Yes
FlowStats:
Sensor ID: 0
Service: 676
NetBIOS Domain: <none>
Client App: 909, Version: 1.20.3 (linux-gnu)
Protocol: TCP
Initiator Port: 42310
Responder Port: 80
First Packet: (1599662092) Tue Sep 9 14:34:52 2020
Last Packet: (1599662092) Tue Sep 9 14:34:52 2020
<OUTPUT OMITTED FOR READABILITY>
Initiator: 192.168.0.2
Responder: 192.168.1.10
Original Client: ::
Policy Revision: 00000000-0000-0000-0000-00005f502a92
Rule ID: 268437505
Tunnel Rule ID: 0
Monitor Rule ID: <none>
Rule Action: 2
Alongside every unified_events file there is a bookmark file, which contains 2 important values:
The values appear in that order, separated by a comma, as shown in this example:
root@FTD:/home/admin# cat /var/sf/detection_engines/d5a4d5d0-6ddf-11ea-b364-2ac815c16717/instance-1/unified_events-2.log.bookmark.1a3d52e6-3e09-11ea-838f-68e7af919059
1599862498, 18754115
This allows the disk manager process to know which events have already been processed (sent to FMC) and which ones have not.
Note: When the disk manager drains an event silo, it removes unified event files.
For more information about the drain of silos read the Disk Manager section.
A drained unified file is deemed to have unprocessed events when one of these is true:
The EventHandler process reads events from the unified files and streams them to the FMC (as metadata) via sftunnel, which is the process responsible for the encrypted communication between the sensor and the FMC. This is a TCP-based connection, so the event streaming is acknowledged by the FMC.
You can see these messages in the [/ngfw]/var/log/messages file:
sfpreproc:OutputFile [INFO] *** Opening /ngfw/var/sf/detection_engines/77d31ce2-c2fc-11ea-b470-d428d53ed3ae/instance-1/unified_events-2.log.1597810478 for output" in /var/log/messages
EventHandler:SpoolIterator [INFO] Opened unified event file /var/sf/detection_engines/77d31ce2-c2fc-11ea-b470-d428d53ed3ae/instance-1/unified_events-2.log.1597810478
sftunneld:FileUtils [INFO] Processed 10334 events from log file var/sf/detection_engines/77d31ce2-c2fc-11ea-b470-d428d53ed3ae/instance-1/unified_events-2.log.1597810478
This output provides this information:
The bookmark file is then updated accordingly. The sftunnel uses 2 different channels called Unified Events (UE) Channel 0 and 1 for high and low priority events respectively.
With the sftunnel_status CLI command on the FTD, you can see the number of events that were streamed.
Priority UE Channel 1 service
TOTAL TRANSMITTED MESSAGES <530541> for UE Channel service
RECEIVED MESSAGES <424712> for UE Channel service
SEND MESSAGES <105829> for UE Channel service
FAILED MESSAGES <0> for UE Channel service
HALT REQUEST SEND COUNTER <17332> for UE Channel service
STORED MESSAGES for UE Channel service (service 0/peer 0)
STATE <Process messages> for UE Channel service
REQUESTED FOR REMOTE <Process messages> for UE Channel service
REQUESTED FROM REMOTE <Process messages> for UE Channel service
In the FMC, the events are received by the SFDataCorrelator process.
The status of events that were processed from each sensor can be seen with the stats_unified.pl command:
admin@FMC:~$ sudo stats_unified.pl
Current Time - Fri Sep 9 23:00:47 UTC 2020
**********************************************************************************
* FTD - 60a0526e-6ddf-11ea-99fa-89a415c16717, version 6.6.0.1
**********************************************************************************
Channel Backlog Statistics (unified_event_backlog)
Chan Last Time Bookmark Time Bytes Behind
0 2020-09-09 23:00:30 2020-09-07 10:41:50 0
1 2020-09-09 23:00:30 2020-09-09 22:14:58 6960
This command shows the status of the event backlog for a given device, per channel. The Channel ID used is the same as in the sftunnel.
The Bytes Behind value can be computed as the difference between the size of the unified event file and the position shown in the respective bookmark file, plus the size of any subsequent unified event file with a higher timestamp than the one in the bookmark file.
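As an illustration only, this hypothetical sketch (run from the sensor expert shell; bash and standard tools such as stat are assumed) applies the same formula to the low priority unified_events files of one Snort instance:

# Pick one Snort instance directory (adjust the wildcard/instance as needed)
INST=$(ls -d /ngfw/var/sf/detection_engines/*/instance-1 | head -1)
# Read the timestamp and byte position stored in the priority-2 bookmark file
BM=$(ls "$INST"/unified_events-2.log.bookmark.* | head -1)
TS=$(cut -d',' -f1 "$BM")
POS=$(cut -d',' -f2 "$BM" | tr -d ' ')
# Size of the unified_events file the bookmark points to
CUR=$(stat -c %s "$INST/unified_events-2.log.$TS" 2>/dev/null || echo 0)
# Add the size of every unified_events file with a higher timestamp than the bookmark
NEWER=0
for f in "$INST"/unified_events-2.log.[0-9]*; do
  [ -e "$f" ] || continue
  FT=${f##*.}
  [ "$FT" -gt "$TS" ] && NEWER=$(( NEWER + $(stat -c %s "$f") ))
done
echo "Bytes behind (channel 1): $(( CUR - POS + NEWER ))"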
The SFDataCorrelator process also stores performance statistics, which are saved in /var/sf/rna/correlator-stats/. One file is created per day to store that day's performance statistics in CSV format. The file name uses the format YYYY-MM-DD, and the file that corresponds to the current day is called now.
The statistics are gathered every 5 minutes (there is one line per each 5-minute interval).
The output of this file can be read with the perfstats command.
Note: This command is also used to read snort performance statistics files, so the appropriate flags must be used:
-C: Instructs perfstats that the input is a correlator-stats file (without this flag, perfstats assumes the input is a snort performance statistics file).
-q: Quiet mode, prints only the summary for the file.
admin@FMC:~$ sudo perfstats -Cq < /var/sf/rna/correlator-stats/now
287 statistics lines read
host limit: 50000 0 50000
pcnt host limit in use: 100.01 100.00 100.55
rna events/second: 1.22 0.00 48.65
user cpu time: 1.56 0.11 58.20
system cpu time: 1.31 0.00 41.13
memory usage: 5050384 0 5138904
resident memory usage: 801920 0 901424
rna flows/second: 64.06 0.00 348.15
rna dup flows/second: 0.00 0.00 37.05
ids alerts/second: 1.49 0.00 4.63
ids packets/second: 1.71 0.00 10.10
ids comm records/second: 3.24 0.00 12.63
ids extras/second: 0.01 0.00 0.07
fw_stats/second: 1.78 0.00 5.72
user logins/second: 0.00 0.00 0.00
file events/second: 0.00 0.00 3.25
malware events/second: 0.00 0.00 0.06
fireamp events/second: 0.00 0.00 0.00
Each row in the summary has 3 values in this order: Average, Minimum, Maximum.
If you print without the -q flag, you also see the 5-minute interval values. The summary is shown in the end.
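For example, to produce the same summary for a previous day, point the command at the dated file instead of now (the date shown here is only an example):

admin@FMC:~$ sudo perfstats -Cq < /var/sf/rna/correlator-stats/2020-09-08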
Note: Each FMC has a Maximum Flow Rate described on its datasheet. The next table contains the values per model, taken from the respective datasheets.
Model | Maximum Flow Rate (fps)
---|---
FMC 750 | 2000
FMC 1000 | 5000
FMC 1600 | 5000
FMC 2000 | 12000
FMC 2500 | 12000
FMC 2600 | 12000
FMC 4000 | 20000
FMC 4500 | 20000
FMC 4600 | 20000
FMCv | Variable
FMCv300 | 12000
Note: These values are for the aggregate of the event rate types shown in the SFDataCorrelator statistics output (the ones summed in the next example).
If you size the FMC to be prepared for the worst-case scenario (when all the maximum values happen at the same time), then the rate of events this FMC sees is 48.65 + 348.15 + 4.63 + 3.25 + 0.06 = 404.74 fps.
This total value can be compared with the value from the datasheet of the respective model.
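As an optional shortcut, you can sum the Maximum column of the event rate rows used in this calculation directly from the shell (a minimal sketch that assumes the -Cq summary format shown earlier):

admin@FMC:~$ sudo perfstats -Cq < /var/sf/rna/correlator-stats/now | grep -E 'events/second|flows/second|alerts/second' | grep -v 'dup flows' | awk '{sum += $NF} END {printf "%.2f fps (worst case)\n", sum}'

For the summary shown earlier, this returns 404.74 fps, which matches the manual calculation.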
The SFDataCorrelator process can also perform additional work on top of the received events (such as for Correlation Rules). It then stores the events in the database, which is queried to populate various parts of the FMC Graphical User Interface (GUI), such as Dashboards and Event Views.
The next diagram shows the logical components of both the Health Monitor and Disk Manager processes, as they are intertwined in the generation of disk-related health alerts.
In a nutshell, the disk manager process manages the disk usage of the box and it has its configuration files in the [/ngfw]/etc/sf/ folder. There are multiple configuration files for the disk manager process that are used under certain circumstances:
Each type of file monitored by the disk manager is assigned a Silo. Based on the amount of disk space available on the system the disk manager computes a High Water Mark (HWM) and a Low Water Mark (LWM) for each silo.
When the disk manager process drains a silo, it does so up until the point where the LWM is reached. Since events are drained per file, this threshold can be crossed.
To check the status of the silos on a sensor device you can use this command:
> show disk-manager
Silo                            Used        Minimum     Maximum
misc_fdm_logs                   0 KB        65.208 MB   130.417 MB
Temporary Files                 0 KB        108.681 MB  434.726 MB
Action Queue Results            0 KB        108.681 MB  434.726 MB
User Identity Events            0 KB        108.681 MB  434.726 MB
UI Caches                       4 KB        326.044 MB  652.089 MB
Backups                         0 KB        869.452 MB  2.123 GB
Updates                         304.367 MB  1.274 GB    3.184 GB
Other Detection Engine          0 KB        652.089 MB  1.274 GB
Performance Statistics          45.985 MB   217.362 MB  2.547 GB
Other Events                    0 KB        434.726 MB  869.452 MB
IP Reputation & URL Filtering   0 KB        543.407 MB  1.061 GB
arch_debug_file                 0 KB        2.123 GB    12.736 GB
Archives & Cores & File Logs    0 KB        869.452 MB  4.245 GB
Unified Low Priority Events     974.109 MB  1.061 GB    5.307 GB
RNA Events                      879 KB      869.452 MB  3.396 GB
File Capture                    0 KB        2.123 GB    4.245 GB
Unified High Priority Events    252 KB      3.184 GB    7.429 GB
IPS Events                      3.023 MB    2.547 GB    6.368 GB
The disk manager process runs when one of these conditions is met:
Each time the disk manager process runs it generates an entry for each of the different silos on its own log file which is located under [/ngfw]/var/log/diskmanager.log and has data in a CSV format.
Next, a sample line from the diskmanager.log file is shown, taken from a sensor that triggered the Drain of unprocessed events from Unified Low Priority Events health alert, followed by the breakdown of the respective columns:
priority_2_events,1599668981,221,4587929508,1132501868,20972020,4596,1586044534,5710966962,1142193392,110,0
Column | Value
---|---
Silo Label | priority_2_events
Time of drain (Epoch time) | 1599668981
Number of files drained | 221
Bytes drained | 4587929508
Current size of data after drain (Bytes) | 1132501868
Largest file drained (Bytes) | 20972020
Smallest file drained (Bytes) | 4596
Oldest file drained (Epoch time) | 1586044534
High watermark (Bytes) | 5710966962
Low watermark (Bytes) | 1142193392
Number of files with unprocessed events drained | 110
Diskmanager state flag | 0
This information is then read by the respective Health Monitor module to trigger the related health alert.
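To review recent drains of a specific silo from the shell, you can label the most relevant fields with a short awk one-liner (a minimal sketch; the field positions come from the table above and the silo label is just an example):

admin@firepower:~$ sudo awk -F',' '$1 == "priority_2_events" {print "drain time (epoch):", $2, "| files drained:", $3, "| bytes drained:", $4, "| files with unprocessed events:", $11}' /ngfw/var/log/diskmanager.log | tail -5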
In certain scenarios, you can want to drain a silo manually. For example, to clear disk space, a manual silo drain (instead of manual file removal) has the benefit of letting the disk manager decide which files to keep and which to delete; the disk manager keeps the most recent files for that silo.
Any silo can be drained, and this works as already described (the disk manager drains data until the amount of data goes under the LWM threshold). The command system support silo-drain is available in FTD CLISH mode, and it provides a list of the available silos (name + numeric ID).
This is an example of a manual drain of the Unified Low Priority Events silo:
> show disk-manager
Silo                            Used        Minimum     Maximum
misc_fdm_logs                   0 KB        65.213 MB   130.426 MB
Temporary Files                 0 KB        108.688 MB  434.753 MB
Action Queue Results            0 KB        108.688 MB  434.753 MB
User Identity Events            0 KB        108.688 MB  434.753 MB
UI Caches                       4 KB        326.064 MB  652.130 MB
Backups                         0 KB        869.507 MB  2.123 GB
Updates                         304.367 MB  1.274 GB    3.184 GB
Other Detection Engine          0 KB        652.130 MB  1.274 GB
Performance Statistics          1.002 MB    217.376 MB  2.547 GB
Other Events                    0 KB        434.753 MB  869.507 MB
IP Reputation & URL Filtering   0 KB        543.441 MB  1.061 GB
arch_debug_file                 0 KB        2.123 GB    12.737 GB
Archives & Cores & File Logs    0 KB        869.507 MB  4.246 GB
Unified Low Priority Events     2.397 GB    1.061 GB    5.307 GB
RNA Events                      8 KB        869.507 MB  3.397 GB
File Capture                    0 KB        2.123 GB    4.246 GB
Unified High Priority Events    0 KB        3.184 GB    7.430 GB
IPS Events                      0 KB        2.547 GB    6.368 GB

> system support silo-drain
Available Silos
1 - misc_fdm_logs
2 - Temporary Files
3 - Action Queue Results
4 - User Identity Events
5 - UI Caches
6 - Backups
7 - Updates
8 - Other Detection Engine
9 - Performance Statistics
10 - Other Events
11 - IP Reputation & URL Filtering
12 - arch_debug_file
13 - Archives & Cores & File Logs
14 - Unified Low Priority Events
15 - RNA Events
16 - File Capture
17 - Unified High Priority Events
18 - IPS Events
0 - Cancel and return
Select a Silo to drain: 14
Silo Unified Low Priority Events being drained.

> show disk-manager
Silo                            Used        Minimum     Maximum
misc_fdm_logs                   0 KB        65.213 MB   130.426 MB
Temporary Files                 0 KB        108.688 MB  434.753 MB
Action Queue Results            0 KB        108.688 MB  434.753 MB
User Identity Events            0 KB        108.688 MB  434.753 MB
UI Caches                       4 KB        326.064 MB  652.130 MB
Backups                         0 KB        869.507 MB  2.123 GB
Updates                         304.367 MB  1.274 GB    3.184 GB
Other Detection Engine          0 KB        652.130 MB  1.274 GB
Performance Statistics          1.002 MB    217.376 MB  2.547 GB
Other Events                    0 KB        434.753 MB  869.507 MB
IP Reputation & URL Filtering   0 KB        543.441 MB  1.061 GB
arch_debug_file                 0 KB        2.123 GB    12.737 GB
Archives & Cores & File Logs    0 KB        869.507 MB  4.246 GB
Unified Low Priority Events     1.046 GB    1.061 GB    5.307 GB
RNA Events                      8 KB        869.507 MB  3.397 GB
File Capture                    0 KB        2.123 GB    4.246 GB
Unified High Priority Events    0 KB        3.184 GB    7.430 GB
IPS Events                      0 KB        2.547 GB    6.368 GB
These are the main points:
For a Drain of Unprocessed events health alert to be triggered, all these conditions must be true:
For a Frequent Drain of events health alert to be triggered, these conditions must be true:
The results gathered from the disk usage module (as well as the results gathered by the other modules) are sent to the FMC via sftunnel. You can see counters for the Health Events exchanged over sftunnel with the sftunnel_status command:
TOTAL TRANSMITTED MESSAGES <3544> for Health Events service
RECEIVED MESSAGES <1772> for Health Events service
SEND MESSAGES <1772> for Health Events service
FAILED MESSAGES <0> for Health Events service
HALT REQUEST SEND COUNTER <0> for Health Events service
STORED MESSAGES for Health service (service 0/peer 0)
STATE <Process messages> for Health Events service
REQUESTED FOR REMOTE <Process messages> for Health Events service
REQUESTED FROM REMOTE <Process messages> for Health Events service
Even though most events are stored on disk, the device is configured by default to log events to the ramdisk to prevent the gradual damage to the SSD that can be caused by the constant writes and deletes of events.
In this scenario, the events are not stored under [/ngfw]/var/sf/detection_engines/*/instance-N/; instead, they are located in [/ngfw]/var/sf/detection_engines/*/instance-N/connection/, which is a symbolic link to /dev/shm/instance-N/connection. In this case, the events reside in memory (ramdisk) rather than on disk.
admin@FTD4140:~$ ls -la /ngfw/var/sf/detection_engines/b0c4a5a4-de25-11ea-8ec3-4df4ea7207e3/instance-1/connection
lrwxrwxrwx 1 sfsnort sfsnort 30 Sep 9 19:03 /ngfw/var/sf/detection_engines/b0c4a5a4-de25-11ea-8ec3-4df4ea7207e3/instance-1/connection -> /dev/shm/instance-1/connection
To verify what the device is currently configured to do, run the show log-events-to-ramdisk command from the FTD CLISH. You can also change the setting with the configure log-events-to-ramdisk <enable/disable> command:
> show log-events-to-ramdisk
Logging connection events to RAM Disk.
> configure log-events-to-ramdisk
Enable or Disable enable or disable (enable/disable)
Warning: When the configure log-events-to-ramdisk disable command is executed, two deployments must be done on the FTD so that Snort does not get stuck in the D state (Uninterruptible Sleep), which would cause a traffic outage.
This behavior is documented in the defect with Cisco bug ID CSCvz53372. With the first deployment, the Snort memory reassessment stage is skipped, which causes Snort to go into the D state. The workaround is to do another deployment with any dummy change.
When you log events to the ramdisk, the main drawback is that the respective silo has less space allocated to it and thus drains more often under the same circumstances. The next outputs show the disk manager of an FPR4140 with and without Log to Ramdisk enabled, for comparison.
Log to Ramdisk enabled.
> show disk-manager
Silo Used Minimum Maximum
Temporary Files 0 KB 903.803 MB 3.530 GB
Action Queue Results 0 KB 903.803 MB 3.530 GB
User Identity Events 0 KB 903.803 MB 3.530 GB
UI Caches 4 KB 2.648 GB 5.296 GB
Backups 0 KB 7.061 GB 17.652 GB
Updates 305.723 MB 10.591 GB 26.479 GB
Other Detection Engine 0 KB 5.296 GB 10.591 GB
Performance Statistics 19.616 MB 1.765 GB 21.183 GB
Other Events 0 KB 3.530 GB 7.061 GB
IP Reputation & URL Filtering 0 KB 4.413 GB 8.826 GB
arch_debug_file 0 KB 17.652 GB 105.914 GB
Archives & Cores & File Logs 0 KB 7.061 GB 35.305 GB
RNA Events 0 KB 7.061 GB 28.244 GB
File Capture 0 KB 17.652 GB 35.305 GB
Unified High Priority Events 0 KB 17.652 GB 30.892 GB
Connection Events 0 KB 451.698 MB 903.396 MB
IPS Events 0 KB 12.357 GB 26.479 GB
Log to Ramdisk disabled.
> show disk-manager
Silo Used Minimum Maximum
Temporary Files 0 KB 976.564 MB 3.815 GB
Action Queue Results 0 KB 976.564 MB 3.815 GB
User Identity Events 0 KB 976.564 MB 3.815 GB
UI Caches 4 KB 2.861 GB 5.722 GB
Backups 0 KB 7.629 GB 19.074 GB
Updates 305.723 MB 11.444 GB 28.610 GB
Other Detection Engine 0 KB 5.722 GB 11.444 GB
Performance Statistics 19.616 MB 1.907 GB 22.888 GB
Other Events 0 KB 3.815 GB 7.629 GB
IP Reputation & URL Filtering 0 KB 4.768 GB 9.537 GB
arch_debug_file 0 KB 19.074 GB 114.441 GB
Archives & Cores & File Logs 0 KB 7.629 GB 38.147 GB
Unified Low Priority Events 0 KB 9.537 GB 47.684 GB
RNA Events 0 KB 7.629 GB 30.518 GB
File Capture 0 KB 19.074 GB 38.147 GB
Unified High Priority Events 0 KB 19.074 GB 33.379 GB
IPS Events 0 KB 13.351 GB 28.610 GB
The smaller silo size is compensated for by the higher speed at which the events are accessed and streamed to the FMC. Although this is the better option under proper conditions, the drawback must be considered.
Are the Drain of Events health alerts only generated by Connection Events?
No.
Connection Events are the most common culprit.
Is it always advisable to disable Log to Ramdisk when a Frequent Drain health alert is seen?
No. Disable it only in excessive-logging scenarios other than DoS/DDoS, when the affected silo is the Connection Events silo, and only when it is not possible to further tune the logging settings.
If a DoS/DDoS attack causes the excessive logging, the solution is to implement DoS/DDoS protection or to eliminate the source(s) of the attack.
The Log to Ramdisk feature, which is enabled by default, reduces SSD wear, so it is strongly recommended to keep it in use.
What constitutes an unprocessed event?
Events are not individually marked as unprocessed. A file has unprocessed events when:
Its creation timestamp is higher than the timestamp field in the respective bookmark file.
or
Its creation timestamp is equal to the timestamp field in the respective bookmark file and its size is higher than the position in Bytes field on the respective bookmark file.
How does the FMC know the number of Bytes behind for a particular sensor?
The sensor sends metadata about the unified_events file name and size as well as the information on the bookmark files which gives the FMC enough information to compute the bytes behind as:
Current unified_events file size - Position in Bytes field from bookmark file + Size of all unified_events files with higher timestamp than the timestamp in the respective bookmark file.
Open the Bug Search Tool and use this query:
Revision | Publish Date | Comments
---|---|---
3.0 | 03-May-2024 | Updated Introduction, PII, Alt Text, Machine Translation, Link Targets, and Formatting.
1.0 | 25-Sep-2020 | Initial Release