Concepts
- 1: Collection Framework
- 2: Integrated Capability
- 2.1: Autotracing
- 2.2: Events
- 2.3: Metrics
1 - Collection Framework
The HUATUO framework provides three data collection modes: autotracing, event, and metrics. Together they cover different monitoring scenarios and help users gain comprehensive insight into system performance.
Collection Mode Comparison
| Mode | Type | Trigger Condition | Data Output | Use Case |
|---|---|---|---|---|
| Autotracing | Event-driven | Triggered on system anomalies | ES + Local Storage, Prometheus (optional) | Non-routine operations, triggered on anomalies |
| Event | Event-driven | Continuously running, triggered on preset thresholds | ES + Local Storage, Prometheus (optional) | Continuous operations, directly dump context |
| Metrics | Metric collection | Passive collection | Prometheus format | Monitoring system metrics |
- Autotracing
  - Type: Event-driven (tracing).
  - Function: Automatically tracks system anomalies and dumps the relevant context when they occur.
  - Features:
    - When a system anomaly occurs, autotracing is triggered automatically to dump the relevant context.
    - Data is stored to ES in real time and kept locally for later analysis and troubleshooting; it can also be exported in Prometheus format for statistics and alerting.
    - Suitable for collection with high performance overhead, where captures are triggered only when a metric exceeds a threshold or rises too quickly.
  - Integrated Features: CPU anomaly tracing (cpuidle), D-state tracing (dload), container contention (waitrate), memory allocation bursts (memburst), disk anomaly tracing (iotracing).
- Event
  - Type: Event-driven (tracing).
  - Function: Runs continuously within the system and dumps context directly when preset thresholds are met.
  - Features:
    - Unlike autotracing, event runs continuously within the system rather than being triggered by anomalies.
    - Data is likewise stored to ES and locally, and can be exported in Prometheus format.
    - Suitable for continuous monitoring and real-time analysis, enabling timely detection of abnormal behavior. The performance impact of event collection is negligible.
  - Integrated Features: soft interrupt anomalies (softirq), memory allocation anomalies (oom), soft lockups (softlockup), D-state processes (hungtask), memory reclamation (memreclaim), abnormal packet drops (dropwatch), network ingress latency (net_rx_latency).
- Metrics
  - Type: Metric collection.
  - Function: Collects performance metrics from subsystems.
  - Features:
    - Metric data can come from regular procfs collection or be derived from tracing (autotracing, event) data.
    - Output is in Prometheus format for easy integration into Prometheus monitoring systems.
    - Unlike tracing data, metrics focus primarily on system performance indicators such as CPU usage, memory usage, and network traffic.
    - Suitable for monitoring system performance metrics, supporting both real-time analysis and long-term trend observation.
  - Integrated Features: CPU (sys, usr, util, load, nr_running, etc.), memory (vmstat, memory_stat, directreclaim, asyncreclaim, etc.), IO (d2c, q2c, freeze, flush, etc.), network (arp, socket mem, qdisc, netstat, netdev, sockstat, etc.).
Dual Purpose of the Tracing Mode
Both autotracing and event belong to the tracing collection mode and serve two purposes:
- Real-time storage to ES and local storage: For tracing and analyzing anomalies, helping users quickly identify root causes.
- Output in Prometheus format: exposed as metric data integrated into Prometheus monitoring systems, providing comprehensive system monitoring capabilities.
By flexibly combining these three modes, users can comprehensively monitor system performance, capturing both contextual information during anomalies and continuous performance metrics to meet various monitoring needs.
2 - Integrated Capability
2.1 - Autotracing
HUATUO currently supports the following autotracing capabilities:
| Tracing Name | Core Function | Scenario |
|---|---|---|
| cpusys | Host sys surge detection | Service glitches caused by abnormal system load |
| cpuidle | Container CPU idle drop detection, providing call stacks, flame graphs, process context info, etc. | Abnormal container CPU usage, helping identify process hotspots |
| dload | Tracks container loadavg and process states, automatically captures D-state process call info in containers | System D-state surges are often related to unavailable resources or long-held locks; R-state process surges often indicate poor business logic design |
| waitrate | Container resource contention detection; provides info on contending containers during scheduling conflicts | Container contention can cause service glitches; existing metrics lack specific contending container details; waitrate tracing provides this info for mixed-deployment resource isolation reference |
| memburst | Records context info during sudden memory allocations | Detects short-term, large memory allocation events on the host, which may trigger direct reclaim or OOM |
| iotracing | Detects abnormal host disk I/O latency. Outputs context info like accessed filenames/paths, disk devices, inode numbers, containers, etc. | Frequent disk I/O bandwidth saturation or access surges leading to application request latency or system performance jitter |
CPUSYS
System mode CPU time reflects kernel execution overhead, including system calls, interrupt handling, kernel thread scheduling, memory management, lock contention, etc. Abnormal increases in this metric typically indicate kernel-level performance bottlenecks: frequent system calls, hardware device exceptions, lock contention, or memory reclaim pressure (e.g., kswapd direct reclaim).
When cpusys detects an anomaly in this metric, it automatically captures call stacks and generates flame graphs to help identify the root cause. It covers both sustained high CPU Sys usage and sudden Sys spikes, with the following trigger conditions:
- CPU Sys usage > Threshold A
- CPU Sys usage increase over a unit time > Threshold B
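The two conditions amount to a simple check over successive samples. The Go sketch below illustrates the idea; the threshold values and function names are illustrative assumptions, not HUATUO's actual configuration.
```go
package main

import "fmt"

// Thresholds are illustrative; the real values are configurable.
const (
	sysHighThreshold  = 50.0 // Threshold A: sustained high sys usage (%)
	sysDeltaThreshold = 20.0 // Threshold B: sys usage rise per sampling interval (%)
)

// shouldTriggerCPUSys returns true when either trigger condition holds:
// sustained high sys usage, or a sudden rise between two samples.
func shouldTriggerCPUSys(prevSys, curSys float64) bool {
	if curSys > sysHighThreshold {
		return true
	}
	if curSys-prevSys > sysDeltaThreshold {
		return true
	}
	return false
}

func main() {
	// Sys jumped from 12% to 38% within one interval: triggers on the delta rule.
	fmt.Println(shouldTriggerCPUSys(12.0, 38.0)) // true
	// Steady 10%: no trigger.
	fmt.Println(shouldTriggerCPUSys(10.0, 10.5)) // false
}
```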
CPUIDLE
In K8S container environments, a sudden drop in CPU idle time (i.e., the proportion of time the CPU is idle) usually indicates that processes within the container are excessively consuming CPU resources, potentially causing business latency, scheduling contention, or even overall system performance degradation.
cpuidle automatically triggers the capture of call stacks to generate flame graphs. Trigger conditions:
- CPU Sys usage > Threshold A
- CPU User usage > Threshold B && CPU User usage increase over unit time > Threshold C
- CPU Usage > Threshold D && CPU Usage increase over unit time > Threshold E
DLOAD
The D state is a special process state in which a process is blocked waiting for kernel or hardware resources. Unlike normal sleep (S state), D-state processes cannot be forcibly terminated (even with SIGKILL) and do not respond to interrupt signals. This state typically occurs during I/O operations (e.g., direct disk read/write) or hardware driver failures. System D-state surges are often related to unavailable resources or long-held locks, while runnable process surges often indicate poor business logic design. dload uses netlink to obtain the count of running and uninterruptible processes in a container and calculates the D-state contribution to the load over the past 1 minute via a sliding window algorithm. When the smoothed D-state load value exceeds the threshold, it triggers collection of the container runtime status and D-state process information.
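Conceptually, the smoothing resembles a loadavg-style decaying average over the netlink samples. The Go sketch below illustrates that bookkeeping under assumed constants (sample period, window, threshold); it is a simplified stand-in for dload's actual algorithm.
```go
package main

import "fmt"

const (
	samplePeriodSec = 5
	windowSec       = 60
	dloadThreshold  = 8.0 // assumed trigger threshold for the smoothed D-state load
)

// decay approximates a 1-minute moving average sampled every 5s. Real loadavg
// uses exp(-period/window); a linear factor is enough for a sketch.
var decay = 1.0 - float64(samplePeriodSec)/float64(windowSec)

// DLoad keeps the smoothed contribution of D-state (uninterruptible) tasks.
type DLoad struct{ smoothed float64 }

// Update feeds one netlink sample (count of uninterruptible tasks in the
// container) and returns true when the smoothed value crosses the threshold,
// i.e. when dload would dump container status and D-state process stacks.
func (d *DLoad) Update(nrUninterruptible int) bool {
	d.smoothed = d.smoothed*decay + float64(nrUninterruptible)*(1.0-decay)
	return d.smoothed > dloadThreshold
}

func main() {
	var d DLoad
	samples := []int{0, 1, 12, 30, 40, 35, 20, 5} // a burst of D-state tasks
	for i, s := range samples {
		if d.Update(s) {
			fmt.Printf("sample %d: smoothed D-load %.1f exceeds threshold, trigger capture\n", i, d.smoothed)
		}
	}
}
```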
MemBurst
memburst detects short-term, large memory allocation events on the host. Sudden memory allocations may trigger direct reclaim or even OOM, so context information is recorded when such allocations occur.
IOTracing
When I/O bandwidth is saturated or disk access surges suddenly, the system may experience increased request latency, performance jitter, or even overall instability due to I/O resource contention.
iotracing outputs context information—such as accessed filenames/paths, disk devices, inode numbers, and container names—during periods of high host disk load or abnormal I/O latency.
2.2 - Events
HUATUO currently supports the following exception context capture events:
| Event Name | Core Functionality | Scenarios |
|---|---|---|
| softirq | Detects delayed response or prolonged disabling of host soft interrupts, and outputs kernel call stacks, process information, etc. when soft interrupts are disabled for extended periods | This type of issue severely impacts network transmission/reception, leading to business spikes or timeout issues |
| dropwatch | Detects TCP packet loss and outputs host and network context information when packet loss occurs | This type of issue mainly causes business spikes and latency |
| net_rx_latency | Captures latency events in network receive path from driver, protocol stack, to user-space receive process | For network latency issues in the receive direction where the exact delay location is unclear, net_rx_latency calculates latency at the driver, protocol stack, and user copy paths using skb NIC ingress timestamps, filters timeout packets via preset thresholds, and locates the delay position |
| oom | Detects OOM events on the host or within containers | When OOM occurs at host level or container dimension, captures process information triggering OOM, killed process information, and container details to troubleshoot memory leaks, abnormal exits, etc. |
| softlockup | When a softlockup occurs on the system, collects target process information and CPU details, and retrieves kernel stack information from all CPUs | System softlockup events |
| hungtask | Provides count of all D-state processes in the system and kernel stack information | Used to locate transient D-state process scenarios, preserving the scene for later problem tracking |
| memreclaim | Records process information when memory reclamation exceeds time threshold | When memory pressure is excessively high, if a process requests memory at this time, it may enter direct reclamation (synchronous phase), potentially causing business process stalls. Recording the direct reclamation entry time helps assess the severity of impact on the process |
| netdev | Detects network device status changes | Network card flapping, slave abnormalities in bond environments, etc. |
| lacp | Detects LACP status changes | Detects LACP negotiation status in bond mode 4 |
Long-Term Soft Interrupt Disabling Detection
Feature Introduction
The Linux kernel contains various contexts such as process context, interrupt context, soft interrupt context, and NMI context. These contexts may share data, so to ensure data consistency and correctness, kernel code may disable soft or hard interrupts. In theory, no single interrupt- or soft-interrupt-disabled section should last very long. However, high-frequency system calls that enter kernel mode and repeatedly disable interrupts can also create a “long-term disable” effect, slowing down system response. Issues caused by long interrupt or soft interrupt disabling are very subtle, with limited troubleshooting methods, yet have significant impact, typically manifesting as receive timeouts in business applications. For this scenario, we built BPF-based detection of long hardware and soft interrupt disabling.
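On the user-space side, each BPF sample carries the disable duration and the threshold it exceeded, matching the offtime, threshold, cpu, pid, and comm fields in the records below. The Go sketch decodes such a sample from a raw byte slice; the struct layout is an assumption for illustration, not the project's actual event ABI.
```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// softirqEvent mirrors the fields seen in the uploaded record; the exact
// layout of the BPF event in HUATUO is an assumption here.
type softirqEvent struct {
	Now       uint64   // ktime when the overrun was observed (ns)
	Offtime   uint64   // how long soft interrupts stayed disabled (ns)
	Threshold uint64   // configured limit (ns), e.g. 100ms
	Pid       uint32   // offending task
	Cpu       uint32   // CPU on which it happened
	Comm      [16]byte // task name
}

func main() {
	// Fabricated raw sample for illustration only.
	ev := softirqEvent{Now: 1, Offtime: 237328905, Threshold: 100000000, Pid: 688073, Cpu: 1}
	copy(ev.Comm[:], "observe-agent")

	var buf bytes.Buffer
	binary.Write(&buf, binary.LittleEndian, &ev)

	// What a user-space reader would do with the raw perf/ringbuf sample.
	var got softirqEvent
	binary.Read(bytes.NewReader(buf.Bytes()), binary.LittleEndian, &got)
	comm := string(bytes.TrimRight(got.Comm[:], "\x00"))
	fmt.Printf("softirq disabled for %.1fms (limit %.0fms) on cpu%d by %s[%d]\n",
		float64(got.Offtime)/1e6, float64(got.Threshold)/1e6, got.Cpu, comm, got.Pid)
}
```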
Example
Below is an example of a captured instance of overly long soft interrupt disabling, automatically uploaded to ES:
{
"_index": "***_2025-06-11",
"_type": "_doc",
"_id": "***",
"_score": 0,
"_source": {
"uploaded_time": "2025-06-11T16:05:16.251152703+08:00",
"hostname": "***",
"tracer_data": {
"comm": "observe-agent",
"stack": "stack:\nscheduler_tick/ffffffffa471dbc0 [kernel]\nupdate_process_times/ffffffffa4789240 [kernel]\ntick_sched_handle.isra.8/ffffffffa479afa0 [kernel]\ntick_sched_timer/ffffffffa479b000 [kernel]\n__hrtimer_run_queues/ffffffffa4789b60 [kernel]\nhrtimer_interrupt/ffffffffa478a610 [kernel]\n__sysvec_apic_timer_interrupt/ffffffffa4661a60 [kernel]\nasm_call_sysvec_on_stack/ffffffffa5201130 [kernel]\nsysvec_apic_timer_interrupt/ffffffffa5090500 [kernel]\nasm_sysvec_apic_timer_interrupt/ffffffffa5200d30 [kernel]\ndump_stack/ffffffffa506335e [kernel]\ndump_header/ffffffffa5058eb0 [kernel]\noom_kill_process.cold.9/ffffffffa505921a [kernel]\nout_of_memory/ffffffffa48a1740 [kernel]\nmem_cgroup_out_of_memory/ffffffffa495ff70 [kernel]\ntry_charge/ffffffffa4964ff0 [kernel]\nmem_cgroup_charge/ffffffffa4968de0 [kernel]\n__add_to_page_cache_locked/ffffffffa4895c30 [kernel]\nadd_to_page_cache_lru/ffffffffa48961a0 [kernel]\npagecache_get_page/ffffffffa4897ad0 [kernel]\ngrab_cache_page_write_begin/ffffffffa4899d00 [kernel]\niomap_write_begin/ffffffffa49fddc0 [kernel]\niomap_write_actor/ffffffffa49fe980 [kernel]\niomap_apply/ffffffffa49fbd20 [kernel]\niomap_file_buffered_write/ffffffffa49fc040 [kernel]\nxfs_file_buffered_aio_write/ffffffffc0f3bed0 [xfs]\nnew_sync_write/ffffffffa497ffb0 [kernel]\nvfs_write/ffffffffa4982520 [kernel]\nksys_write/ffffffffa4982880 [kernel]\ndo_syscall_64/ffffffffa508d190 [kernel]\nentry_SYSCALL_64_after_hwframe/ffffffffa5200078 [kernel]",
"now": 5532940660025295,
"offtime": 237328905,
"cpu": 1,
"threshold": 100000000,
"pid": 688073
},
"tracer_time": "2025-06-11 16:05:16.251 +0800",
"tracer_type": "auto",
"time": "2025-06-11 16:05:16.251 +0800",
"region": "***",
"tracer_name": "softirq",
"es_index_time": 1749629116268
},
"fields": {
"time": [
"2025-06-11T08:05:16.251Z"
]
},
"_ignored": [
"tracer_data.stack"
],
"_version": 1,
"sort": [
1749629116251
]
}
The local host also stores identical data:
2025-06-11 16:05:16 *** Region=***
{
"hostname": "***",
"region": "***",
"uploaded_time": "2025-06-11T16:05:16.251152703+08:00",
"time": "2025-06-11 16:05:16.251 +0800",
"tracer_name": "softirq",
"tracer_time": "2025-06-11 16:05:16.251 +0800",
"tracer_type": "auto",
"tracer_data": {
"offtime": 237328905,
"threshold": 100000000,
"comm": "observe-agent",
"pid": 688073,
"cpu": 1,
"now": 5532940660025295,
"stack": "stack:\nscheduler_tick/ffffffffa471dbc0 [kernel]\nupdate_process_times/ffffffffa4789240 [kernel]\ntick_sched_handle.isra.8/ffffffffa479afa0 [kernel]\ntick_sched_timer/ffffffffa479b000 [kernel]\n__hrtimer_run_queues/ffffffffa4789b60 [kernel]\nhrtimer_interrupt/ffffffffa478a610 [kernel]\n__sysvec_apic_timer_interrupt/ffffffffa4661a60 [kernel]\nasm_call_sysvec_on_stack/ffffffffa5201130 [kernel]\nsysvec_apic_timer_interrupt/ffffffffa5090500 [kernel]\nasm_sysvec_apic_timer_interrupt/ffffffffa5200d30 [kernel]\ndump_stack/ffffffffa506335e [kernel]\ndump_header/ffffffffa5058eb0 [kernel]\noom_kill_process.cold.9/ffffffffa505921a [kernel]\nout_of_memory/ffffffffa48a1740 [kernel]\nmem_cgroup_out_of_memory/ffffffffa495ff70 [kernel]\ntry_charge/ffffffffa4964ff0 [kernel]\nmem_cgroup_charge/ffffffffa4968de0 [kernel]\n__add_to_page_cache_locked/ffffffffa4895c30 [kernel]\nadd_to_page_cache_lru/ffffffffa48961a0 [kernel]\npagecache_get_page/ffffffffa4897ad0 [kernel]\ngrab_cache_page_write_begin/ffffffffa4899d00 [kernel]\niomap_write_begin/ffffffffa49fddc0 [kernel]\niomap_write_actor/ffffffffa49fe980 [kernel]\niomap_apply/ffffffffa49fbd20 [kernel]\niomap_file_buffered_write/ffffffffa49fc040 [kernel]\nxfs_file_buffered_aio_write/ffffffffc0f3bed0 [xfs]\nnew_sync_write/ffffffffa497ffb0 [kernel]\nvfs_write/ffffffffa4982520 [kernel]\nksys_write/ffffffffa4982880 [kernel]\ndo_syscall_64/ffffffffa508d190 [kernel]\nentry_SYSCALL_64_after_hwframe/ffffffffa5200078 [kernel]"
}
}
Protocol Stack Packet Loss Detection
Feature Introduction
During packet transmission and reception, packets may be lost for various reasons, potentially causing business request delays or even timeouts. dropwatch uses eBPF to observe kernel network packet discards and outputs the packet-loss network context, such as source/destination addresses, source/destination ports, seq, ack_seq, pid, comm, and stack information. dropwatch mainly detects TCP-related packet loss, using preset probes to filter packets and determine the drop location for root cause analysis.
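The raw BPF sample carries addresses in network byte order; user space turns them into the readable fields shown in the example below. A minimal Go sketch of that conversion, with an assumed event layout:
```go
package main

import (
	"encoding/binary"
	"fmt"
	"net"
)

// dropEvent approximates the network fields of a dropwatch sample; the real
// struct layout in HUATUO may differ.
type dropEvent struct {
	Saddr, Daddr uint32 // IPv4 addresses in network byte order
	Sport, Dport uint16 // ports (host byte order here for simplicity)
	Seq, AckSeq  uint32
	State        uint8 // TCP state, e.g. 2 == SYN_SENT
}

var tcpStates = map[uint8]string{1: "ESTABLISHED", 2: "SYN_SENT", 10: "LISTEN"}

// ipv4String converts a big-endian u32 address into dotted-quad form.
func ipv4String(be uint32) string {
	var b [4]byte
	binary.BigEndian.PutUint32(b[:], be)
	return net.IP(b[:]).String()
}

func main() {
	// Fabricated sample roughly matching the record below.
	ev := dropEvent{
		Saddr: binary.BigEndian.Uint32(net.ParseIP("10.79.68.62").To4()),
		Daddr: binary.BigEndian.Uint32(net.ParseIP("10.179.142.26").To4()),
		Sport: 15402, Dport: 2052, Seq: 1902752773, State: 2,
	}
	fmt.Printf("drop %s:%d -> %s:%d seq=%d state=%s\n",
		ipv4String(ev.Saddr), ev.Sport, ipv4String(ev.Daddr), ev.Dport, ev.Seq, tcpStates[ev.State])
}
```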
Example
Information captured by dropwatch is automatically uploaded to ES. Below is an example where kubelet failed to send a data packet due to a device-level drop:
{
"_index": "***_2025-06-11",
"_type": "_doc",
"_id": "***",
"_score": 0,
"_source": {
"uploaded_time": "2025-06-11T16:58:15.100223795+08:00",
"hostname": "***",
"tracer_data": {
"comm": "kubelet",
"stack": "kfree_skb/ffffffff9a0cd5c0 [kernel]\nkfree_skb/ffffffff9a0cd5c0 [kernel]\nkfree_skb_list/ffffffff9a0cd670 [kernel]\n__dev_queue_xmit/ffffffff9a0ea020 [kernel]\nip_finish_output2/ffffffff9a18a720 [kernel]\n__ip_queue_xmit/ffffffff9a18d280 [kernel]\n__tcp_transmit_skb/ffffffff9a1ad890 [kernel]\ntcp_connect/ffffffff9a1ae610 [kernel]\ntcp_v4_connect/ffffffff9a1b3450 [kernel]\n__inet_stream_connect/ffffffff9a1d25f0 [kernel]\ninet_stream_connect/ffffffff9a1d2860 [kernel]\n__sys_connect/ffffffff9a0c1170 [kernel]\n__x64_sys_connect/ffffffff9a0c1240 [kernel]\ndo_syscall_64/ffffffff9a2ea9f0 [kernel]\nentry_SYSCALL_64_after_hwframe/ffffffff9a400078 [kernel]",
"saddr": "10.79.68.62",
"pid": 1687046,
"type": "common_drop",
"queue_mapping": 11,
"dport": 2052,
"pkt_len": 74,
"ack_seq": 0,
"daddr": "10.179.142.26",
"state": "SYN_SENT",
"src_hostname": "***",
"sport": 15402,
"dest_hostname": "***",
"seq": 1902752773,
"max_ack_backlog": 0
},
"tracer_time": "2025-06-11 16:58:15.099 +0800",
"tracer_type": "auto",
"time": "2025-06-11 16:58:15.099 +0800",
"region": "***",
"tracer_name": "dropwatch",
"es_index_time": 1749632295120
},
"fields": {
"time": [
"2025-06-11T08:58:15.099Z"
]
},
"_ignored": [
"tracer_data.stack"
],
"_version": 1,
"sort": [
1749632295099
]
}
The local host also stores identical data:
2025-06-11 16:58:15 Host=*** Region=***
{
"hostname": "***",
"region": "***",
"uploaded_time": "2025-06-11T16:58:15.100223795+08:00",
"time": "2025-06-11 16:58:15.099 +0800",
"tracer_name": "dropwatch",
"tracer_time": "2025-06-11 16:58:15.099 +0800",
"tracer_type": "auto",
"tracer_data": {
"type": "common_drop",
"comm": "kubelet",
"pid": 1687046,
"saddr": "10.79.68.62",
"daddr": "10.179.142.26",
"sport": 15402,
"dport": 2052,
"src_hostname": ***",
"dest_hostname": "***",
"max_ack_backlog": 0,
"seq": 1902752773,
"ack_seq": 0,
"queue_mapping": 11,
"pkt_len": 74,
"state": "SYN_SENT",
"stack": "kfree_skb/ffffffff9a0cd5c0 [kernel]\nkfree_skb/ffffffff9a0cd5c0 [kernel]\nkfree_skb_list/ffffffff9a0cd670 [kernel]\n__dev_queue_xmit/ffffffff9a0ea020 [kernel]\nip_finish_output2/ffffffff9a18a720 [kernel]\n__ip_queue_xmit/ffffffff9a18d280 [kernel]\n__tcp_transmit_skb/ffffffff9a1ad890 [kernel]\ntcp_connect/ffffffff9a1ae610 [kernel]\ntcp_v4_connect/ffffffff9a1b3450 [kernel]\n__inet_stream_connect/ffffffff9a1d25f0 [kernel]\ninet_stream_connect/ffffffff9a1d2860 [kernel]\n__sys_connect/ffffffff9a0c1170 [kernel]\n__x64_sys_connect/ffffffff9a0c1240 [kernel]\ndo_syscall_64/ffffffff9a2ea9f0 [kernel]\nentry_SYSCALL_64_after_hwframe/ffffffff9a400078 [kernel]"
}
}
Protocol Stack Receive Latency
Feature Introduction
Online business network latency issues are difficult to locate, as problems can occur in any direction or stage. For example, receive-direction latency might be caused by issues in the driver, the protocol stack, or the user program. We therefore developed the net_rx_latency detection functionality, which leverages skb NIC ingress timestamps to check latency at the driver, protocol stack, and user-space layers. When receive latency reaches a threshold, eBPF captures network context information (five-tuple, latency location, process info, etc.). The receive path is: NIC -> Driver -> Protocol Stack -> User-space Receive.
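Conceptually, each stage's latency is the difference between the skb's NIC ingress timestamp and the time the packet reaches that stage, and only packets above a preset threshold are reported. The Go sketch below illustrates that classification; the stage names (other than TO_USER_COPY, which appears in the example record) and the threshold are assumptions.
```go
package main

import (
	"fmt"
	"time"
)

// Stages of the receive path described above:
// NIC -> Driver -> Protocol Stack -> User-space receive.
const (
	ToDriver = "TO_DRIVER"
	ToStack  = "TO_PROTOCOL_STACK"
	ToUser   = "TO_USER_COPY" // matches the "where" field in the example record
)

// latencyThreshold is an assumed filter; only packets slower than this are reported.
const latencyThreshold = 100 * time.Millisecond

// classify returns the stage with the worst latency (measured from the NIC
// ingress timestamp carried in the skb) above the threshold, or "" if none.
func classify(ingress, atDriver, atStack, atUserCopy time.Time) (string, time.Duration) {
	stages := []struct {
		name string
		at   time.Time
	}{{ToDriver, atDriver}, {ToStack, atStack}, {ToUser, atUserCopy}}

	where, worst := "", time.Duration(0)
	for _, s := range stages {
		if d := s.at.Sub(ingress); d > latencyThreshold && d > worst {
			where, worst = s.name, d
		}
	}
	return where, worst
}

func main() {
	ingress := time.Now()
	// Example: the user-space copy happened ~96s after NIC ingress.
	where, lat := classify(ingress, ingress.Add(50*time.Microsecond),
		ingress.Add(300*time.Microsecond), ingress.Add(96*time.Second))
	fmt.Printf("where=%s latency_ms=%d\n", where, lat.Milliseconds())
}
```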
Example
A business container received packets from the kernel with a latency of over 90 seconds; the event was tracked via net_rx_latency. The ES query output:
{
"_index": "***_2025-06-11",
"_type": "_doc",
"_id": "***",
"_score": 0,
"_source": {
"tracer_data": {
"dport": 49000,
"pkt_len": 26064,
"comm": "nginx",
"ack_seq": 689410995,
"saddr": "10.156.248.76",
"pid": 2921092,
"where": "TO_USER_COPY",
"state": "ESTABLISHED",
"daddr": "10.134.72.4",
"sport": 9213,
"seq": 1009085774,
"latency_ms": 95973
},
"container_host_namespace": "***",
"container_hostname": "***.docker",
"es_index_time": 1749628496541,
"uploaded_time": "2025-06-11T15:54:56.404864955+08:00",
"hostname": "***",
"container_type": "normal",
"tracer_time": "2025-06-11 15:54:56.404 +0800",
"time": "2025-06-11 15:54:56.404 +0800",
"region": "***",
"container_level": "1",
"container_id": "***",
"tracer_name": "net_rx_latency"
},
"fields": {
"time": [
"2025-06-11T07:54:56.404Z"
]
},
"_version": 1,
"sort": [
1749628496404
]
}
The local host also stores records in the same format:
2025-06-11 15:54:46 Host=*** Region=*** ContainerHost=***.docker ContainerID=*** ContainerType=normal ContainerLevel=1
{
"hostname": "***",
"region": "***",
"container_id": "***",
"container_hostname": "***.docker",
"container_host_namespace": "***",
"container_type": "normal",
"container_level": "1",
"uploaded_time": "2025-06-11T15:54:46.129136232+08:00",
"time": "2025-06-11 15:54:46.129 +0800",
"tracer_time": "2025-06-11 15:54:46.129 +0800",
"tracer_name": "net_rx_latency",
"tracer_data": {
"comm": "nginx",
"pid": 2921092,
"where": "TO_USER_COPY",
"latency_ms": 95973,
"state": "ESTABLISHED",
"saddr": "10.156.248.76",
"daddr": "10.134.72.4",
"sport": 9213,
"dport": 49000,
"seq": 1009024958,
"ack_seq": 689410995,
"pkt_len": 20272
}
}
Host/Container Out of Memory
Feature Introduction
When a program requests more memory than the system or its process limits allow, the system or application can crash. This is common in scenarios involving memory leaks, big-data processing, or insufficient resource configuration. By inserting BPF hooks into the kernel OOM flow, detailed OOM context is captured and passed to user space, including information about the triggering process, the killed process, and the container involved.
Example
When an OOM occurs in a container, the captured information looks like this:
{
"_index": "***_cases_2025-06-11",
"_type": "_doc",
"_id": "***",
"_score": 0,
"_source": {
"uploaded_time": "2025-06-11T17:09:07.236482841+08:00",
"hostname": "***",
"tracer_data": {
"victim_process_name": "java",
"trigger_memcg_css": "0xff4b8d8be3818000",
"victim_container_hostname": "***.docker",
"victim_memcg_css": "0xff4b8d8be3818000",
"trigger_process_name": "java",
"victim_pid": 3218745,
"trigger_pid": 3218804,
"trigger_container_hostname": "***.docker",
"victim_container_id": "***",
"trigger_container_id": "***",
"tracer_time": "2025-06-11 17:09:07.236 +0800",
"tracer_type": "auto",
"time": "2025-06-11 17:09:07.236 +0800",
"region": "***",
"tracer_name": "oom",
"es_index_time": 1749632947258
},
"fields": {
"time": [
"2025-06-11T09:09:07.236Z"
]
},
"_version": 1,
"sort": [
1749632947236
]
}
Additionally, the oom event implements the Collector interface, which enables collecting statistics on OOM occurrences via Prometheus, distinguishing between events from the host and from containers.
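A minimal sketch of what such a Collector can look like with prometheus/client_golang, counting OOM events with a label separating host and container scope; the metric and label names here are assumptions rather than HUATUO's exported names.
```go
package main

import (
	"net/http"
	"sync"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// oomCollector counts OOM events and exposes them as a Prometheus counter,
// labelled by where the OOM happened (host or container).
type oomCollector struct {
	mu     sync.Mutex
	counts map[string]float64
	desc   *prometheus.Desc
}

func newOOMCollector() *oomCollector {
	return &oomCollector{
		counts: map[string]float64{},
		desc: prometheus.NewDesc("oom_happened_total",
			"Count of OOM events.", []string{"scope"}, nil),
	}
}

// Record is called from the BPF event path whenever an OOM event arrives.
func (c *oomCollector) Record(scope string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[scope]++
}

func (c *oomCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.desc }

func (c *oomCollector) Collect(ch chan<- prometheus.Metric) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for scope, n := range c.counts {
		ch <- prometheus.MustNewConstMetric(c.desc, prometheus.CounterValue, n, scope)
	}
}

func main() {
	oom := newOOMCollector()
	oom.Record("host")      // e.g. a host-level OOM
	oom.Record("container") // e.g. an OOM inside a container

	reg := prometheus.NewRegistry()
	reg.MustRegister(oom)
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	http.ListenAndServe(":9099", nil)
}
```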
Kernel Softlockup
Feature Introduction
Softlockup is an abnormal state detected by the Linux kernel where a kernel thread (or process) on a CPU core occupies the CPU for a long time without scheduling, preventing the system from responding normally to other tasks. Causes include kernel code bugs, CPU overload, device driver issues, and others. When a softlockup occurs in the system, information about the target process and CPU is collected, kernel stack information from all CPUs is retrieved, and the number of occurrences of the issue is recorded.
Process Blocking
Feature Introduction
A D-state process (also known as Uninterruptible Sleep) is a special process state indicating that the process is blocked while waiting for certain system resources and cannot be awakened by signals or external interrupts. Common scenarios include disk I/O operations, kernel blocking, hardware failures, etc. hungtask captures the kernel stacks of all D-state processes within the system and records the count of such processes. It is used to locate transient scenarios where D-state processes appear momentarily, enabling root cause analysis even after the scenario has resolved.
Example
{
"_index": "***_2025-06-10",
"_type": "_doc",
"_id": "8yyOV5cBGoYArUxjSdvr",
"_score": 0,
"_source": {
"uploaded_time": "2025-06-10T09:57:12.202191192+08:00",
"hostname": "***",
"tracer_data": {
"cpus_stack": "2025-06-10 09:57:14 sysrq: Show backtrace of all active CPUs\n2025-06-10 09:57:14 NMI backtrace for cpu 33\n2025-06-10 09:57:14 CPU: 33 PID: 768309 Comm: huatuo-bamai Kdump: loaded Tainted: G S W OEL 5.10.0-216.0.0.115.v1.0.x86_64 #1\n2025-06-10 09:57:14 Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.1.12 11/27/2019\n2025-06-10 09:57:14 Call Trace:\n2025-06-10 09:57:14 dump_stack+0x57/0x6e\n2025-06-10 09:57:14 nmi_cpu_backtrace.cold.0+0x30/0x65\n2025-06-10 09:57:14 ? lapic_can_unplug_cpu+0x80/0x80\n2025-06-10 09:57:14 nmi_trigger_cpumask_backtrace+0xdf/0xf0\n2025-06-10 09:57:14 arch_trigger_cpumask_backtrace+0x15/0x20\n2025-06-10 09:57:14 sysrq_handle_showallcpus+0x14/0x90\n2025-06-10 09:57:14 __handle_sysrq.cold.8+0x77/0xe8\n2025-06-10 09:57:14 write_sysrq_trigger+0x3d/0x60\n2025-06-10 09:57:14 proc_reg_write+0x38/0x80\n2025-06-10 09:57:14 vfs_write+0xdb/0x250\n2025-06-10 09:57:14 ksys_write+0x59/0xd0\n2025-06-10 09:57:14 do_syscall_64+0x39/0x80\n2025-06-10 09:57:14 entry_SYSCALL_64_after_hwframe+0x62/0xc7\n2025-06-10 09:57:14 RIP: 0033:0x4088ae\n2025-06-10 09:57:14 Code: 48 83 ec 38 e8 13 00 00 00 48 83 c4 38 5d c3 cc cc cc cc cc cc cc cc cc cc cc cc cc 49 89 f2 48 89 fa 48 89 ce 48 89 df 0f 05 <48> 3d 01 f0 ff ff 76 15 48 f7 d8 48 89 c1 48 c7 c0 ff ff ff ff 48\n2025-06-10 09:57:14 RSP: 002b:000000c000adcc60 EFLAGS: 00000212 ORIG_RAX: 0000000000000001\n2025-06-10 09:57:14 RAX: ffffffffffffffda RBX: 0000000000000013 RCX: 00000000004088ae\n2025-06-10 09:57:14 RDX: 0000000000000001 RSI: 000000000274ab18 RDI: 0000000000000013\n2025-06-10 09:57:14 RBP: 000000c000adcca0 R08: 0000000000000000 R09: 0000000000000000\n2025-06-10 09:57:14 R10: 0000000000000000 R11: 0000000000000212 R12: 000000c000adcdc0\n2025-06-10 09:57:14 R13: 0000000000000002 R14: 000000c000caa540 R15: 0000000000000000\n2025-06-10 09:57:14 Sending NMI from CPU 33 to CPUs 0-32,34-95:\n2025-06-10 09:57:14 NMI backtrace for cpu 52 skipped: idling at intel_idle+0x6f/0xc0\n2025-06-10 09:57:14 NMI backtrace for cpu 54 skipped: idling at intel_idle+0x6f/0xc0\n2025-06-10 09:57:14 NMI backtrace for cpu 7 skipped: idling at intel_idle+0x6f/0xc0\n2025-06-10 09:57:14 NMI backtrace for cpu 81 skipped: idling at intel_idle+0x6f/0xc0\n2025-06-10 09:57:14 NMI backtrace for cpu 60 skipped: idling at intel_idle+0x6f/0xc0\n2025-06-10 09:57:14 NMI backtrace for cpu 2 skipped: idling at intel_idle+0x6f/0xc0\n2025-06-10 09:57:14 NMI backtrace for cpu 21 skipped: idling at intel_idle+0x6f/0xc0\n2025-06-10 09:57:14 NMI backtrace for cpu 69 skipped: idling at intel_idle+0x6f/0xc0\n2025-06-10 09:57:14 NMI backtrace for cpu 58 skipped: idling at intel_idle+0x6f/
...
"pid": 2567042
},
"tracer_time": "2025-06-10 09:57:12.202 +0800",
"tracer_type": "auto",
"time": "2025-06-10 09:57:12.202 +0800",
"region": "***",
"tracer_name": "hungtask",
"es_index_time": 1749520632297
},
"fields": {
"time": [
"2025-06-10T01:57:12.202Z"
]
},
"_ignored": [
"tracer_data.blocked_processes_stack",
"tracer_data.cpus_stack"
],
"_version": 1,
"sort": [
1749520632202
]
}
Additionally, the hungtask event implements the Collector interface, which also enables collecting statistics on host hungtask occurrences via Prometheus.
Container/Host Memory Reclamation
Feature Introduction
When memory pressure is excessively high, if a process requests memory at this time, it may enter direct reclamation. This phase involves synchronous reclamation and may cause business process stalls. Recording the time when a process enters direct reclamation helps us assess the severity of impact from direct reclamation on that process. The memreclaim event calculates whether the same process remains in direct reclamation for over 900ms within a 1-second cycle; if so, it records the process’s contextual information.
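The 900ms-within-1s rule can be implemented by accumulating each process's direct-reclaim time and resetting the accumulator every cycle. The Go sketch below illustrates that bookkeeping; names and constants are illustrative.
```go
package main

import (
	"fmt"
	"time"
)

const (
	cycle          = time.Second            // evaluation window
	stallThreshold = 900 * time.Millisecond // per-process direct reclaim budget per cycle
)

// reclaimTracker accumulates per-pid direct reclaim time inside one cycle.
type reclaimTracker struct {
	windowStart time.Time
	perPid      map[int]time.Duration
}

func newReclaimTracker(now time.Time) *reclaimTracker {
	return &reclaimTracker{windowStart: now, perPid: map[int]time.Duration{}}
}

// Add records one direct-reclaim episode (entry/exit pair) for pid and reports
// whether that pid spent more than 900ms reclaiming within the current cycle.
func (t *reclaimTracker) Add(now time.Time, pid int, d time.Duration) bool {
	if now.Sub(t.windowStart) >= cycle {
		t.windowStart, t.perPid = now, map[int]time.Duration{} // start a new cycle
	}
	t.perPid[pid] += d
	return t.perPid[pid] > stallThreshold
}

func main() {
	now := time.Now()
	tr := newReclaimTracker(now)
	tr.Add(now.Add(100*time.Millisecond), 1896137, 500*time.Millisecond)
	hit := tr.Add(now.Add(700*time.Millisecond), 1896137, 450*time.Millisecond)
	fmt.Println("record context:", hit) // true: 950ms of direct reclaim within 1s
}
```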
Example
When a business container’s chrome process enters direct reclamation, the ES query output is as follows:
{
"_index": "***_cases_2025-06-11",
"_type": "_doc",
"_id": "***",
"_score": 0,
"_source": {
"tracer_data": {
"comm": "chrome",
"deltatime": 1412702917,
"pid": 1896137
},
"container_host_namespace": "***",
"container_hostname": "***.docker",
"es_index_time": 1749641583290,
"uploaded_time": "2025-06-11T19:33:03.26754495+08:00",
"hostname": "***",
"container_type": "normal",
"tracer_time": "2025-06-11 19:33:03.267 +0800",
"time": "2025-06-11 19:33:03.267 +0800",
"region": "***",
"container_level": "102",
"container_id": "921d0ec0a20c",
"tracer_name": "directreclaim"
},
"fields": {
"time": [
"2025-06-11T11:33:03.267Z"
]
},
"_version": 1,
"sort": [
1749641583267
]
}
Network Device Status
Feature Introduction
Network card status changes often cause severe network issues that directly impact overall host network quality, such as down/up transitions, MTU changes, etc. Taking the down state as an example, possible causes include operations by privileged processes, underlying cable issues, optical module failures, peer switch problems, etc. The netdev event is designed to detect network device status changes; it currently implements monitoring of network card down/up events and distinguishes administrator-initiated changes from those caused by underlying failures.
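One common way to observe such transitions from user space is a netlink link subscription. The Go sketch below, using github.com/vishvananda/netlink, illustrates the kind of signal involved; it is not necessarily how the netdev event is implemented internally.
```go
package main

import (
	"log"

	"github.com/vishvananda/netlink"
)

func main() {
	updates := make(chan netlink.LinkUpdate)
	done := make(chan struct{})
	defer close(done)

	// Subscribe to RTM_NEWLINK/RTM_DELLINK notifications from the kernel.
	if err := netlink.LinkSubscribe(updates, done); err != nil {
		log.Fatalf("link subscribe: %v", err)
	}

	for u := range updates {
		attrs := u.Link.Attrs()
		// OperState reflects carrier-level state (up, down, lowerlayerdown...),
		// while the IFF_UP flag reflects administrative state; comparing the two
		// helps tell admin-initiated downs from link/carrier failures.
		log.Printf("ifname=%s operstate=%s flags=%s",
			attrs.Name, attrs.OperState, attrs.Flags)
	}
}
```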
Example
When an administrator operation causes the eth1 network card to go down, the ES query event output is as follows:
{
"_index": "***_cases_2025-05-30",
"_type": "_doc",
"_id": "***",
"_score": 0,
"_source": {
"uploaded_time": "2025-05-30T17:47:50.406913037+08:00",
"hostname": "localhost.localdomain",
"tracer_data": {
"ifname": "eth1",
"start": false,
"index": 3,
"linkstatus": "linkStatusAdminDown, linkStatusCarrierDown",
"mac": "5c:6f:69:34:dc:72"
},
"tracer_time": "2025-05-30 17:47:50.406 +0800",
"tracer_type": "auto",
"time": "2025-05-30 17:47:50.406 +0800",
"region": "***",
"tracer_name": "netdev_event",
"es_index_time": 1748598470407
},
"fields": {
"time": [
"2025-05-30T09:47:50.406Z"
]
},
"_version": 1,
"sort": [
1748598470406
]
}
LACP Protocol Status
Feature Introduction
Bond is a technology provided by the Linux system kernel that bundles multiple physical network interfaces into a single logical interface. Through bonding, bandwidth aggregation, failover, or load balancing can be achieved. LACP is a protocol defined by the IEEE 802.3ad standard for dynamically managing Link Aggregation Groups (LAG). Currently, there is no elegant method to obtain physical host LACP protocol negotiation exception events. HUATUO implements the lacp event, which uses BPF to instrument key protocol paths. When a change in link aggregation status is detected, it triggers an event to record relevant information.
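The content field captured in the example below is essentially a dump of /proc/net/bonding/*. A small Go sketch of collecting that snapshot once a negotiation change has been detected (the BPF trigger itself is out of scope here):
```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// snapshotBonding reads every /proc/net/bonding/* file and concatenates the
// contents, mirroring the "content" field of the lacp event record.
func snapshotBonding() (string, error) {
	files, err := filepath.Glob("/proc/net/bonding/*")
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	for _, f := range files {
		data, err := os.ReadFile(f)
		if err != nil {
			return "", err
		}
		sb.WriteString(f + "\n")
		sb.Write(data)
	}
	return sb.String(), nil
}

func main() {
	// In HUATUO this would run when the BPF probe reports an LACP state change;
	// here we simply print the snapshot once.
	content, err := snapshotBonding()
	if err != nil {
		fmt.Fprintln(os.Stderr, "no bonding info:", err)
		return
	}
	fmt.Print(content)
}
```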
Example
When the host network card eth1 experiences physical layer down/up fluctuations, the LACP dynamic negotiation status becomes abnormal. The ES query output is as follows:
{
"_index": "***_cases_2025-05-30",
"_type": "_doc",
"_id": "***",
"_score": 0,
"_source": {
"uploaded_time": "2025-05-30T17:47:48.513318579+08:00",
"hostname": "***",
"tracer_data": {
"content": "/proc/net/bonding/bond0\nEthernet Channel Bonding Driver: v4.18.0 (Apr 7, 2025)\n\nBonding Mode: load balancing (round-robin)\nMII Status: down\nMII Polling Interval (ms): 0\nUp Delay (ms): 0\nDown Delay (ms): 0\nPeer Notification Delay (ms): 0\n/proc/net/bonding/bond4\nEthernet Channel Bonding Driver: v4.18.0 (Apr 7, 2025)\n\nBonding Mode: IEEE 802.3ad Dynamic link aggregation\nTransmit Hash Policy: layer3+4 (1)\nMII Status: up\nMII Polling Interval (ms): 100\nUp Delay (ms): 0\nDown Delay (ms): 0\nPeer Notification Delay (ms): 1000\n\n802.3ad info\nLACP rate: fast\nMin links: 0\nAggregator selection policy (ad_select): stable\nSystem priority: 65535\nSystem MAC address: 5c:6f:69:34:dc:72\nActive Aggregator Info:\n\tAggregator ID: 1\n\tNumber of ports: 2\n\tActor Key: 21\n\tPartner Key: 50013\n\tPartner Mac Address: 00:00:5e:00:01:01\n\nSlave Interface: eth0\nMII Status: up\nSpeed: 25000 Mbps\nDuplex: full\nLink Failure Count: 0\nPermanent HW addr: 5c:6f:69:34:dc:72\nSlave queue ID: 0\nSlave active: 1\nSlave sm_vars: 0x172\nAggregator ID: 1\nAggregator active: 1\nActor Churn State: none\nPartner Churn State: none\nActor Churned Count: 0\nPartner Churned Count: 0\ndetails actor lacp pdu:\n system priority: 65535\n system mac address: 5c:6f:69:34:dc:72\n port key: 21\n port priority: 255\n port number: 1\n port state: 63\ndetails partner lacp pdu:\n system priority: 200\n system mac address: 00:00:5e:00:01:01\n oper key: 50013\n port priority: 32768\n port number: 16397\n port state: 63\n\nSlave Interface: eth1\nMII Status: up\nSpeed: 25000 Mbps\nDuplex: full\nLink Failure Count: 17\nPermanent HW addr: 5c:6f:69:34:dc:73\nSlave queue ID: 0\nSlave active: 0\nSlave sm_vars: 0x172\nAggregator ID: 1\nAggregator active: 1\nActor Churn State: monitoring\nPartner Churn State: monitoring\nActor Churned Count: 2\nPartner Churned Count: 2\ndetails actor lacp pdu:\n system priority: 65535\n system mac address: 5c:6f:69:34:dc:72\n port key: 21\n port priority: 255\n port number: 2\n port state: 15\ndetails partner lacp pdu:\n system priority: 200\n system mac address: 00:00:5e:00:01:01\n oper key: 50013\n port priority: 32768\n port number: 32781\n port state: 31\n"
},
"tracer_time": "2025-05-30 17:47:48.513 +0800",
"tracer_type": "auto",
"time": "2025-05-30 17:47:48.513 +0800",
"region": "***",
"tracer_name": "lacp",
"es_index_time": 1748598468514
},
"fields": {
"time": [
"2025-05-30T09:47:48.513Z"
]
},
"_ignored": [
"tracer_data.content"
],
"_version": 1,
"sort": [
1748598468513
]
}
2.3 - Metrics
| Subsystem | Metric | Description | Unit | Dimension | Source |
|---|---|---|---|---|---|
| cpu | cpu_util_sys | Percentage of host CPU time spent running kernel-mode code | % | host | Calculated from cpuacct.stat and cpuacct.usage |
| cpu | cpu_util_usr | Percentage of host CPU time spent running user-mode code | % | host | Calculated from cpuacct.stat and cpuacct.usage |
| cpu | cpu_util_total | Total percentage of host CPU time spent running | % | host | Calculated from cpuacct.stat and cpuacct.usage |
| cpu | cpu_util_container_sys | Percentage of container CPU time spent running kernel-mode code | % | container | Calculated from cpuacct.stat and cpuacct.usage |
| cpu | cpu_util_container_usr | Percentage of container CPU time spent running user-mode code | % | container | Calculated from cpuacct.stat and cpuacct.usage |
| cpu | cpu_util_container_total | Total percentage of container CPU time spent running | % | container | Calculated from cpuacct.stat and cpuacct.usage |
| cpu | cpu_stat_container_burst_time | Cumulative wall time (in nanoseconds) that any CPU has used above quota in the respective periods | ns | container | cpu.stat |
| cpu | cpu_stat_container_nr_bursts | Number of periods burst occurs | count | container | cpu.stat |
| cpu | cpu_stat_container_nr_throttled | Number of times the group has been throttled/limited | count | container | cpu.stat |
| cpu | cpu_stat_container_exter_wait_rate | Wait rate caused by processes outside the container | % | container | Calculated from throttled_time/hierarchy_wait_sum/inner_wait_sum read from cpu.stat |
| cpu | cpu_stat_container_inner_wait_rate | Wait rate caused by processes inside the container | % | container | Calculated from throttled_time/hierarchy_wait_sum/inner_wait_sum read from cpu.stat |
| cpu | cpu_stat_container_throttle_wait_rate | Wait rate caused by throttling of the container | % | container | Calculated from throttled_time/hierarchy_wait_sum/inner_wait_sum read from cpu.stat |
| cpu | cpu_stat_container_wait_rate | Total wait rate: exter_wait_rate + inner_wait_rate + throttle_wait_rate | % | container | Calculated from throttled_time/hierarchy_wait_sum/inner_wait_sum read from cpu.stat |
| cpu | loadavg_container_container_nr_running | Number of running tasks in the container | count | container | Obtained from the kernel via netlink |
| cpu | loadavg_container_container_nr_uninterruptible | Number of uninterruptible tasks in the container | count | container | Obtained from the kernel via netlink |
| cpu | loadavg_load1 | System load average over the last 1 minute | count | host | proc fs |
| cpu | loadavg_load5 | System load average over the last 5 minutes | count | host | proc fs |
| cpu | loadavg_load15 | System load average over the last 15 minutes | count | host | proc fs |
| cpu | monsoftirq_latency | Number of NET_RX/NET_TX softirq latency events that fell into the following buckets: 0~10us, 10us~100us, 100us~1ms, 1ms~inf | count | host | Hooks the softirq event and does time statistics via BPF |
| cpu | runqlat_container_nlat_01 | Number of times the scheduling latency of a process in the container was within 0~10ms | count | container | Hooks the sched switch event and does time statistics via BPF |
| cpu | runqlat_container_nlat_02 | Number of times the scheduling latency of a process in the container was within 10~20ms | count | container | Hooks the sched switch event and does time statistics via BPF |
| cpu | runqlat_container_nlat_03 | Number of times the scheduling latency of a process in the container was within 20~50ms | count | container | Hooks the sched switch event and does time statistics via BPF |
| cpu | runqlat_container_nlat_04 | Number of times the scheduling latency of a process in the container was more than 50ms | count | container | Hooks the sched switch event and does time statistics via BPF |
| cpu | runqlat_g_nlat_01 | Number of times the scheduling latency of a process on the host was within 0~10ms | count | host | Hooks the sched switch event and does time statistics via BPF |
| cpu | runqlat_g_nlat_02 | Number of times the scheduling latency of a process on the host was within 10~20ms | count | host | Hooks the sched switch event and does time statistics via BPF |
| cpu | runqlat_g_nlat_03 | Number of times the scheduling latency of a process on the host was within 20~50ms | count | host | Hooks the sched switch event and does time statistics via BPF |
| cpu | runqlat_g_nlat_04 | Number of times the scheduling latency of a process on the host was more than 50ms | count | host | Hooks the sched switch event and does time statistics via BPF |
| cpu | reschedipi_oversell_probability | Probability that CPU overselling exists on the host where the VM is located | 0-1 | host | Hooks the scheduling IPI event and does time statistics via BPF |
| memory | buddyinfo_blocks | Kernel memory allocator information | pages | host | proc fs |
| memory | memory_events_container_watermark_inc | Counts of memory allocation watermark increasing | count | container | memory.events |
| memory | memory_events_container_watermark_dec | Counts of memory allocation watermark decreasing | count | container | memory.events |
| memory | memory_others_container_local_direct_reclaim_time | Time spent in direct reclaim during page allocation in the memory cgroup | nanosecond | container | memory.local_direct_reclaim_time |
| memory | memory_others_container_directstall_time | Memory cgroup’s direct reclaim time in try_charge | nanosecond | container | memory.directstall_stat |
| memory | memory_others_container_asyncreclaim_time | Memory cgroup’s direct reclaim time in cgroup async memory reclaim | nanosecond | container | memory.asynreclaim_stat |
| memory | priority_reclaim_kswapd | Kswapd’s reclaim stat in priority reclaiming | pages | host | proc fs |
| memory | priority_reclaim_direct | Direct reclaim stat in priority reclaiming | pages | host | proc fs |
| memory | memory_stat_container_writeback | Bytes of file/anon cache that are queued for syncing to disk | bytes | container | memory.stat |
| memory | memory_stat_container_unevictable | Bytes of memory that cannot be reclaimed (mlocked etc) | bytes | container | memory.stat |
| memory | memory_stat_container_shmem | Bytes of shmem memory | bytes | container | memory.stat |
| memory | memory_stat_container_pgsteal_kswapd | Bytes of reclaimed memory by kswapd and cswapd | bytes | container | memory.stat |
| memory | memory_stat_container_pgsteal_globalkswapd | Bytes of reclaimed memory by kswapd | bytes | container | memory.stat |
| memory | memory_stat_container_pgsteal_globaldirect | Bytes of reclaimed memory by direct reclaim during page allocation | bytes | container | memory.stat |
| memory | memory_stat_container_pgsteal_direct | Bytes of reclaimed memory by direct reclaim during page allocation and try_charge | bytes | container | memory.stat |
| memory | memory_stat_container_pgsteal_cswapd | Bytes of reclaimed memory by cswapd | bytes | container | memory.stat |
| memory | memory_stat_container_pgscan_kswapd | Bytes of scanned memory by kswapd and cswapd | bytes | container | memory.stat |
| memory | memory_stat_container_pgscan_globalkswapd | Bytes of scanned memory by kswapd | bytes | container | memory.stat |
| memory | memory_stat_container_pgscan_globaldirect | Bytes of scanned memory by direct reclaim during page allocation | bytes | container | memory.stat |
| memory | memory_stat_container_pgscan_direct | Bytes of scanned memory by direct reclaim during page allocation and try_charge | bytes | container | memory.stat |
| memory | memory_stat_container_pgscan_cswapd | Bytes of scanned memory by cswapd | bytes | container | memory.stat |
| memory | memory_stat_container_pgrefill | Bytes of memory that is scanned in active list | bytes | container | memory.stat |
| memory | memory_stat_container_pgdeactivate | Bytes of memory that is deactivated into inactive list | bytes | container | memory.stat |
| memory | memory_stat_container_inactive_file | Bytes of file-backed memory on inactive lru list. | bytes | container | memory.stat |
| memory | memory_stat_container_inactive_anon | Bytes of anonymous and swap cache memory on inactive lru list | bytes | container | memory.stat |
| memory | memory_stat_container_dirty | Bytes that are waiting to get written back to the disk | bytes | container | memory.stat |
| memory | memory_stat_container_active_file | Bytes of file-backed memory on active lru list | bytes | container | memory.stat |
| memory | memory_stat_container_active_anon | Bytes of anonymous and swap cache memory on active lru list | bytes | container | memory.stat |
| memory | mountpoint_perm_ro | Whether mountpoint is readonly or not | bool | host | proc fs |
| memory | vmstat_allocstall_normal | Host direct reclaim count on normal zone | count | host | /proc/vmstat |
| memory | vmstat_allocstall_movable | Host direct reclaim count on movable zone | count | host | /proc/vmstat |
| memory | vmstat_compact_stall | Count of memory compaction | count | host | /proc/vmstat |
| memory | vmstat_nr_active_anon | Number of anonymous pages on active lru | pages | host | /proc/vmstat |
| memory | vmstat_nr_active_file | Number of file-backed pages on active lru | pages | host | /proc/vmstat |
| memory | vmstat_nr_boost_pages | Number of pages in kswapd boosting | pages | host | /proc/vmstat |
| memory | vmstat_nr_dirty | Number of dirty pages | pages | host | /proc/vmstat |
| memory | vmstat_nr_free_pages | Number of free pages | pages | host | /proc/vmstat |
| memory | vmstat_nr_inactive_anon | Number of anonymous pages on inactive lru | pages | host | /proc/vmstat |
| memory | vmstat_nr_inactive_file | Number of file-backed pages on inactive lru | pages | host | /proc/vmstat |
| memory | vmstat_nr_kswapd_boost | Count of kswapd boosting | pages | host | /proc/vmstat |
| memory | vmstat_nr_mlock | Number of locked pages | pages | host | /proc/vmstat |
| memory | vmstat_nr_shmem | Number of shmem pages | pages | host | /proc/vmstat |
| memory | vmstat_nr_slab_reclaimable | Number of reclaimable slab pages | pages | host | /proc/vmstat |
| memory | vmstat_nr_slab_unreclaimable | Number of unreclaimable slab pages | pages | host | /proc/vmstat |
| memory | vmstat_nr_unevictable | Number of unevictable pages | pages | host | /proc/vmstat |
| memory | vmstat_nr_writeback | Number of writebacking pages | pages | host | /proc/vmstat |
| memory | vmstat_numa_pages_migrated | Number of pages in numa migrating | pages | host | /proc/vmstat |
| memory | vmstat_pgdeactivate | Number of pages which are deactivated into inactive lru | pages | host | /proc/vmstat |
| memory | vmstat_pgrefill | Number of pages which are scanned on active lru | pages | host | /proc/vmstat |
| memory | vmstat_pgscan_direct | Number of pages which are scanned in direct reclaim | pages | host | /proc/vmstat |
| memory | vmstat_pgscan_kswapd | Number of pages which are scanned in kswapd reclaim | pages | host | /proc/vmstat |
| memory | vmstat_pgsteal_direct | Number of pages which are reclaimed in direct reclaim | pages | host | /proc/vmstat |
| memory | vmstat_pgsteal_kswapd | Number of pages which are reclaimed in kswapd reclaim | pages | host | /proc/vmstat |
| memory | hungtask_happened | Count of hungtask events | count | host | performance and statistics monitoring for BPF Programs |
| memory | oom_happened | Count of oom events | count | host,container | performance and statistics monitoring for BPF Programs |
| memory | softlockup_happened | Count of softlockup events | count | host | performance and statistics monitoring for BPF Programs |
| memory | mmhostbpf_compactionstat | Time spent in memory compaction | nanosecond | host | performance and statistics monitoring for BPF Programs |
| memory | mmhostbpf_allocstallstat | Time spent in memory direct reclaim on the host | nanosecond | host | performance and statistics monitoring for BPF Programs |
| memory | mmcgroupbpf_container_directstallcount | Count of cgroup’s try_charge direct reclaim | count | container | performance and statistics monitoring for BPF Programs |
| IO | iolatency_disk_d2c | Statistics of io latency when accessing the disk, including the time consumed by the driver and hardware components | count | host | performance and statistics monitoring for BPF Programs |
| IO | iolatency_disk_q2c | Statistics of io latency for the entire io lifecycle when accessing the disk | count | host | performance and statistics monitoring for BPF Programs |
| IO | iolatency_container_d2c | Statistics of io latency when accessing the disk, including the time consumed by the driver and hardware components | count | container | performance and statistics monitoring for BPF Programs |
| IO | iolatency_container_q2c | Statistics of io latency for the entire io lifecycle when accessing the disk | count | container | performance and statistics monitoring for BPF Programs |
| IO | iolatency_disk_flush | Statistics of delay for flush operations on disk raid device | count | host | performance and statistics monitoring for BPF Programs |
| IO | iolatency_container_flush | Statistics of delay for flush operations on disk raid devices caused by containers | count | container | performance and statistics monitoring for BPF Programs |
| IO | iolatency_disk_freeze | Statistics of disk freeze events | count | host | performance and statistics monitoring for BPF Programs |
| network | tcp_mem_limit_pages | System TCP total memory size limit | pages | system | proc fs |
| network | tcp_mem_usage_bytes | The total number of bytes of TCP memory used by the system | bytes | system | tcp_mem_usage_pages * page_size |
| network | tcp_mem_usage_pages | The total size of TCP memory used by the system | pages | system | proc fs |
| network | tcp_mem_usage_percent | The percentage of TCP memory used by the system to the limit size | % | system | tcp_mem_usage_pages / tcp_mem_limit_pages |
| network | arp_entries | The number of arp cache entries | count | host,container | proc fs |
| network | arp_total | Total number of arp cache entries | count | system | proc fs |
| network | qdisc_backlog | The number of bytes queued to be sent | bytes | host | Sum over qdiscs at the same level (parent major) of a device |
| network | qdisc_bytes_total | The number of bytes sent | bytes | host | Sum over qdiscs at the same level (parent major) of a device |
| network | qdisc_current_queue_length | The number of packets queued for sending | count | host | Sum over qdiscs at the same level (parent major) of a device |
| network | qdisc_drops_total | The number of discarded packets | count | host | Sum over qdiscs at the same level (parent major) of a device |
| network | qdisc_overlimits_total | The number of times queued packets exceeded the limit | count | host | Sum over qdiscs at the same level (parent major) of a device |
| network | qdisc_packets_total | The number of packets sent | count | host | Sum over qdiscs at the same level (parent major) of a device |
| network | qdisc_requeues_total | The number of packets that were not sent successfully and were requeued | count | host | Sum over qdiscs at the same level (parent major) of a device |
| network | ethtool_hardware_rx_dropped_errors | Statistics of inbound packets dropped or errored on the interface | count | host | related to hardware drivers, such as mlx, ixgbe, bnxt_en, etc. |
| network | netdev_receive_bytes_total | Number of good received bytes | bytes | host,container | proc fs |
| network | netdev_receive_compressed_total | Number of correctly received compressed packets | count | host,container | proc fs |
| network | netdev_receive_dropped_total | Number of packets received but not processed | count | host,container | proc fs |
| network | netdev_receive_errors_total | Total number of bad packets received on this network device | count | host,container | proc fs |
| network | netdev_receive_fifo_total | Receiver FIFO error counter | count | host,container | proc fs |
| network | netdev_receive_frame_total | Receiver frame alignment errors | count | host,container | proc fs |
| network | netdev_receive_multicast_total | Multicast packets received. For hardware interfaces this statistic is commonly calculated at the device level (unlike rx_packets) and therefore may include packets which did not reach the host | count | host,container | proc fs |
| network | netdev_receive_packets_total | Number of good packets received by the interface | count | host,container | proc fs |
| network | netdev_transmit_bytes_total | Number of good transmitted bytes, corresponding to tx_packets | bytes | host,container | proc fs |
| network | netdev_transmit_carrier_total | Number of frame transmission errors due to loss of carrier during transmission | count | host,container | proc fs |
| network | netdev_transmit_colls_total | Number of collisions during packet transmissions | count | host,container | proc fs |
| network | netdev_transmit_compressed_total | Number of transmitted compressed packets | count | host,container | proc fs |
| network | netdev_transmit_dropped_total | Number of packets dropped on their way to transmission, e.g. due to lack of resources | count | host,container | proc fs |
| network | netdev_transmit_errors_total | Total number of transmit problems | count | host,container | proc fs |
| network | netdev_transmit_fifo_total | Number of frame transmission errors due to device FIFO underrun / underflow | count | host,container | proc fs |
| network | netdev_transmit_packets_total | Number of packets successfully transmitted | count | host,container | proc fs |
| network | netstat_TcpExt_ArpFilter | - | count | host,container | proc fs |
| network | netstat_TcpExt_BusyPollRxPackets | - | count | host,container | proc fs |
| network | netstat_TcpExt_DelayedACKLocked | A delayed ACK timer expires, but the TCP stack can’t send an ACK immediately due to the socket is locked by a userspace program. The TCP stack will send a pure ACK later (after the userspace program unlock the socket). When the TCP stack sends the pure ACK later, the TCP stack will also update TcpExtDelayedACKs and exit the delayed ACK mode | count | host,container | proc fs |
| network | netstat_TcpExt_DelayedACKLost | It will be updated when the TCP stack receives a packet which has been ACKed. A Delayed ACK loss might cause this issue, but it would also be triggered by other reasons, such as a packet is duplicated in the network | count | host,container | proc fs |
| network | netstat_TcpExt_DelayedACKs | A delayed ACK timer expires. The TCP stack will send a pure ACK packet and exit the delayed ACK mode | count | host,container | proc fs |
| network | netstat_TcpExt_EmbryonicRsts | Resets received for embryonic SYN_RECV sockets | count | host,container | proc fs |
| network | netstat_TcpExt_IPReversePathFilter | - | count | host,container | proc fs |
| network | netstat_TcpExt_ListenDrops | When kernel receives a SYN from a client, and if the TCP accept queue is full, kernel will drop the SYN and add 1 to TcpExtListenOverflows. At the same time kernel will also add 1 to TcpExtListenDrops. When a TCP socket is in LISTEN state, and kernel need to drop a packet, kernel would always add 1 to TcpExtListenDrops. So increase TcpExtListenOverflows would let TcpExtListenDrops increasing at the same time, but TcpExtListenDrops would also increase without TcpExtListenOverflows increasing, e.g. a memory allocation fail would also let TcpExtListenDrops increase | count | host,container | proc fs |
| network | netstat_TcpExt_ListenOverflows | When kernel receives a SYN from a client, and if the TCP accept queue is full, kernel will drop the SYN and add 1 to TcpExtListenOverflows. At the same time kernel will also add 1 to TcpExtListenDrops. When a TCP socket is in LISTEN state, and kernel need to drop a packet, kernel would always add 1 to TcpExtListenDrops. So increase TcpExtListenOverflows would let TcpExtListenDrops increasing at the same time, but TcpExtListenDrops would also increase without TcpExtListenOverflows increasing, e.g. a memory allocation fail would also let TcpExtListenDrops increase | count | host,container | proc fs |
| network | netstat_TcpExt_LockDroppedIcmps | ICMP packets dropped because socket was locked | count | host,container | proc fs |
| network | netstat_TcpExt_OfoPruned | The TCP stack tries to discard packet on the out of order queue | count | host,container | proc fs |
| network | netstat_TcpExt_OutOfWindowIcmps | ICMP pkts dropped because they were out-of-window | count | host,container | proc fs |
| network | netstat_TcpExt_PAWSActive | Packets are dropped by PAWS in Syn-Sent status | count | host,container | proc fs |
| network | netstat_TcpExt_PAWSEstab | Packets are dropped by PAWS in any status other than Syn-Sent | count | host,container | proc fs |
| network | netstat_TcpExt_PFMemallocDrop | - | count | host,container | proc fs |
| network | netstat_TcpExt_PruneCalled | The TCP stack tries to reclaim memory for a socket. After updates this counter, the TCP stack will try to collapse the out of order queue and the receiving queue. If the memory is still not enough, the TCP stack will try to discard packets from the out of order queue (and update the TcpExtOfoPruned counter) | count | host,container | proc fs |
| network | netstat_TcpExt_RcvPruned | After ‘collapse’ and discard packets from the out of order queue, if the actually used memory is still larger than the max allowed memory, this counter will be updated. It means the ‘prune’ fails | count | host,container | proc fs |
| network | netstat_TcpExt_SyncookiesFailed | The MSS decoded from the SYN cookie is invalid. When this counter is updated, the received packet won’t be treated as a SYN cookie and the TcpExtSyncookiesRecv counter won’t be updated | count | host,container | proc fs |
| network | netstat_TcpExt_SyncookiesRecv | How many reply packets of the SYN cookies the TCP stack receives | count | host,container | proc fs |
| network | netstat_TcpExt_SyncookiesSent | It indicates how many SYN cookies are sent | count | host,container | proc fs |
| network | netstat_TcpExt_TCPACKSkippedChallenge | The ACK is skipped if the ACK is a challenge ACK | count | host,container | proc fs |
| network | netstat_TcpExt_TCPACKSkippedFinWait2 | The ACK is skipped in Fin-Wait-2 status, the reason would be either PAWS check fails or the received sequence number is out of window | count | host,container | proc fs |
| network | netstat_TcpExt_TCPACKSkippedPAWS | The ACK is skipped due to PAWS (Protect Against Wrapped Sequence numbers) check fails | count | host,container | proc fs |
| network | netstat_TcpExt_TCPACKSkippedSeq | The sequence number is out of window and the timestamp passes the PAWS check and the TCP status is not Syn-Recv, Fin-Wait-2, and Time-Wait | count | host,container | proc fs |
| network | netstat_TcpExt_TCPACKSkippedSynRecv | The ACK is skipped in Syn-Recv status. The Syn-Recv status means the TCP stack receives a SYN and replies SYN+ACK | count | host,container | proc fs |
| network | netstat_TcpExt_TCPACKSkippedTimeWait | The ACK is skipped in Time-Wait status; the reason is either a PAWS check failure or an out-of-window sequence number | count | host,container | proc fs |
| network | netstat_TcpExt_TCPAbortFailed | The kernel TCP layer will send RST if the RFC2525 2.17 section is satisfied. If an internal error occurs during this process, TcpExtTCPAbortFailed will be increased | count | host,container | proc fs |
| network | netstat_TcpExt_TCPAbortOnClose | Number of sockets closed when the user-mode program has data in the buffer | count | host,container | proc fs |
| network | netstat_TcpExt_TCPAbortOnData | The TCP layer has data in flight but needs to close the connection | count | host,container | proc fs |
| network | netstat_TcpExt_TCPAbortOnLinger | When a TCP connection comes into FIN_WAIT_2 state, instead of waiting for the fin packet from the other side, kernel could send a RST and delete the socket immediately | count | host,container | proc fs |
| network | netstat_TcpExt_TCPAbortOnMemory | When an application closes a TCP connection, the kernel still needs to track the connection to let it complete the TCP disconnect process | count | host,container | proc fs |
| network | netstat_TcpExt_TCPAbortOnTimeout | This counter increases when any of the TCP timers expire. In this situation, the kernel won’t send an RST and just gives up the connection | count | host,container | proc fs |
| network | netstat_TcpExt_TCPAckCompressed | - | count | host,container | proc fs |
| network | netstat_TcpExt_TCPAutoCorking | When sending packets, the TCP layer will try to merge small packets into a bigger one | count | host,container | proc fs |
| network | netstat_TcpExt_TCPBacklogDrop | - | count | host,container | proc fs |
| network | netstat_TcpExt_TCPChallengeACK | The number of challenge acks sent | count | host,container | proc fs |
| network | netstat_TcpExt_TCPDSACKIgnoredNoUndo | When a DSACK block is invalid, one of these two counters would be updated. Which counter will be updated depends on the undo_marker flag of the TCP socket | count | host,container | proc fs |
| network | netstat_TcpExt_TCPDSACKIgnoredOld | When a DSACK block is invalid, one of these two counters would be updated. Which counter will be updated depends on the undo_marker flag of the TCP socket | count | host,container | proc fs |
| network | netstat_TcpExt_TCPDSACKOfoRecv | The TCP stack receives a DSACK, which indicates an out-of-order duplicate packet is received | count | host,container | proc fs |
| network | netstat_TcpExt_TCPDSACKOfoSent | The TCP stack receives an out of order duplicate packet, so it sends a DSACK to the sender | count | host,container | proc fs |
| network | netstat_TcpExt_TCPDSACKOldSent | The TCP stack receives a duplicate packet which has been acked, so it sends a DSACK to the sender | count | host,container | proc fs |
| network | netstat_TcpExt_TCPDSACKRecv | The TCP stack receives a DSACK, which indicates an acknowledged duplicate packet is received | count | host,container | proc fs |
| network | netstat_TcpExt_TCPDSACKUndo | Congestion window recovered without slow start using DSACK | count | host,container | proc fs |
| network | netstat_TcpExt_TCPDeferAcceptDrop | - | count | host,container | proc fs |
| network | netstat_TcpExt_TCPDelivered | - | count | host,container | proc fs |
| network | netstat_TcpExt_TCPDeliveredCE | - | count | host,container | proc fs |
| network | netstat_TcpExt_TCPFastOpenActive | When the TCP stack receives an ACK packet in the SYN-SENT status, and the ACK packet acknowledges the data in the SYN packet, the TCP stack understands that the TFO cookie was accepted by the other side, and it then updates this counter | count | host,container | proc fs |
| network | netstat_TcpExt_TCPFastOpenActiveFail | Fast Open attempts (SYN/data) failed because the remote does not accept it or the attempts timed out | count | host,container | proc fs |
| network | netstat_TcpExt_TCPFastOpenBlackhole | - | count | host,container | proc fs |
| network | netstat_TcpExt_TCPFastOpenCookieReqd | This counter indicates how many times a client wants to request a TFO cookie | count | host,container | proc fs |
| network | netstat_TcpExt_TCPFastOpenListenOverflow | When the pending fast open request number is larger than fastopenq->max_qlen, the TCP stack will reject the fast open request and update this counter | count | host,container | proc fs |
| network | netstat_TcpExt_TCPFastOpenPassive | This counter indicates how many times the TCP stack accepts the fast open request | count | host,container | proc fs |
| network | netstat_TcpExt_TCPFastOpenPassiveFail | This counter indicates how many times the TCP stack rejects the fast open request, either because the TFO cookie is invalid or because the TCP stack finds an error during the socket creation process | count | host,container | proc fs |
| network | netstat_TcpExt_TCPFastRetrans | The TCP stack wants to retransmit a packet and the congestion control state is not ‘Loss’ | count | host,container | proc fs |
| network | netstat_TcpExt_TCPFromZeroWindowAdv | The TCP receive window is set to a non-zero value from zero | count | host,container | proc fs |
| network | netstat_TcpExt_TCPFullUndo | - | count | host,container | proc fs |
| network | netstat_TcpExt_TCPHPAcks | If a packet sets the ACK flag and has no data, it is a pure ACK packet; if the kernel handles it in the fast path, TcpExtTCPHPAcks increases by 1 | count | host,container | proc fs |
| network | netstat_TcpExt_TCPHPHits | If a TCP packet has data (which means it is not a pure ACK packet) and the packet is handled in the fast path, TcpExtTCPHPHits increases by 1 | count | host,container | proc fs |
| network | netstat_TcpExt_TCPHystartDelayCwnd | The sum of CWND detected by packet delay. Dividing this value by TcpExtTCPHystartDelayDetect gives the average CWND detected by packet delay | count | host,container | proc fs |
| network | netstat_TcpExt_TCPHystartDelayDetect | How many times the packet delay threshold is detected | count | host,container | proc fs |
| network | netstat_TcpExt_TCPHystartTrainCwnd | The sum of CWND detected by ACK train length. Dividing this value by TcpExtTCPHystartTrainDetect gives the average CWND detected by ACK train length | count | host,container | proc fs |
| network | netstat_TcpExt_TCPHystartTrainDetect | How many times the ACK train length threshold is detected | count | host,container | proc fs |
| network | netstat_TcpExt_TCPKeepAlive | This counter indicates how many keepalive packets were sent. Keepalive is not enabled by default; a userspace program can enable it by setting the SO_KEEPALIVE socket option | count | host,container | proc fs |
| network | netstat_TcpExt_TCPLossFailures | Number of connections that enter the TCP_CA_Loss phase and then undergo RTO timeout | count | host,container | proc fs |
| network | netstat_TcpExt_TCPLossProbeRecovery | A packet loss is detected and recovered by TLP | count | host,container | proc fs |
| network | netstat_TcpExt_TCPLossProbes | A TLP probe packet is sent | count | host,container | proc fs |
| network | netstat_TcpExt_TCPLossUndo | - | count | host,container | proc fs |
| network | netstat_TcpExt_TCPLostRetransmit | A SACK points out that a retransmission packet is lost again | count | host,container | proc fs |
| network | netstat_TcpExt_TCPMD5Failure | - | count | host,container | proc fs |
| network | netstat_TcpExt_TCPMD5NotFound | - | count | host,container | proc fs |
| network | netstat_TcpExt_TCPMD5Unexpected | - | count | host,container | proc fs |
| network | netstat_TcpExt_TCPMTUPFail | - | count | host,container | proc fs |
| network | netstat_TcpExt_TCPMTUPSuccess | - | count | host,container | proc fs |
| network | netstat_TcpExt_TCPMemoryPressures | Number of times TCP ran low on memory | count | host,container | proc fs |
| network | netstat_TcpExt_TCPMemoryPressuresChrono | - | count | host,container | proc fs |
| network | netstat_TcpExt_TCPMinTTLDrop | - | count | host,container | proc fs |
| network | netstat_TcpExt_TCPOFODrop | The TCP layer receives an out of order packet but doesn’t have enough memory, so drops it. Such packets won’t be counted into TcpExtTCPOFOQueue | count | host,container | proc fs |
| network | netstat_TcpExt_TCPOFOMerge | The received out-of-order packet overlaps with a previous packet; the overlapping part is dropped. All TcpExtTCPOFOMerge packets are also counted into TcpExtTCPOFOQueue | count | host,container | proc fs |
| network | netstat_TcpExt_TCPOFOQueue | The TCP layer receives an out of order packet and has enough memory to queue it | count | host,container | proc fs |
| network | netstat_TcpExt_TCPOrigDataSent | Number of outgoing packets with original data (excluding retransmission but including data-in-SYN). This counter is different from TcpOutSegs because TcpOutSegs also tracks pure ACKs. TCPOrigDataSent is more useful to track the TCP retransmission rate | count | host,container | proc fs |
| network | netstat_TcpExt_TCPPartialUndo | Some erroneous retransmits were detected: a partial ACK arrived while fast retransmitting, so part of the CWND reduction could be undone | count | host,container | proc fs |
| network | netstat_TcpExt_TCPPureAcks | If a packet sets the ACK flag and has no data, it is a pure ACK packet; if the kernel handles it in the fast path, TcpExtTCPHPAcks increases by 1, and if the kernel handles it in the slow path, TcpExtTCPPureAcks increases by 1 | count | host,container | proc fs |
| network | netstat_TcpExt_TCPRcvCoalesce | When packets are received by the TCP layer and have not been read by the application, the TCP layer will try to merge them. This counter indicates how many packets are merged in such situations. If GRO is enabled, many packets are merged by GRO, and those packets are not counted into TcpExtTCPRcvCoalesce | count | host,container | proc fs |
| network | netstat_TcpExt_TCPRcvCollapsed | This counter indicates how many skbs are freed during ‘collapse’ | count | host,container | proc fs |
| network | netstat_TcpExt_TCPRenoFailures | Number of failures that enter the TCP_CA_Disorder phase and then undergo RTO | count | host,container | proc fs |
| network | netstat_TcpExt_TCPRenoRecovery | When the congestion control comes into the Recovery state, if SACK is used, TcpExtTCPSackRecovery increases by 1; if SACK is not used, TcpExtTCPRenoRecovery increases by 1. These two counters mean the TCP stack begins to retransmit the lost packets | count | host,container | proc fs |
| network | netstat_TcpExt_TCPRenoRecoveryFail | Number of connections that enter the Recovery phase and then undergo RTO | count | host,container | proc fs |
| network | netstat_TcpExt_TCPRenoReorder | A reordered packet was detected by fast recovery. This counter is only used when SACK is disabled | count | host,container | proc fs |
| network | netstat_TcpExt_TCPReqQFullDoCookies | - | count | host,container | proc fs |
| network | netstat_TcpExt_TCPReqQFullDrop | - | count | host,container | proc fs |
| network | netstat_TcpExt_TCPRetransFail | The TCP stack tries to deliver a retransmission packet to lower layers but the lower layers return an error | count | host,container | proc fs |
| network | netstat_TcpExt_TCPSACKDiscard | This counter indicates how many SACK blocks are invalid. If the invalid SACK block is caused by ACK recording, the TCP stack will only ignore it and won’t update this counter | count | host,container | proc fs |
| network | netstat_TcpExt_TCPSACKReneging | A packet was acknowledged by SACK, but the receiver has dropped this packet, so the sender needs to retransmit this packet | count | host,container | proc fs |
| network | netstat_TcpExt_TCPSACKReorder | A reordered packet was detected by SACK | count | host,container | proc fs |
| network | netstat_TcpExt_TCPSYNChallenge | The number of challenge acks sent in response to SYN packets | count | host,container | proc fs |
| network | netstat_TcpExt_TCPSackFailures | Number of failures that enter the TCP_CA_Disorder phase and then undergo RTO | count | host,container | proc fs |
| network | netstat_TcpExt_TCPSackMerged | A skb is merged | count | host,container | proc fs |
| network | netstat_TcpExt_TCPSackRecovery | When the congestion control comes into the Recovery state, if SACK is used, TcpExtTCPSackRecovery increases by 1; if SACK is not used, TcpExtTCPRenoRecovery increases by 1. These two counters mean the TCP stack begins to retransmit the lost packets | count | host,container | proc fs |
| network | netstat_TcpExt_TCPSackRecoveryFail | Number of connections that enter the Recovery phase with SACK and then undergo RTO | count | host,container | proc fs |
| network | netstat_TcpExt_TCPSackShiftFallback | A skb should be shifted or merged, but the TCP stack doesn't do it for some reason | count | host,container | proc fs |
| network | netstat_TcpExt_TCPSackShifted | A skb is shifted | count | host,container | proc fs |
| network | netstat_TcpExt_TCPSlowStartRetrans | The TCP stack wants to retransmit a packet and the congestion control state is ‘Loss’ | count | host,container | proc fs |
| network | netstat_TcpExt_TCPSpuriousRTOs | The spurious retransmission timeout detected by the F-RTO algorithm | count | host,container | proc fs |
| network | netstat_TcpExt_TCPSpuriousRtxHostQueues | When the TCP stack wants to retransmit a packet and finds that the packet is not lost in the network but has not been sent yet, the TCP stack gives up the retransmission and updates this counter. This might happen if a packet stays for too long in a qdisc or driver queue | count | host,container | proc fs |
| network | netstat_TcpExt_TCPSynRetrans | Number of SYN and SYN/ACK retransmits to break down retransmissions into SYN, fast-retransmits, timeout retransmits, etc | count | host,container | proc fs |
| network | netstat_TcpExt_TCPTSReorder | A reordered packet was detected when a hole was filled | count | host,container | proc fs |
| network | netstat_TcpExt_TCPTimeWaitOverflow | Number of TIME_WAIT sockets unable to be allocated due to limit exceeding | count | host,container | proc fs |
| network | netstat_TcpExt_TCPTimeouts | TCP timeout events | count | host,container | proc fs |
| network | netstat_TcpExt_TCPToZeroWindowAdv | The TCP receive window is set to zero from a non-zero value | count | host,container | proc fs |
| network | netstat_TcpExt_TCPWantZeroWindowAdv | Depending on current memory usage, the TCP stack tries to set the receive window to zero, but the receive window might still end up as a non-zero value | count | host,container | proc fs |
| network | netstat_TcpExt_TCPWinProbe | Number of ACK packets to be sent at regular intervals to make sure a reverse ACK packet opening back a window has not been lost | count | host,container | proc fs |
| network | netstat_TcpExt_TCPWqueueTooBig | - | count | host,container | proc fs |
| network | netstat_TcpExt_TW | TCP sockets finished time wait in fast timer | count | host,container | proc fs |
| network | netstat_TcpExt_TWKilled | TCP sockets finished time wait in slow timer | count | host,container | proc fs |
| network | netstat_TcpExt_TWRecycled | Time wait sockets recycled by time stamp | count | host,container | proc fs |
| network | netstat_Tcp_ActiveOpens | The TCP layer sends a SYN and comes into the SYN-SENT state. Every time TcpActiveOpens increases by 1, TcpOutSegs should also increase by 1 | count | host,container | proc fs |
| network | netstat_Tcp_AttemptFails | The number of times TCP connections have made a direct transition to the CLOSED state from either the SYN-SENT state or the SYN-RCVD state, plus the number of times TCP connections have made a direct transition to the LISTEN state from the SYN-RCVD state | count | host,container | proc fs |
| network | netstat_Tcp_CurrEstab | The number of TCP connections for which the current state is either ESTABLISHED or CLOSE-WAIT | count | host,container | proc fs |
| network | netstat_Tcp_EstabResets | The number of times TCP connections have made a direct transition to the CLOSED state from either the ESTABLISHED state or the CLOSE-WAIT state | count | host,container | proc fs |
| network | netstat_Tcp_InCsumErrors | Incremented when a TCP checksum failure is detected | count | host,container | proc fs |
| network | netstat_Tcp_InErrs | The total number of segments received in error (e.g., bad TCP checksums) | count | host,container | proc fs |
| network | netstat_Tcp_InSegs | The number of packets received by the TCP layer. As mentioned in RFC1213, it includes the packets received in error, such as checksum error, invalid TCP header and so on | count | host,container | proc fs |
| network | netstat_Tcp_MaxConn | The limit on the total number of TCP connections the entity can support. In entities where the maximum number of connections is dynamic, this object should contain the value -1 | count | host,container | proc fs |
| network | netstat_Tcp_OutRsts | The number of TCP segments sent containing the RST flag | count | host,container | proc fs |
| network | netstat_Tcp_OutSegs | The total number of segments sent, including those on current connections but excluding those containing only retransmitted octets | count | host,container | proc fs |
| network | netstat_Tcp_PassiveOpens | The number of times TCP connections have made a direct transition to the SYN-RCVD state from the LISTEN state | count | host,container | proc fs |
| network | netstat_Tcp_RetransSegs | The total number of segments retransmitted - that is, the number of TCP segments transmitted containing one or more previously transmitted octets | count | host,container | proc fs |
| network | netstat_Tcp_RtoAlgorithm | The algorithm used to determine the timeout value used for retransmitting unacknowledged octets | count | host,container | proc fs |
| network | netstat_Tcp_RtoMax | The maximum value permitted by a TCP implementation for the retransmission timeout, measured in milliseconds. More refined semantics for objects of this type depend upon the algorithm used to determine the retransmission timeout | count | host,container | proc fs |
| network | netstat_Tcp_RtoMin | The minimum value permitted by a TCP implementation for the retransmission timeout, measured in milliseconds. More refined semantics for objects of this type depend upon the algorithm used to determine the retransmission timeout | count | host,container | proc fs |
| network | sockstat_FRAG_inuse | - | count | host,container | proc fs |
| network | sockstat_FRAG_memory | - | pages | host,container | proc fs |
| network | sockstat_RAW_inuse | Number of RAW socket used | count | host,container | proc fs |
| network | sockstat_TCP_alloc | The number of TCP sockets that have been allocated | count | host,container | proc fs |
| network | sockstat_TCP_inuse | Established TCP socket number | count | host,container | proc fs |
| network | sockstat_TCP_mem | The total size of TCP memory used by the system | pages | system | proc fs |
| network | sockstat_TCP_mem_bytes | The total size of TCP memory used by the system | bytes | system | sockstat_TCP_mem * page_size |
| network | sockstat_TCP_orphan | Number of orphaned TCP sockets (no longer attached to a user file handle) waiting to be closed | count | host,container | proc fs |
| network | sockstat_TCP_tw | Number of TCP sockets in TIME_WAIT state | count | host,container | proc fs |
| network | sockstat_UDPLITE_inuse | - | count | host,container | proc fs |
| network | sockstat_UDP_inuse | Number of UDP socket used | count | host,container | proc fs |
| network | sockstat_UDP_mem | The total size of UDP memory used by the system | pages | system | proc fs |
| network | sockstat_UDP_mem_bytes | The total number of bytes of UDP memory used by the system | bytes | system | sockstat_UDP_mem * page_size |
| network | sockstat_sockets_used | The number of sockets used by the system | count | system | proc fs |
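
The `sockstat_TCP_mem_bytes` and `sockstat_UDP_mem_bytes` metrics above are not read directly from procfs: the kernel reports socket memory in pages, and the byte values are derived by multiplying the page count by the system page size, as noted in their source column. Below is a minimal sketch of that derivation for the TCP case; the layout of `/proc/net/sockstat` is standard Linux, but the function and variable names are illustrative only and not taken from the HuaTuo source.

```go
// Minimal sketch: derive sockstat_TCP_mem_bytes from /proc/net/sockstat.
// Names and structure are illustrative, not the HuaTuo collector itself.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readTCPMemPages parses the "TCP:" line of /proc/net/sockstat and returns
// the value of its "mem" field, which the kernel reports in pages.
func readTCPMemPages(path string) (uint64, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// Example line: "TCP: inuse 12 orphan 0 tw 3 alloc 15 mem 2"
		fields := strings.Fields(scanner.Text())
		if len(fields) == 0 || fields[0] != "TCP:" {
			continue
		}
		for i := 1; i+1 < len(fields); i += 2 {
			if fields[i] == "mem" {
				return strconv.ParseUint(fields[i+1], 10, 64)
			}
		}
	}
	return 0, fmt.Errorf("TCP mem field not found in %s", path)
}

func main() {
	pages, err := readTCPMemPages("/proc/net/sockstat")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// sockstat_TCP_mem is exported in pages; the *_bytes variant multiplies
	// by the system page size, matching the metrics table above.
	bytes := pages * uint64(os.Getpagesize())
	fmt.Printf("sockstat_TCP_mem: %d pages, sockstat_TCP_mem_bytes: %d bytes\n", pages, bytes)
}
```

The same conversion applies to `sockstat_UDP_mem_bytes`, using the `mem` field of the `UDP:` line instead.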