Kernel Packet Drop Tracing with eBPF Dropwatch

2026-07-01 HAO022, 胡洪 NET

In cloud-native environments, packet loss is one of the most difficult problems to troubleshoot. Application logs report connection timeouts; TCP retransmission rates spike. But where in the kernel protocol stack does the drop occur? Is it a firewall rule, a routing lookup failure, or a receive buffer overflow? Traditional tools such as the classic dropwatch utility or perf record -e skb:kfree_skb report only the symbol address of the drop site. They carry no packet context, so it is hard to tell whether a dropped packet belongs to business traffic.

The HUATUO project’s dropwatch tool instruments the kernel tracepoint tracepoint/skb/kfree_skb and collects the complete context of each drop event in a single eBPF probe: the IP five-tuple, process information, network device, MAC addresses, and the kernel call stack that triggered the drop. More importantly, it supports tcpdump-style kernel-side filtering. The filter logic is compiled into eBPF bytecode at load time, so only matching packets are reported to userspace. This prevents a flood of irrelevant drop events from drowning out the real signal.

This article walks through a real troubleshooting scenario and introduces dropwatch’s workflow, filtering capabilities, output structure, and integration with huatuo-bamai for continuous packet drop observability.

Starting from a Production Incident

In a Kubernetes cluster, the P99 latency of an order service periodically spikes from 50 ms to 2 s. Application logs show sporadic connection reset by peer errors, and netstat -s on the node reveals a steady increase in TCPLostRetransmit.

The traditional troubleshooting path is: run tcpdump to confirm whether the peer received the SYN, check rules with iptables -L, verify the route with ip route get, and finally run perf record -e skb:kfree_skb to see kernel drop points. The entire process takes at least 30 minutes, and the symbol addresses from perf output must be manually resolved against the kernel symbol table.

With dropwatch, a single command captures the evidence needed to locate the root cause:

sudo dropwatch --bpf-path bpf/dropwatch.o \
  --filter "tcp and port 8080" \
  --device eth0 \
  --duration 60 \
  --output json

Each event in the output carries the full drop context: which process owned the packet, the source and destination IP and port, and the kernel call stack at the point of the drop. Combined with jq filtering, the root cause can be found in seconds.
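As a rough illustration of that last step, the sketch below does in Python what a jq pipeline would do: it reads NDJSON events and counts drops per process and drop site. It assumes that the stack field lists frames innermost-first and that the drop site is the first frame that is not a kfree_skb* helper. The sample events are made up for the example, and drop_site and summarize are hypothetical helper names, not part of dropwatch.

```python
import json
from collections import Counter


def drop_site(stack):
    """Return the first frame that is not a kfree_skb* helper.

    Assumes frames are newline-separated and ordered innermost-first.
    """
    for frame in (stack or "").splitlines():
        frame = frame.strip()
        if frame and not frame.startswith("kfree_skb"):
            return frame
    return "unknown"


def summarize(lines):
    """Count events per (comm, drop site) from an iterable of NDJSON lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        ev = json.loads(line)
        counts[(ev.get("comm"), drop_site(ev.get("stack")))] += 1
    return counts


# Made-up sample events; real input would be the output of `--output json`.
sample = [
    json.dumps({"comm": "nginx", "stack": "kfree_skb\ntcp_v4_rcv\nip_protocol_deliver_rcu"}),
    json.dumps({"comm": "nginx", "stack": "kfree_skb\ntcp_v4_rcv\nip_protocol_deliver_rcu"}),
    json.dumps({"comm": "curl", "stack": "kfree_skb\nip_output"}),
]

for (comm, site), n in summarize(sample).most_common():
    print(f"{n:5d}  {comm:<10} {site}")
```

In a real session the same function could be fed from sys.stdin instead of the sample list, so that the dropwatch output is piped straight into the script.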

Kernel-Side Filtering: Blocking Irrelevant Drops in Kernel Space

In production, kfree_skb fires at a high rate. Neighbor table cleanup, socket closure, and driver DMA completion all trigger it, but none of these are data-plane packet drops. Filtering every event in userspace is CPU-intensive and can cause genuinely critical drop signals to be lost.

Dropwatch’s approach is to push the filter logic into kernel space. It includes a pure-Go pcap compiler (internal/pcapfilter) that compiles tcpdump-style filter expressions into eBPF bytecode at program load time. The bytecode is embedded directly into the probe logic. Only matching packets are submitted to userspace via the perf ring buffer.

Supported Filter Primitives

internal/pcapfilter implements a practical subset of the standard tcpdump syntax:

Protocol and Direction

ip   ip6   tcp   udp   icmp   icmp6   arp
ip proto tcp      ip6 proto udp
src host 10.0.0.1    dst host 10.0.0.1
src port 443         dst port 8080
src net 192.168.1.0/24

Boolean Composition

tcp and port 443
tcp or udp
not arp
ip and src net 192.168.1.0/24 and tcp dst port 3306

Unsupported expressions include byte offsets (tcp[tcpflags]), numeric protocol numbers (ip proto 6), and port ranges (portrange). Refer to the usage documentation for the complete list of unsupported features.
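Because an unsupported expression is only rejected when the probe is loaded, it can be handy to check filters beforehand, for example in a deployment script. The sketch below is a heuristic pre-flight check for the three unsupported constructs named above. It is a regular-expression approximation written for this article, not the logic of internal/pcapfilter, and the function name unsupported_features is hypothetical.

```python
import re

# Heuristic patterns for the unsupported constructs listed in the article.
UNSUPPORTED = [
    (re.compile(r"\b\w+\s*\["), "byte-offset expression"),
    (re.compile(r"\bproto\s+\d+\b"), "numeric protocol number"),
    (re.compile(r"\bportrange\b"), "port range"),
]


def unsupported_features(expr):
    """Return labels of unsupported constructs found in a filter expression."""
    return [label for pattern, label in UNSUPPORTED if pattern.search(expr)]


for expr in (
    "tcp and port 443",
    "tcp[tcpflags] & tcp-syn != 0",
    "ip proto 6",
    "portrange 8000-8080",
):
    problems = unsupported_features(expr)
    print(f"{expr!r}: {'ok' if not problems else ', '.join(problems)}")
```

An empty result does not guarantee that the expression compiles; it only means that none of these three known-unsupported forms was detected.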

Filter Examples

# Monitor only TCP drops on a target port
--filter "tcp and port 443"

# Exclude noise from the metadata service IP
--filter "tcp and not host 169.254.169.254"

# Pinpoint drops from a specific subnet to a database port
--filter "src net 192.168.1.0/24 and tcp dst port 3306"

The --filter expression and the device filters (--device / --device-excluded) are orthogonal. When a filter expression and a device filter are both specified, an event must pass both (AND semantics). SKBs that lack a net_device are dropped by a device whitelist (--device) and passed through by a device blacklist (--device-excluded). This behavior is especially useful in container veth scenarios.
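The combined decision can be modelled in a few lines. The sketch below is a userspace illustration of the semantics described above, not the eBPF code itself. The function names are hypothetical, and dev_name is None stands in for an SKB without a net_device.

```python
def device_allows(dev_name, allow=None, deny=None):
    """Apply the device whitelist or blacklist to one event.

    dev_name is None when the SKB has no associated net_device.
    """
    if allow and deny:
        raise ValueError("--device and --device-excluded are mutually exclusive")
    if allow:
        # Whitelist: an SKB without a device can never match a listed name.
        return dev_name is not None and dev_name in allow
    if deny:
        # Blacklist: an SKB without a device is not on the list, so it passes.
        return dev_name is None or dev_name not in deny
    return True


def should_report(filter_matches, dev_name, allow=None, deny=None):
    """An event is reported only if it passes both filters (AND semantics)."""
    return filter_matches and device_allows(dev_name, allow, deny)


print(should_report(True, "eth0", allow={"eth0"}))   # passes both filters
print(should_report(True, None, allow={"eth0"}))     # whitelist drops device-less SKB
print(should_report(True, None, deny={"lo"}))        # blacklist lets it through
print(should_report(False, "eth0", allow={"eth0"}))  # packet filter did not match
```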

Event Output: Structured Drop Context

Each drop event is emitted as one NDJSON record (one JSON object per line). The example below is pretty-printed for readability and shows the core fields:

{
  "observed_timestamp": "2026-06-20T08:30:12.123456789Z",
  "comm": "nginx",
  "pid": 4521,
  "container_id": "a1b2c3d4",
  "netdev_name": "eth0",
  "packet_eth_proto": "0x0800",
  "packet_len": 74,
  "layers": {
    "label": "IPv4/TCP",
    "ether": { "src": "aa:bb:cc:dd:ee:ff", "dst": "11:22:33:44:55:66", "type": "IPv4" },
    "ipv4": { "src": "10.0.1.5", "dst": "10.0.2.10", "ttl": 64, "protocol": "TCP" },
    "tcp": { "sport": 54321, "dport": 8080, "flags": "SYN", "seq": 0, "ack": 0, "window": 65535, "sk_state": "SYN_SENT" }
  },
  "stack": "kfree_skb\ntcp_v4_rcv\ntcp_rcv_established\n..."
}

Notable fields:

layers: Layered protocol parsing results. Each layer is a nested object; missing layers are omitted automatically. Instead of relying on a separate protocol enumeration, downstream consumers determine the protocol combination by checking field existence (e.g., ev.Layers.TCP != nil).
stack: The full kernel call stack that triggered the drop, newline-separated. This is the key to distinguishing drop causes. Even among TCP drops, a stack containing tcp_v4_rcv (receive path) and one containing ip_output (transmit path) point to entirely different troubleshooting directions.
container_id: Populated by huatuo-bamai from the memory cgroup CSS address or the network namespace cookie, providing a direct association to a Kubernetes Pod.
netdev_linkstatus: An array of network device link flags (not shown in the example above), useful for determining whether the device is in a carrier-down or dormant state.
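For consumers that read the JSON directly, the "check field existence" convention for layers might look like the sketch below. It uses only field names from the example event above; present_layers and five_tuple are hypothetical helpers written for illustration.

```python
def present_layers(event):
    """List the protocol layers present in an event, in document order.

    Layers are nested objects; scalar entries such as "label" are skipped.
    """
    layers = event.get("layers") or {}
    return [name for name, body in layers.items() if isinstance(body, dict)]


def five_tuple(event):
    """Return (proto, src, sport, dst, dport), or None for non-TCP/UDP events."""
    layers = event.get("layers") or {}
    ip = layers.get("ipv4") or layers.get("ipv6")
    if ip is None:
        return None
    for proto in ("tcp", "udp"):
        l4 = layers.get(proto)
        if l4 is not None:
            return (proto, ip["src"], l4["sport"], ip["dst"], l4["dport"])
    return None


# Trimmed copy of the example event shown earlier.
event = {
    "comm": "nginx",
    "layers": {
        "label": "IPv4/TCP",
        "ether": {"src": "aa:bb:cc:dd:ee:ff", "dst": "11:22:33:44:55:66", "type": "IPv4"},
        "ipv4": {"src": "10.0.1.5", "dst": "10.0.2.10", "ttl": 64, "protocol": "TCP"},
        "tcp": {"sport": 54321, "dport": 8080, "flags": "SYN"},
    },
}

print(present_layers(event))
print(five_tuple(event))
```

Because missing layers are simply absent, an ARP event yields None from five_tuple rather than raising an error.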

Complete field list for each layers entry:

Layer	Fields
`ether`	`src`, `dst`, `type`, `len` (802.3 frames only)
`ipv4`	`version`, `ihl`, `tos`, `len`, `id`, `flags`, `frag_offset`, `ttl`, `protocol`, `checksum`, `src`, `dst`
`ipv6`	`version`, `traffic_class`, `flow_label`, `len`, `next_header`, `hop_limit`, `src`, `dst`
`tcp`	`sport`, `dport`, `seq`, `ack`, `data_offset`, `flags`, `window`, `checksum`, `urgent`, `sk_state`
`udp`	`sport`, `dport`, `len`, `checksum`
`icmp`	`type`, `code`, `checksum`, `id`, `seq`
`arp`	`addr_type`, `protocol`, `hw_address_size`, `prot_address_size`, `operation`, `sender_mac`, `sender_ip`, `target_mac`, `target_ip`

Userspace Filtering with jq

When kernel-side filtering is not precise enough, jq can be used to further refine the JSON output:

# Show only RST packets
sudo dropwatch --bpf-path bpf/dropwatch.o --output json 2>/dev/null \
  | jq 'select(.layers.tcp.flags == "RST")'

# Exclude events whose call stack contains ip_finish_output (typically normal routing output)
sudo dropwatch --output json --duration 10 --bpf-path bpf/dropwatch.o \
  | jq -c 'select(.stack | test("ip_finish_output") | not)'

# Show metadata only, without the call stack (useful for quick drop distribution statistics)
sudo dropwatch --output json --duration 10 --bpf-path bpf/dropwatch.o \
  | jq -c 'del(.stack)'

jq -c compacts each event into a single line of JSON, which makes the output easy to save as an NDJSON file or to pipe into further commands.

Noise Filtering: Suppressing Non-Data-Plane Drops

Not all kfree_skb events represent actual data-plane packet drops. The following three categories are filtered by huatuo-bamai under the default configuration:

Pattern	Stack Signature	Why It Is Not a Drop
TCP `CLOSE_WAIT` + `skb_rbtree_purge`	Frame `skb_rbtree_purge/`	Normal socket closure: the kernel frees in-flight SKBs in `CLOSE_WAIT` sockets
ARP/neighbor table expiry	Frame `neigh_invalidate/`	Neighbor table entry cleanup; does not affect active data flows
bnxt NIC TX completion	`bnxt_tx_int/`	Broadcom bnxt driver frees SKBs after DMA transmission completes; normal behavior

In the huatuo-bamai configuration, noise rules are managed through EventTracing.IssuesList:

[EventTracing]
    IssuesList = [["neigh_invalidate", "neigh_invalidate"], ["bnxt_tx_int", "bnxt_tx_int"]]

[EventTracing.Dropwatch]
    Filter = "tcp"
    MaxEventsPerSecond = 100

If you need to observe neighbor-table-related drops (for example, when debugging an ARP storm), remove the corresponding rule from IssuesList.
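To make the effect of such rules concrete, the sketch below models a noise rule as a name plus a pattern that is searched for in the event's call stack. This reading of the two-element entries in IssuesList is an assumption made for illustration; the actual matching semantics in huatuo-bamai may differ. matching_issue is a hypothetical helper.

```python
import re

# Mirrors the IssuesList example above; each entry is assumed to be
# [rule name, pattern matched against the call stack].
ISSUES_LIST = [
    ["neigh_invalidate", "neigh_invalidate"],
    ["bnxt_tx_int", "bnxt_tx_int"],
]


def matching_issue(stack, issues=ISSUES_LIST):
    """Return the name of the first noise rule whose pattern occurs in the stack."""
    for name, pattern in issues:
        if re.search(pattern, stack):
            return name
    return None


# Made-up stacks for the example.
noise = "kfree_skb\nneigh_invalidate\nneigh_timer_handler"
real = "kfree_skb\ntcp_v4_rcv\nip_protocol_deliver_rcu"

print(matching_issue(noise))  # matched by the neighbor-table rule, so suppressed
print(matching_issue(real))   # no rule matches, so the event is kept

# Removing the neighbor rule lets the same event through again.
print(matching_issue(noise, issues=[ISSUES_LIST[1]]))
```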

Integration with huatuo-bamai: Continuous Packet Drop Observability

Running dropwatch once is suitable for ad-hoc troubleshooting, but production environments require continuous packet drop monitoring. huatuo-bamai launches dropwatch as a child process and uses --output-storage to send events over a Unix socket to its built-in processing pipeline, which ultimately stores them in Elasticsearch.

dropwatch \
  --bpf-path <CoreBpfDir>/dropwatch.o \
  --output-storage /var/run/huatuo/events.sock \
  --filter "tcp"

Once stored, you can:

Timeline correlation: Overlay drop events on application latency curves in Grafana to align drop timestamps with latency spikes.
Multi-dimensional aggregation: Aggregate drop distributions by container_id, netdev_name, or layers.label to quickly pinpoint the problematic Pod or device.
Historical analysis: Retain the full context of drop events, enabling post-incident root cause analysis without needing to reproduce the issue.
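In production this kind of aggregation would normally run as a query in Elasticsearch or a Grafana panel. The sketch below shows the same idea locally over a handful of exported events, grouping by any of the dimensions mentioned above. The events are made up, and aggregate is a hypothetical helper.

```python
from collections import Counter


def aggregate(events, dimension):
    """Count events by a dotted field path such as "layers.label"."""
    counts = Counter()
    for ev in events:
        value = ev
        for part in dimension.split("."):
            value = value.get(part) if isinstance(value, dict) else None
        counts[value if value is not None else "(missing)"] += 1
    return counts


# Made-up events with only the fields needed for grouping.
events = [
    {"container_id": "a1b2c3d4", "netdev_name": "eth0", "layers": {"label": "IPv4/TCP"}},
    {"container_id": "a1b2c3d4", "netdev_name": "eth0", "layers": {"label": "IPv4/TCP"}},
    {"container_id": "e5f6a7b8", "netdev_name": "veth1", "layers": {"label": "IPv4/UDP"}},
    {"netdev_name": "eth0", "layers": {"label": "ARP"}},
]

for dim in ("container_id", "netdev_name", "layers.label"):
    print(dim, aggregate(events, dim).most_common())
```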

Command-Line Reference

Parameter	Default	Description
`--bpf-path <path>`	required	Path to the eBPF object file
`--filter <expr>`	(none)	tcpdump-style filter expression
`--device <names>`	(none)	Device whitelist, comma-separated (e.g., `eth0,eth1`)
`--device-excluded <names>`	(none)	Device blacklist; mutually exclusive with `--device`
`--duration <n>`	0	Exit after N seconds (0 = run indefinitely)
`--output <json|text>`	`text`	Output format; ignored when `--output-storage` is set
`--output-storage <path>`	(none)	Send events to huatuo-bamai via Unix socket
`--task-id <id>`	(none)	Task ID to associate with this session
`--max-events-per-second <n>`	0	Global report rate limit; 0 = no limit

Summary

The difficulty of troubleshooting packet loss stems from the fact that drops occur deep inside the kernel protocol stack. Traditional tools lack packet context, making it hard to determine whether a drop affects business traffic. dropwatch instruments the kfree_skb tracepoint with an eBPF probe to collect the full drop context, and compiles filter logic into eBPF bytecode that executes in kernel space. This delivers precise drop events to operators while keeping the overhead on the host low.

Combined with huatuo-bamai’s continuous collection pipeline, dropwatch evolves from an ad-hoc troubleshooting tool into long-term packet drop observability infrastructure. It correlates kernel drops with application anomalies on a timeline, transforming the packet loss diagnosis process from “guess and verify” into “observe and attribute.”
