Last updated: 2025-09-28, Author: hao022
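Background
In the traditional single-uplink NIC architecture, a fault in the NIC, the fiber, or the switch, or an upgrade of any of them, can interrupt the network. In a dual-uplink architecture, the server's two ports connect to different switches and are bonded into a single logical port that serves traffic. When either uplink or either access-layer switch fails, traffic switches automatically to the other port, so training jobs are not interrupted. This design removes the single point of failure inherent in the single-uplink architecture and makes the interconnect more robust. The dual-uplink architecture also makes hot upgrades of switches possible, which simplifies network operations and feature rollout.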
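Stacking
Stacking connects two switches with physical cables and virtualizes them into one logical device for forwarding. It can expand the port count, increase bandwidth, and simplify the network topology. Stacking also has drawbacks: a software upgrade can interrupt services, and a stacking-cable failure can split the stack. Note that a server can connect to the switches over a single cable or over multiple bonded cables.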
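De-stacking
- The de-stacked architecture removes the physical link between the two switches to reduce the failure risk. The server connects to two independent switches over two links and aggregates them with LACP. The two switches must be configured with the same system ID and different port IDs to "trick" the server into forming one aggregated link.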
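In addition, the Linux kernel has to broadcast ARP/ND packets on the server side so that both switches receive them and can keep their sessions in sync. The Linux kernel already supports this feature: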
net: bonding: add broadcast_neighbor option for 802.3ad
net: bonding: add broadcast_neighbor netlink option
net: bonding: send peer notify when failure recovery
net: bonding: add broadcast_neighbor option for 802.3ad

Stacking technology is a type of technology used to expand ports on Ethernet switches. It is widely used as a common access method in large-scale Internet data center architectures. Years of practice have proved that stacking technology has advantages and disadvantages in high-reliability network architecture scenarios. For instance, in stacking networking arch, conventional switch system upgrades require multiple stacked devices to restart at the same time. Therefore, it is inevitable that the business will be interrupted for a while. It is for this reason that "no-stacking" in data centers has become a trend. Additionally, when the stacking link connecting the switches fails or is abnormal, the stack will split. Although it is not common, it still happens in actual operation. The problem is that after the split, it is equivalent to two switches with the same configuration appearing in the network, causing network configuration conflicts and ultimately interrupting the services carried by the stacking system.

To improve network stability, "non-stacking" solutions have been increasingly adopted, particularly by public cloud providers and tech companies like Alibaba, Tencent, and Didi.

"non-stacking" is a method of mimicing switch stacking that convinces a LACP peer, bonding in this case, connected to a set of "non-stacked" switches that all of its ports are connected to a single switch (i.e., LACP aggregator), as if those switches were stacked. This enables the LACP peer's ports to aggregate together, and requires (a) special switch configuration, described in the linked article, and (b) modifications to the bonding 802.3ad (LACP) mode to send all ARP/ND packets across all ports of the active aggregator.

Note that, with multiple aggregators, the current broadcast mode logic will send only packets to the selected aggregator(s).

    +-----------+   +-----------+
    |  switch1  |   |  switch2  |
    +-----------+   +-----------+
          ^           ^
          |           |
       +-----------------+
       |   bond4 lacp    |
       +-----------------+
          |           |
          | NIC1      | NIC2
       +-----------------+
       |     server      |
       +-----------------+

- https://www.ruijie.com/fr-fr/support/tech-gallery/de-stack-data-center-network-architecture/

Cc: Jay Vosburgh <jv@jvosburgh.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Andrew Lunn <andrew+netdev@lunn.ch>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Tonghao Zhang <tonghao@bamaicloud.com>
Signed-off-by: Zengbing Tu <tuzengbing@didiglobal.com>
Link: https://patch.msgid.link/84d0a044514157bb856a10b6d03a1028c4883561.1751031306.git.tonghao@bamaicloud.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
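One way to check from the server side that the two independent switches really look like a single LACP peer is to inspect the bond's 802.3ad state. The sketch below parses text laid out like /proc/net/bonding/<bond> and checks that every slave joined the same aggregator and sees the same partner system MAC with distinct partner port numbers. The sample text, the MAC addresses, and the helper names `parse_slaves` and `looks_like_one_aggregator` are assumptions made for illustration, and comparing only port numbers is a simplification of the full port ID.

```python
# Illustrative excerpt in the layout of /proc/net/bonding/<bond> (802.3ad mode).
# Every value here is made up for the example.
SAMPLE = """\
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
MII Status: up

Slave Interface: eth0
MII Status: up
Aggregator ID: 1
details actor lacp pdu:
    system mac address: 02:00:00:00:00:aa
    port number: 1
details partner lacp pdu:
    system mac address: 02:00:00:00:00:01
    port number: 1

Slave Interface: eth1
MII Status: up
Aggregator ID: 1
details actor lacp pdu:
    system mac address: 02:00:00:00:00:aa
    port number: 2
details partner lacp pdu:
    system mac address: 02:00:00:00:00:01
    port number: 2
"""


def parse_slaves(text):
    """Return {slave: {"aggregator", "partner_mac", "partner_port"}}."""
    slaves = {}
    current = None   # dict of the slave currently being parsed
    section = None   # "actor" or "partner" once a "details ..." line is seen
    for raw in text.splitlines():
        line = raw.strip()
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip().lower(), value.strip()
        if key == "slave interface":
            current = slaves.setdefault(value, {})
            section = None
        elif current is None:
            continue  # bond-wide header lines before the first slave
        elif key.startswith("details "):
            section = "partner" if "partner" in key else "actor"
        elif key == "aggregator id":
            current["aggregator"] = value
        elif section == "partner" and key == "system mac address":
            current["partner_mac"] = value
        elif section == "partner" and key == "port number":
            current["partner_port"] = value
    return slaves


def looks_like_one_aggregator(slaves):
    """True if all slaves share one aggregator and one partner system MAC,
    and each slave sees a different partner port number."""
    aggs = {s.get("aggregator") for s in slaves.values()}
    macs = {s.get("partner_mac") for s in slaves.values()}
    ports = [s.get("partner_port") for s in slaves.values()]
    return (
        len(slaves) > 0
        and len(aggs) == 1 and None not in aggs
        and len(macs) == 1 and None not in macs
        and None not in ports and len(set(ports)) == len(ports)
    )


if __name__ == "__main__":
    print(looks_like_one_aggregator(parse_slaves(SAMPLE)))
```

On a real host the same functions could be fed the contents of /proc/net/bonding/<bond> instead of `SAMPLE`; the field names used here follow the layout the bonding driver prints in 802.3ad mode, but they should be checked against the kernel version in use.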
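- To avoid traffic black holes (for example, when one physical downlink of a switch fails, the server cannot detect the event and keeps sending data to the faulty switch, which causes packet loss), all traffic must be steered through Layer 3 routing. Even traffic between servers under the same access-layer (ToR) switch must be forwarded through Layer 3 routing, so that direct Layer 2 forwarding cannot create a black hole.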
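- To enable Layer 3 routing, ARP/ND proxy must be turned on and both switches must be configured with the same gateway MAC address. To avoid multicast interference, multicast must also be disabled on the switches.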
interface Vlanif3
 description K8S-Access
 ip address 10.0.8.1 255.255.248.0
 arp timeout 90
 arp proxy anyway enable
 mac-address 0000-8989-0001
 arp delete trigger link-down enable
 arp direct-route enable
 arp direct-route delay 120
 arp direct-route preference 1
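- The switch converts the server ARP/ND entries it learns into host routes and advertises them to the upper-layer switches, which ensures that return traffic is delivered correctly.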
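To make the ARP-to-host-route step concrete, the sketch below uses Python's standard `ipaddress` module to derive the subnet from the gateway address and mask in the Vlanif3 configuration, then turns a table of learned ARP entries into /32 host routes. The ARP entries and the helper name `host_routes` are hypothetical; this only illustrates the idea and does not model the switch's actual implementation.

```python
import ipaddress

# Gateway address and mask taken from the Vlanif3 configuration.
gateway = ipaddress.ip_interface("10.0.8.1/255.255.248.0")
subnet = gateway.network  # the /21 that contains the gateway

# Hypothetical ARP entries learned from servers (IP -> MAC).
learned_arp = {
    "10.0.15.200": "02:00:00:00:00:c8",
    "10.0.8.11": "02:00:00:00:00:11",
    "10.0.16.5": "02:00:00:00:00:05",  # outside the subnet, ignored below
}


def host_routes(arp_table, network):
    """Return one /32 route per learned address that lies inside `network`."""
    routes = []
    for ip in sorted(arp_table, key=ipaddress.ip_address):
        addr = ipaddress.ip_address(ip)
        if addr in network:
            routes.append(ipaddress.ip_network(f"{addr}/32"))
    return routes


if __name__ == "__main__":
    print(subnet)
    for route in host_routes(learned_arp, subnet):
        print(route)
```

Because each switch advertises only the hosts it has actually learned, the upper layer can send return traffic to whichever switch still has a working path to a given server.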
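- For high availability, uplink interface monitoring must be enabled: if all uplink interfaces of a switch go down, the switch automatically shuts its downlink interfaces; when an uplink recovers, the downlink interfaces are re-enabled after a delay. On the server side, when a slave link goes down, the bonding module immediately removes it from the transmit queue; after the link recovers, it is used again only once LACP negotiation succeeds.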
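The uplink-monitoring rule can be read as a small state machine. The sketch below is a minimal model of it under stated assumptions: the class name `UplinkTracker`, the `recover_delay` parameter, and the choice to restart the delay whenever all uplinks drop again are hypothetical, and real switches expose this behavior through vendor-specific configuration rather than code like this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UplinkTracker:
    recover_delay: float               # hypothetical delay in seconds
    downlinks_enabled: bool = True
    _recovered_at: Optional[float] = None

    def update(self, uplinks_up: int, now: float) -> bool:
        """Feed the current number of working uplinks; return downlink state."""
        if uplinks_up == 0:
            # All uplinks lost: shut the downlinks at once so that servers
            # see link-down and stop sending traffic to this switch.
            self.downlinks_enabled = False
            self._recovered_at = None
        elif not self.downlinks_enabled:
            # At least one uplink is back: start (or continue) the delay.
            if self._recovered_at is None:
                self._recovered_at = now
            if now - self._recovered_at >= self.recover_delay:
                self.downlinks_enabled = True
                self._recovered_at = None
        return self.downlinks_enabled


if __name__ == "__main__":
    tracker = UplinkTracker(recover_delay=30.0)
    for t, up in [(0.0, 2), (1.0, 0), (2.0, 1), (20.0, 1), (32.0, 1)]:
        print(t, up, tracker.update(up, t))
```

The delay gives the switch time to relearn routes before servers start sending traffic to it again.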
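- Monitoring failures. The HUATUO project provides LACP and NETDEV-HW monitoring, which makes it possible to detect failures early. In our production environment we have detected many cases of LACP protocol anomalies caused by unavailable AOC modules, and of network outages caused by packet drops in network hardware.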
# git clone https://github.com/ccfos/huatuo.git
# ls -l huatuo/bpf/
total 84
-rw-r--r-- 1 root root 1677 Aug 11 01:44 cgroup_css_events.c
-rw-r--r-- 1 root root 1450 Aug 31 22:24 cgroup_css_gather.c
-rw-r--r-- 1 root root 4539 Aug 11 01:44 dropwatch.c
-rw-r--r-- 1 root root  801 Aug 29 10:16 hungtask.c
drwxr-xr-x 2 root root 4096 Aug 29 10:16 include
-rw-r--r-- 1 root root  648 Aug 11 01:44 lacp.c
-rw-r--r-- 1 root root 2155 Aug 11 01:44 memory_free_compact.c
-rw-r--r-- 1 root root 1424 Aug 11 01:44 memory_reclaim.c
-rw-r--r-- 1 root root 1421 Aug 11 01:44 memory_reclaim_events.c
-rw-r--r-- 1 root root 1044 Sep  6 03:09 netdev_hw.c
-rw-r--r-- 1 root root 4219 Aug 11 01:44 netrecvlat.c
-rw-r--r-- 1 root root 1553 Sep  5 09:26 oom.c
-rw-r--r-- 1 root root 1491 Aug 18 07:06 perf.c
-rw-r--r-- 1 root root 8683 Aug 11 01:44 runqlat_tracing.c
-rw-r--r-- 1 root root 1879 Sep 26 04:32 softirq.c
-rw-r--r-- 1 root root 3429 Aug 11 01:44 softirq_tracing.c
-rw-r--r-- 1 root root 1117 Aug 29 10:16 softlockup.c
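HUATUO implements this monitoring with the BPF programs listed above (for example lacp.c and netdev_hw.c). As a much simpler stand-in that conveys the idea of spotting hardware-level drops, the sketch below polls the standard per-device counters under /sys/class/net/<dev>/statistics and reports which ones grew between two snapshots. This is not HUATUO's implementation; the function names and the counter selection are assumptions made for illustration.

```python
from pathlib import Path

# Standard netdev statistics files; this particular selection is an assumption.
COUNTERS = ("rx_dropped", "rx_missed_errors", "rx_errors",
            "tx_dropped", "tx_errors")


def read_counters(dev, root="/sys/class/net"):
    """Read the selected counters for `dev`; missing files are skipped."""
    stats = Path(root) / dev / "statistics"
    values = {}
    for name in COUNTERS:
        path = stats / name
        if path.exists():
            values[name] = int(path.read_text().strip())
    return values


def growing_counters(before, after):
    """Return {counter: delta} for counters that increased between snapshots."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] > before[name]
    }


if __name__ == "__main__":
    import time

    dev = "eth0"  # hypothetical device name
    first = read_counters(dev)
    time.sleep(1)
    print(growing_counters(first, read_counters(dev)))
```

A steadily growing drop or error counter on one slave while the other stays flat is the kind of signal that points at a bad optical module or cable on that uplink.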
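Other notes
- lacp edge-port: an interesting feature. When LACP cannot be used, or before the initial configuration is in place, it acts as a degraded mode for the dual-uplink network and keeps the network available.
- Mirror On Drop (MOD): in day-to-day operations, if you cannot determine where packets are being dropped (tcpdump on the server shows them sent, but the switch does not receive them), try configuring MOD on the switch.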
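Closing:
- Follow the WeChat official account 【HUATUO 开源技术】 and leave a message, or scan the QR code to add a staff member on WeChat, and we will invite you to the user group (please include your name and organization).
- Code repository: https://github.com/ccfos/huatuo
- Official website: https://huatuo.tech/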