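Last updated: 2026-03-08, Author: HAO022
On March 5, at the Fourth Session of the 14th National People's Congress, Premier Li Qiang of the State Council stated in the Government Work Report that China will "support the building of AI open-source communities and promote a thriving open-source ecosystem." A new chapter has quietly opened in the history of open source in China.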

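The new HUATUO release open-sources hardware fault detection for AI computing scenarios!
Business pain points
Large-scale distributed model training requires coordinated training across as many as several thousand machines. Such a large, complex system involves massive compute, communication, and storage resources plus the supporting software, so the probability of a fault is high, and a fault can cause the training job to fail. In practice, each training job encounters roughly two faults per day on average, and a fault sometimes pauses the whole job for hours. Fault detection has become a major challenge for distributed training.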

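Note: the number of faults grows with training scale.
From observation, faults during training usually originate on a single machine (for example, CUDA errors or NVLink errors) but then cascade, eventually interrupting the whole job or slowing it noticeably. For example, a hardware ECC error during communication can trigger NCCL timeouts or network disconnects, leaving hundreds of machines stopped and idle at the same time.
Real-world case, PCIe slowdown: in one training job spanning 128 machines, the PCIe link on a single machine dropped from 800 Mbps to 500 Mbps. This caused severe communication congestion, utilization across thousands of V100 GPUs fell sharply, and the job ran a full 40 minutes slower.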
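Why one slow link hurts the whole job can be illustrated with a deliberately simplified model: in a synchronous step, every machine waits for the slowest transfer to finish. The sketch below is an assumption-laden illustration, not a model of NCCL or of the incident above; the function names and the fixed per-step payload are hypothetical, and only the 128 machines and the 800 and 500 Mbps figures come from the case.

```python
def sync_step_comm_time(payload_megabits: float, link_mbps: list[float]) -> float:
    """Time for one synchronous exchange: bounded by the slowest link."""
    return max(payload_megabits / bw for bw in link_mbps)


def slowdown(payload_megabits: float, healthy: list[float], degraded: list[float]) -> float:
    """Ratio of degraded to healthy per-step communication time."""
    return (sync_step_comm_time(payload_megabits, degraded)
            / sync_step_comm_time(payload_megabits, healthy))


if __name__ == "__main__":
    healthy_links = [800.0] * 128            # all 128 machines at full speed
    degraded_links = [800.0] * 127 + [500.0]  # one machine's link has degraded
    # The payload size is arbitrary here; it cancels out in the ratio.
    print(slowdown(8000.0, healthy_links, degraded_links))
```

Under this model a single degraded link out of 128 stretches every communication step by the ratio 800/500, regardless of how healthy the other 127 machines are.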
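Fault types
The table below lists common fault types, how often they occur, and the share of each fault that each kind of monitoring metric can reveal.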

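From the table above we conclude that hardware faults dominate (55.8%), among which ECC errors account for the largest share (38.9%). No single metric covers all faults, so a hardware-level fault detection capability is needed, and HUATUO (华佗) has now officially open-sourced that capability.
The latest HUATUO release supports the following hardware fault types: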
- CPU, L1/L2/L3 Cache, TLB
- Memory, ECC
- PCIe
- Network Interface Card Link
- PFC/RDMA
- ACPI
- GPU MetaX
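In practice
HUATUO uses event triggers to pick up, in real time, the fault information reported by each hardware module: fault type, device identifier, error message, timestamp, and so on.
NIC faults: the fault information is stored on the server where the HUATUO component is deployed, under huatuo-local/netdev_event, and in the configured Elasticsearch storage service. The locally stored record has the following format: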
{
  "hostname": "your-host-name",
  "region": "xxx",
  "uploaded_time": "2026-03-05T18:28:39.153438921+08:00",
  "time": "2026-03-05 18:28:39.153 +0800",
  "tracer_name": "netdev_event",
  "tracer_time": "2026-03-05 18:28:39.153 +0800",
  "tracer_type": "auto",
  "tracer_data": {
    "ifname": "eth0",
    "index": 2,
    "linkstatus": "linkstatus_admindown",
    "mac": "5c:6f:11:11:11:11",
    "start": false
  }
}
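The linkstatus field can take the following values:
linkstatus_adminup # administrator brought the NIC up, e.g. via ip link set dev eth0 up
linkstatus_admindown # administrator shut the NIC down, e.g. via ip link set dev eth0 down
linkstatus_carrierup # physical link restored
linkstatus_carrierdown # physical link failure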
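A consumer of these records will probably want to treat administrative changes differently from carrier changes. The following is a minimal Python sketch, not HUATUO code: the function name classify_netdev_event and the severity labels are assumptions made for illustration, and only the field names and the four linkstatus values come from the record format shown here.

```python
import json

# The four values are from the list above; the severity labels are
# illustrative assumptions, not something HUATUO defines.
LINK_STATUS_SEVERITY = {
    "linkstatus_adminup": "info",
    "linkstatus_admindown": "info",
    "linkstatus_carrierup": "recovered",
    "linkstatus_carrierdown": "fault",
}


def classify_netdev_event(raw: str) -> dict:
    """Parse one netdev_event record (JSON text) and attach a severity label."""
    record = json.loads(raw)
    if record.get("tracer_name") != "netdev_event":
        raise ValueError("not a netdev_event record")
    data = record["tracer_data"]
    status = data["linkstatus"]
    return {
        "host": record["hostname"],
        "ifname": data["ifname"],
        "linkstatus": status,
        # Unlisted values fall back to "unknown" instead of raising.
        "severity": LINK_STATUS_SEVERITY.get(status, "unknown"),
    }


SAMPLE = """
{
  "hostname": "your-host-name",
  "tracer_name": "netdev_event",
  "tracer_data": {"ifname": "eth0", "index": 2,
                  "linkstatus": "linkstatus_carrierdown",
                  "mac": "5c:6f:11:11:11:11", "start": false}
}
"""

if __name__ == "__main__":
    print(classify_netdev_event(SAMPLE))
```

Separating the admin and carrier cases keeps planned maintenance (someone running ip link set ... down) from paging the on-call engineer the way a real physical link failure should.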
NIC faults, hardware packet-drop metric:
huatuo_bamai_buddyinfo_blocks{host="hostname",region="xxx",device="eth0",driver="ixgbe"} 0
NIC RDMA PFC network congestion:
# HELP huatuo_bamai_netdev_dcb_pfc_received_total count of the received pfc frames
# TYPE huatuo_bamai_netdev_dcb_pfc_received_total counter
huatuo_bamai_netdev_dcb_pfc_received_total{device="enp6s0f0np0",host="hostname",prio="0",region="xxx"} 0
huatuo_bamai_netdev_dcb_pfc_received_total{device="enp6s0f0np0",host="hostname",prio="1",region="xxx"} 0
huatuo_bamai_netdev_dcb_pfc_received_total{device="enp6s0f0np0",host="hostname",prio="2",region="xxx"} 0
huatuo_bamai_netdev_dcb_pfc_received_total{device="enp6s0f0np0",host="hostname",prio="3",region="xxx"} 0
huatuo_bamai_netdev_dcb_pfc_received_total{device="enp6s0f0np0",host="hostname",prio="4",region="xxx"} 0
huatuo_bamai_netdev_dcb_pfc_received_total{device="enp6s0f0np0",host="hostname",prio="5",region="xxx"} 0
huatuo_bamai_netdev_dcb_pfc_received_total{device="enp6s0f0np0",host="hostname",prio="6",region="xxx"} 0
huatuo_bamai_netdev_dcb_pfc_received_total{device="enp6s0f0np0",host="hostname",prio="7",region="xxx"} 0
# HELP huatuo_bamai_netdev_dcb_pfc_send_total count of the sent pfc frames
# TYPE huatuo_bamai_netdev_dcb_pfc_send_total counter
huatuo_bamai_netdev_dcb_pfc_send_total{device="enp6s0f0np0",host="hostname",prio="0",region="xxx"} 0
huatuo_bamai_netdev_dcb_pfc_send_total{device="enp6s0f0np0",host="hostname",prio="1",region="xxx"} 0
huatuo_bamai_netdev_dcb_pfc_send_total{device="enp6s0f0np0",host="hostname",prio="2",region="xxx"} 0
huatuo_bamai_netdev_dcb_pfc_send_total{device="enp6s0f0np0",host="hostname",prio="3",region="xxx"} 0
huatuo_bamai_netdev_dcb_pfc_send_total{device="enp6s0f0np0",host="hostname",prio="4",region="xxx"} 0
huatuo_bamai_netdev_dcb_pfc_send_total{device="enp6s0f0np0",host="hostname",prio="5",region="xxx"} 0
huatuo_bamai_netdev_dcb_pfc_send_total{device="enp6s0f0np0",host="hostname",prio="6",region="xxx"} 0
huatuo_bamai_netdev_dcb_pfc_send_total{device="enp6s0f0np0",host="hostname",prio="7",region="xxx"} 0
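Because these are counters, the absolute value matters less than how fast it grows between scrapes. As a rough sketch of that idea (the helper names parse_pfc and pfc_deltas are hypothetical, and the parser handles only the simple label syntax seen in the samples above, not the full Prometheus exposition format), two scrapes can be compared per device and priority like this:

```python
import re

# Matches lines such as: metric_name{label="v",...} 12
SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_pfc(text: str) -> dict:
    """Return {(metric, device, prio): value} for PFC counters in one scrape."""
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE lines
            continue
        m = SAMPLE_LINE.match(line)
        if not m or "_dcb_pfc_" not in m["name"]:
            continue
        labels = dict(LABEL.findall(m["labels"]))
        key = (m["name"], labels.get("device"), labels.get("prio"))
        out[key] = float(m["value"])
    return out


def pfc_deltas(before: str, after: str) -> dict:
    """Counters that increased between two scrapes, with the increase."""
    old, new = parse_pfc(before), parse_pfc(after)
    return {k: v - old.get(k, 0.0) for k, v in new.items() if v > old.get(k, 0.0)}
```

In a Prometheus deployment the same comparison is normally written with rate() or increase() over the counter instead of hand-rolled parsing; the sketch only shows what those functions compute for a pair of samples.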
Linux kernel RAS hardware fault metric:
huatuo_bamai_ras_hw_total{host="hostname",region="xxx"} 0
{
  "hostname": "your-host-name",
  "region": "xxx",
  "uploaded_time": "2026-03-01T15:41:13.027353585+08:00",
  "time": "2026-03-01 15:41:13.027 +0800",
  "tracer_name": "ras",
  "tracer_time": "2026-03-01 15:41:13.027 +0800",
  "tracer_type": "auto",
  "tracer_data": {
    "dev": "MEM",
    "event": "EDAC",
    "type": "CORRECTED",
    "timestamp": 26870134986481080,
    "info": "1 CORRECTED err: memory read error on CPU_SrcID#0_MC#1_Chan#0_DIMM#0 (mc: 1 location:0:0:-1 address: 0x3ddc84140 grain:32 syndrome:0x0 err_code:0x0101:0x0090 ProcessorSocketId:0x0 MemoryControllerId:0x1 PhysicalRankId:0x0 Row:0x15da Column:0x100 Bank:0x3 BankGroup:0x1 retry_rd_err_log[0001a209 00000000 00800000 0440d001 000015da] correrrcnt[0001 0000 0000 0000 0000 0000 0000 0000])"
  }
}
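The info string packs several useful fields into free text. As a hedged sketch of how a downstream tool might pull out the DIMM label and physical address, the function below applies a regular expression written against this one sample; the name summarize_ras is hypothetical, and the pattern is an assumption that may not match info strings produced by other drivers or event sources.

```python
import re

# Pattern derived from the single EDAC sample above; an assumption, not a spec.
INFO = re.compile(
    r"(?P<count>\d+) (?P<severity>\w+) err: (?P<msg>.+?) on (?P<label>\S+) "
    r"\(.*?address: (?P<address>0x[0-9a-fA-F]+)"
)


def summarize_ras(record: dict) -> dict:
    """Condense a 'ras' tracer record into a few structured fields."""
    data = record["tracer_data"]
    summary = {"dev": data["dev"], "event": data["event"], "type": data["type"]}
    m = INFO.search(data["info"])
    if m:  # leave the summary minimal when the text has an unfamiliar shape
        summary.update(
            count=int(m["count"]),
            message=m["msg"],
            dimm=m["label"],
            address=int(m["address"], 16),
        )
    return summary
```

Applied to the record above, this yields the DIMM label CPU_SrcID#0_MC#1_Chan#0_DIMM#0 and the address as an integer, which is enough to group repeated corrected errors by DIMM.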
Memory fault
Below is the memory fault monitoring for a production machine, along with the detailed fault information.

Faulting device: MEM
Fault type: CORRECTED
Tracer trigger time: 2026-03-01 15:41:13.027
Trigger source: EDAC
Device-reported details:
memory read error on CPU_SrcID#0_MC#1_Chan#0_DIMM#0 (mc: 1 location:0:0:-1 address: 0x3ddc84140 grain:32 syndrome:0x0 err_code:0x0101:0x0090 ProcessorSocketId:0x0 MemoryControllerId:0x1 PhysicalRankId:0x0 Row:0x15da Column:0x100 Bank:0x3 BankGroup:0x1 retry_rd_err_log[0001a209 00000000 00800000 0440d001 000015da] correrrcnt[0001 0000 0000 0000 0000 0000 0000 0000])
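NIC fault
Below is a NIC physical-link-down event that HUATUO captured on a production server (ultimately traced to a weak optical signal from the transceiver module). The fault caused jitter in the metrics of production applications.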

Optical module problem confirmed on the switch side

How it works
The overall HUATUO architecture is shown below; it supports fault checks for a variety of hardware types.

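HUATUO builds on the Linux kernel's MCE and RAS facilities and uses eBPF to capture key hardware events and collect hardware device information. RAS support in the Linux kernel has evolved continuously, with more tracepoints introduced gradually since the 2.6 kernel series. This lightweight, event-driven approach covers the vast majority of high-frequency hardware fault scenarios. In addition, HUATUO supports checks for PFC/RDMA and NIC physical link state.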
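To see which hardware-error tracepoints a given kernel exposes, one can inspect tracefs. The sketch below is an illustration only: the article does not say which tracepoints HUATUO attaches to, the function name list_hw_tracepoints is hypothetical, and the "ras" and "mce" event groups are simply the ones mainline kernels use for events such as ras:mc_event and mce:mce_record. Reading tracefs usually requires root.

```python
from pathlib import Path

# Common tracefs mount points; which one exists depends on the system.
TRACEFS_ROOTS = (Path("/sys/kernel/tracing"), Path("/sys/kernel/debug/tracing"))
# Event groups that carry hardware error events in mainline kernels.
GROUPS = ("ras", "mce")


def list_hw_tracepoints(root: Path) -> list[str]:
    """Return 'group:event' names found under <root>/events for GROUPS."""
    found = []
    for group in GROUPS:
        group_dir = root / "events" / group
        if not group_dir.is_dir():
            continue
        for entry in sorted(group_dir.iterdir()):
            if entry.is_dir():  # each event is a directory; skip control files
                found.append(f"{group}:{entry.name}")
    return found


if __name__ == "__main__":
    for root in TRACEFS_ROOTS:
        try:
            names = list_hw_tracepoints(root)
        except PermissionError:
            continue  # tracefs is typically readable by root only
        if names:
            print("\n".join(names))
            break
```

The set of events differs by kernel version and configuration, which is one reason an event-driven collector benefits from checking what is available before attaching.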

Closing notes:
- HUATUO (华佗) is an operating-system deep-observability project open-sourced by DiDi and incubated under the CCF.
