AI-Driven SRE On-Call Schedule Optimization: From Rotation to Intelligent Scheduling


1. The Limits of Rule-of-Thumb Scheduling: Fatigue Accumulation and Skill Mismatch

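SRE teams usually schedule on-call with a simple rotation: each engineer takes a one-week shift, then hands off to the next person. Rotation ignores two key factors. The first is fatigue accumulation: an engineer who has been handling late-night alerts several nights in a row makes noticeably worse judgments the next day. The second is skill mismatch: a database failure calls for DBA skills and a network failure for a network engineer, but under rotation whoever is up takes the page, and the mismatch lengthens troubleshooting time.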

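The core idea of AI-driven schedule optimization is to generate the schedule dynamically from three inputs (the historical distribution of alert types, each engineer's skill profile, and each engineer's fatigue state) so as to minimize the "skill mismatch rate" and the "fatigue risk".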

2. Architecture and Constraint Model

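Schedule optimization is a multi-constraint optimization problem. Hard constraints include a limit on continuous on-call duty (the same person may not be on call for more than 2 consecutive weeks), time-zone coverage (7×24 coverage), and statutory rest days. Soft constraints include maximizing skill match, minimizing fatigue risk, and respecting preferences.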

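```mermaid
flowchart TB
    A["Scheduling inputs"] --> B["Engineer skill profiles"]
    A --> C["Historical alert distribution"]
    A --> D["Constraints"]
    B --> E["Skill match score"]
    C --> F["Alert type forecast"]
    D --> G["Hard constraints: continuity / coverage / rest"]
    D --> H["Soft constraints: fatigue / preference / fairness"]
    E --> I["Schedule optimization engine"]
    F --> I
    G --> I
    H --> I
    I --> J["Optimized schedule"]
    J --> K["Skill match rate: > 80%"]
    J --> L["Fatigue risk: < 20%"]
    J --> M["Fairness: variance < 10%"]
```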

3. Production-Grade Implementation: The Schedule Optimization Engine

```python
# schedule_optimizer.py: AI-driven SRE on-call schedule optimization engine
from dataclasses import dataclass, field
from datetime import date, timedelta
from enum import Enum
from typing import Dict, List, Optional, Set


class SkillType(Enum):
    DATABASE = "database"
    NETWORK = "network"
    KUBERNETES = "kubernetes"
    APPLICATION = "application"
    SECURITY = "security"


@dataclass
class Engineer:
    id: str
    name: str
    skills: Dict[SkillType, float]  # proficiency per skill, 0-1
    timezone: str
    last_oncall_end: Optional[date] = None  # end date of the previous on-call stint
    oncall_count_this_month: int = 0
    preference: Dict[str, bool] = field(default_factory=dict)  # e.g. {"no_night": True}


@dataclass
class OncallSlot:
    date: date
    shift: str  # "day" / "night"
    required_skills: List[SkillType]  # skills most likely to be needed in this slot


@dataclass
class OncallAssignment:
    slot: OncallSlot
    engineer: Engineer
    skill_match_score: float
    fatigue_risk: float


class ScheduleOptimizer:
    """Schedule optimization engine: greedy, multi-constraint slot assignment."""

    def optimize(
        self,
        engineers: List[Engineer],
        slots: List[OncallSlot],
        constraints: Dict,
    ) -> List[OncallAssignment]:
        """Assign each slot, in the order given, to the best eligible engineer."""
        assignments: List[OncallAssignment] = []
        assigned_dates: Dict[str, Set[date]] = {e.id: set() for e in engineers}

        for slot in slots:
            best = self._select_best_engineer(slot, engineers, assigned_dates, constraints)
            if best is None:
                # Nobody passes the hard constraints; the slot stays unfilled
                continue
            assignments.append(OncallAssignment(
                slot=slot,
                engineer=best,
                skill_match_score=self._calculate_skill_match(best, slot),
                fatigue_risk=self._calculate_fatigue_risk(
                    best, assigned_dates[best.id], slot.date
                ),
            ))
            assigned_dates[best.id].add(slot.date)

        return assignments

    def _current_load(
        self, engineer: Engineer, assigned_dates: Dict[str, Set[date]]
    ) -> int:
        """On-call days this month, including days assigned in this run.

        Assumes the slots being planned fall in the current month.
        """
        return engineer.oncall_count_this_month + len(assigned_dates[engineer.id])

    def _select_best_engineer(
        self,
        slot: OncallSlot,
        engineers: List[Engineer],
        assigned_dates: Dict[str, Set[date]],
        constraints: Dict,
    ) -> Optional[Engineer]:
        """Pick the highest-scoring engineer who passes every hard constraint."""
        max_load = max(
            (self._current_load(e, assigned_dates) for e in engineers), default=0
        )
        candidates = []

        for eng in engineers:
            # Hard constraints first: ineligible engineers are never scored
            if not self._check_hard_constraints(eng, slot, assigned_dates, constraints):
                continue

            score = 0.0

            # Factor 1: skill match (weight 40%)
            score += self._calculate_skill_match(eng, slot) * 40

            # Factor 2: fatigue risk (weight 30%); lower fatigue scores higher
            fatigue = self._calculate_fatigue_risk(eng, assigned_dates[eng.id], slot.date)
            score += (1 - fatigue) * 30

            # Factor 3: fairness (weight 20%); lighter load scores higher
            fairness = 1 - self._current_load(eng, assigned_dates) / max(max_load, 1)
            score += fairness * 20

            # Factor 4: preference (weight 10%)
            if eng.preference.get(f"no_{slot.shift}", False):
                score -= 50  # heavy penalty for violating a stated preference
            if eng.preference.get(f"prefer_{slot.shift}", False):
                score += 10

            candidates.append((eng, score))

        if not candidates:
            return None

        # Highest score wins; ties go to the engineer listed first
        return max(candidates, key=lambda pair: pair[1])[0]

    def _calculate_skill_match(self, engineer: Engineer, slot: OncallSlot) -> float:
        """Mean proficiency across the slot's required skills."""
        if not slot.required_skills:
            return 0.5  # neutral score when the slot has no skill forecast
        total = sum(engineer.skills.get(skill, 0.0) for skill in slot.required_skills)
        return total / len(slot.required_skills)

    def _calculate_fatigue_risk(
        self, engineer: Engineer, assigned_dates: Set[date], as_of: date
    ) -> float:
        """Fatigue risk: more on-call days in the 7 days up to `as_of` means higher risk.

        Measured relative to the slot date rather than date.today(), so that
        schedules planned for future weeks are scored correctly.
        """
        if not assigned_dates:
            return 0.0
        recent = sum(1 for d in assigned_dates if 0 <= (as_of - d).days <= 7)
        return min(recent / 5.0, 1.0)  # 5 or more days is maximum risk

    def _check_hard_constraints(
        self,
        engineer: Engineer,
        slot: OncallSlot,
        assigned_dates: Dict[str, Set[date]],
        constraints: Dict,
    ) -> bool:
        """Return True only if assigning this slot violates no hard constraint."""
        # Constraint 1: no more than max_consecutive_days in a row
        max_consecutive = constraints.get("max_consecutive_days", 5)
        consecutive = self._count_consecutive_days(assigned_dates[engineer.id], slot.date)
        if consecutive >= max_consecutive:
            return False

        # Constraint 2: at least min_rest_days between the previous stint and this slot
        if engineer.last_oncall_end:
            rest_days = (slot.date - engineer.last_oncall_end).days
            if rest_days < constraints.get("min_rest_days", 2):
                return False

        # Constraint 3: no more than max_per_month on-call days this month
        max_per_month = constraints.get("max_per_month", 8)
        if self._current_load(engineer, assigned_dates) >= max_per_month:
            return False

        return True

    def _count_consecutive_days(self, assigned: Set[date], target: date) -> int:
        """Count consecutive assigned days immediately before the target date."""
        count = 0
        check = target - timedelta(days=1)
        while check in assigned:
            count += 1
            check -= timedelta(days=1)
        return count
```

4. Boundary Analysis and Architectural Trade-offs

Putting AI schedule optimization into production means facing the following trade-offs:

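Precision of skill matching. Engineers' skill profiles need regular updates, and proficiency ratings are subjective. A practical remedy is to infer proficiency automatically from the "handler" and "time to resolve" fields of incident tickets: the faster an engineer resolves a given class of fault, the higher their proficiency in that skill.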
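One way to turn ticket data into a skill profile is sketched below. This is a minimal illustration under stated assumptions, not a method from the ticketing system itself: the `Ticket` tuple layout, the `min_samples` threshold, and the scoring formula (team median divided by the sum of team median and the engineer's own median) are all hypothetical choices made for the example. The formula yields 0.5 for an engineer who matches the team median and moves toward 1 as their resolution time shrinks.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List, Tuple

# Hypothetical ticket record: (engineer_id, fault_category, minutes_to_resolve)
Ticket = Tuple[str, str, float]


def infer_skill_profiles(
    tickets: List[Ticket], min_samples: int = 3
) -> Dict[str, Dict[str, float]]:
    """Map engineer -> fault category -> proficiency in [0, 1]."""
    by_category: Dict[str, List[float]] = defaultdict(list)
    by_pair: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for engineer_id, category, minutes in tickets:
        by_category[category].append(minutes)
        by_pair[(engineer_id, category)].append(minutes)

    profiles: Dict[str, Dict[str, float]] = defaultdict(dict)
    for (engineer_id, category), durations in by_pair.items():
        if len(durations) < min_samples:
            continue  # too few tickets to say anything about this skill
        team_median = median(by_category[category])
        own_median = median(durations)
        denominator = team_median + own_median
        # Assumed scoring rule: 0.5 at the team median, higher when faster
        score = 0.5 if denominator == 0 else team_median / denominator
        profiles[engineer_id][category] = score
    return dict(profiles)


tickets = [
    ("alice", "database", 10.0), ("alice", "database", 20.0), ("alice", "database", 30.0),
    ("bob", "database", 60.0), ("bob", "database", 60.0), ("bob", "database", 60.0),
    ("carol", "network", 5.0),  # a single ticket: below min_samples, so skipped
]
profiles = infer_skill_profiles(tickets)
```

Resolution time also depends on incident severity and on who happened to be paged, so scores like these are probably best treated as a starting point that engineers can review and correct.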

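Flexibility of the schedule. The schedule an optimizer generates cannot anticipate unplanned events such as an engineer falling ill or taking emergency leave. A manual adjustment path must remain available, and after each adjustment the remaining schedule should be re-optimized automatically.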
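The "adjust, then re-optimize what follows" flow can be sketched as below. The sketch assumes a simplified schedule (one engineer id per date) and uses a hypothetical stand-in policy, `least_loaded`, in place of a real optimizer; the names `apply_override`, `unavailable`, and `effective_from` are likewise illustrative. Slots before `effective_from` are frozen, and everything from that date on is re-planned around the reported absences.

```python
from datetime import date, timedelta
from typing import Callable, Dict, List, Set

Schedule = Dict[date, str]  # slot date -> engineer id (simplified: one slot per day)


def least_loaded(day: date, candidates: List[str], so_far: Schedule) -> str:
    """Stand-in policy: fewest assignments so far wins, ties broken by id."""
    load = {c: 0 for c in candidates}
    for engineer_id in so_far.values():
        if engineer_id in load:
            load[engineer_id] += 1
    return min(candidates, key=lambda c: (load[c], c))


def apply_override(
    schedule: Schedule,
    engineers: List[str],
    unavailable: Dict[str, Set[date]],
    effective_from: date,
    pick: Callable[[date, List[str], Schedule], str] = least_loaded,
) -> Schedule:
    """Keep slots before effective_from, re-plan the rest around absences."""
    result: Schedule = {d: e for d, e in schedule.items() if d < effective_from}
    for day in sorted(d for d in schedule if d >= effective_from):
        candidates = [e for e in engineers if day not in unavailable.get(e, set())]
        if not candidates:
            raise ValueError(f"no engineer available on {day}")
        result[day] = pick(day, candidates, result)
    return result


d0 = date(2024, 1, 1)
days = [d0 + timedelta(days=i) for i in range(4)]
original = dict(zip(days, ["alice", "bob", "carol", "alice"]))
# bob reports sick for the second and third day
updated = apply_override(
    original,
    ["alice", "bob", "carol"],
    unavailable={"bob": {days[1], days[2]}},
    effective_from=days[1],
)
```

Passing the selection policy in as `pick` keeps the override logic separate from the scoring, so the same entry point could call a fuller optimizer. Re-planning every later slot can reshuffle people who were not affected by the absence; a real system might prefer to limit changes to the smallest set of slots.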

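Definition of fairness. Engineers understand "fair" differently: some consider equal shift counts fair, while others consider skill-based assignment fairer. We suggest defining fairness as minimizing the variance of on-call burden, where burden combines shift count, night-shift count, and fatigue level.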
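The variance-based definition can be made concrete with a small calculation. In the sketch below, the `Load` record and the three weights are assumptions chosen for illustration; the text above names the ingredients (shift count, night-shift count, fatigue) but not how to weight them. Lower variance means the burden is spread more evenly.

```python
from statistics import pvariance
from typing import Dict, NamedTuple


class Load(NamedTuple):
    shifts: int         # total on-call shifts in the period
    night_shifts: int   # how many of those were night shifts
    fatigue: float      # fatigue level, 0-1


def burden(
    load: Load,
    w_shift: float = 1.0,    # assumed weight per shift
    w_night: float = 0.5,    # assumed extra weight per night shift
    w_fatigue: float = 2.0,  # assumed weight for fatigue level
) -> float:
    """Combine the three components into one burden number."""
    return w_shift * load.shifts + w_night * load.night_shifts + w_fatigue * load.fatigue


def burden_variance(loads: Dict[str, Load]) -> float:
    """Population variance of burden across the team; 0 means perfectly even."""
    return pvariance([burden(load) for load in loads.values()])


even = {"alice": Load(4, 2, 0.2), "bob": Load(4, 2, 0.2)}
uneven = {"alice": Load(4, 0, 0.0), "bob": Load(4, 4, 0.5)}

even_score = burden_variance(even)      # identical burdens
uneven_score = burden_variance(uneven)  # same shift count, different burden
```

In the `uneven` case both engineers have four shifts, so a count-only definition would call the schedule fair, while the burden variance is nonzero because one of them carries all the night shifts.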

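Scope of applicability: schedule optimization suits SRE teams of 5 or more. A small team of 3-4 has little scheduling freedom, so optimization yields limited gains.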

5. Summary

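AI-driven SRE schedule optimization moves on-call planning from "simple rotation" to "intelligent scheduling". The core model is a multi-factor score that combines skill match, fatigue risk, fairness, and preference; hard constraints guarantee compliance, and soft constraints improve quality. Three recommendations for rollout: first, infer skill profiles automatically from incident ticket data; second, keep a manual adjustment path for unplanned events; third, collect engineer feedback regularly and tune the scoring weights. The guiding principle: the goal of schedule optimization is not a "perfect schedule" but "less skill mismatch and less fatigue accumulation". Each shift that is matched on skill may shorten incident recovery time by as much as 30%.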
