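Prometheus + Grafana Monitoring in Depth: A Production-Grade Deployment from Metric Collection to Multi-Level Alerting
1. Failures Born in Monitoring Blind Spots: When Key Metrics Are Left Out of Collection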
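At the postmortem for a production incident, the team discovered something unsettling: the cascading failure caused by database connection pool exhaustion had shown warning signs a full 20 minutes earlier. Pool utilization had climbed from 60% to 95%, but that metric was never collected. Prometheus was watching only "surface" metrics such as CPU, memory, and QPS, while the "internal" metrics (connection pools, thread pools, GC pauses) were ignored entirely.
This is not an isolated case. Many teams' monitoring setups share three typical blind spots. First, incomplete coverage: only infrastructure metrics are collected, and application-level business metrics are ignored. Second, crude alert rules: every metric gets the same fixed threshold, with no regard for business cycles or daily traffic tides. Third, dashboard sprawl in Grafana: there are dozens of dashboards, yet the key information is hard to find, which adds cognitive load during an incident.
A production-grade monitoring system has to be built systematically along four dimensions: metric collection strategy, a highly available Prometheus architecture, tiered intelligent alerting, and Grafana dashboard governance.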
2. Architecture and Data Flow of the Prometheus Monitoring System
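    graph TB
        subgraph collect["Collection layer"]
            NA[Node Exporter<br/>node metrics]
            KA[kube-state-metrics<br/>K8s resource metrics]
            CA[cAdvisor<br/>container metrics]
            BA[Application exporter<br/>custom metrics]
            PA[Pushgateway<br/>short-lived job metrics]
        end
        subgraph store["Storage layer"]
            PH[Prometheus HA Pair<br/>main instance + replica]
            TS[Thanos Sidecar<br/>uploads to object storage]
            OS[Object storage S3/OSS<br/>long-term history]
        end
        subgraph query["Query layer"]
            TQ[Thanos Query<br/>unified query entry point]
            TG[Thanos Store Gateway<br/>historical data gateway]
        end
        subgraph display["Visualization and alerting layer"]
            GF[Grafana<br/>dashboards]
            AM[Alertmanager<br/>alert routing and inhibition]
            PG[PagerDuty / WeCom<br/>notification channels]
        end
        NA --> PH
        KA --> PH
        CA --> PH
        BA --> PH
        PA --> PH
        PH --> TS
        TS --> OS
        OS --> TG
        PH --> TQ
        TG --> TQ
        TQ --> GF
        PH --> AM
        AM --> PG

Prometheus's pull model makes it a natural fit for monitoring long-running services. Short-lived tasks such as CronJobs, however, need a Pushgateway as an intermediary. For high availability the design uses a Prometheus HA pair: two independent instances scrape the same targets, a Thanos Sidecar next to each one uploads its data to object storage, and Thanos Query acts as the single query entry point that provides a global view.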
3. Implementing a Production-Grade Monitoring System in Code
3.1 Highly Available Prometheus Deployment and Custom Metric Collection
    # Prometheus HA pair - main instance deployment
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: prometheus-main
      namespace: monitoring
    spec:
      # Two replicas form the HA pair. Thanos can only deduplicate them if each
      # replica carries a distinct external label (for example "replica").
      replicas: 2
      serviceName: prometheus-main
      selector:
        matchLabels:
          app: prometheus-main
      template:
        metadata:
          labels:
            app: prometheus-main
        spec:
          serviceAccountName: prometheus
          containers:
            - name: prometheus
              image: prom/prometheus:v2.48.0
              args:
                - "--config.file=/etc/prometheus/prometheus.yml"
                - "--storage.tsdb.path=/prometheus"
                - "--storage.tsdb.retention.time=15d"    # keep 15 days locally
                - "--storage.tsdb.retention.size=80GB"   # local storage cap
                # Equal min/max block duration disables local compaction,
                # which the Thanos sidecar expects when it uploads blocks
                - "--storage.tsdb.min-block-duration=2h"
                - "--storage.tsdb.max-block-duration=2h"
                - "--web.enable-lifecycle"
                - "--web.enable-remote-write-receiver"
              ports:
                - containerPort: 9090
              resources:
                requests:
                  cpu: "2"
                  memory: "8Gi"
                limits:
                  cpu: "4"
                  memory: "16Gi"
              volumeMounts:
                - name: config
                  mountPath: /etc/prometheus
                - name: data
                  mountPath: /prometheus
            - name: thanos-sidecar
              image: thanosio/thanos:v0.32.0
              args:
                - "sidecar"
                - "--tsdb.path=/prometheus"
                - "--prometheus.url=http://localhost:9090"
                - "--objstore.config-file=/etc/thanos/objstore.yml"
                # Also upload blocks that were already compacted
                # (mainly useful when migrating existing data)
                - "--shipper.upload-compacted"
              volumeMounts:
                - name: data
                  mountPath: /prometheus
                - name: thanos-config
                  mountPath: /etc/thanos
          volumes:
            - name: config
              configMap:
                name: prometheus-config
            - name: thanos-config
              secret:
                secretName: thanos-objstore-config
      volumeClaimTemplates:
        - metadata:
            name: data
          spec:
            accessModes: ["ReadWriteOnce"]
            storageClassName: ssd
            resources:
              requests:
                storage: 100Gi

3.2 Instrumenting Application Metrics (Python)
    #!/usr/bin/env python3
    """Prometheus instrumentation for a business application: connection pool and business metrics."""
    import logging
    import threading
    import time
    from functools import wraps

    from flask import Flask, Response
    from prometheus_client import (
        CONTENT_TYPE_LATEST,
        CollectorRegistry,
        Counter,
        Gauge,
        Histogram,
        Summary,
        generate_latest,
    )

    logger = logging.getLogger(__name__)

    # Use a dedicated registry to avoid clashing with the default registry
    registry = CollectorRegistry()

    # ===== Connection pool metrics =====
    db_pool_active = Gauge(
        'db_pool_active_connections',
        'Number of active database connections',
        ['pool_name'],
        registry=registry,
    )
    db_pool_idle = Gauge(
        'db_pool_idle_connections',
        'Number of idle database connections',
        ['pool_name'],
        registry=registry,
    )
    db_pool_wait_count = Counter(
        'db_pool_wait_total',
        'Total number of waits for a connection',
        ['pool_name'],
        registry=registry,
    )
    db_pool_wait_duration = Histogram(
        'db_pool_wait_duration_seconds',
        'Distribution of time spent waiting for a connection',
        ['pool_name'],
        buckets=[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0],
        registry=registry,
    )

    # ===== Business metrics =====
    business_request_total = Counter(
        'business_request_total',
        'Total number of business requests',
        ['service', 'method', 'status'],
        registry=registry,
    )
    business_request_duration = Histogram(
        'business_request_duration_seconds',
        'Business request processing time',
        ['service', 'method'],
        buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0],
        registry=registry,
    )
    # Note: the _total suffix is conventionally reserved for counters
    order_amount = Summary(
        'order_amount_total',
        'Order amount statistics',
        ['channel'],
        registry=registry,
    )


    def track_db_pool(pool_name: str, pool_obj):
        """Sample connection pool metrics periodically; blocks forever, so run it in a background thread."""
        while True:
            try:
                # Works with SQLAlchemy connection pools
                db_pool_active.labels(pool_name=pool_name).set(pool_obj.checkedout())
                db_pool_idle.labels(pool_name=pool_name).set(pool_obj.checkedin())
            except Exception as e:
                # A failed sample must not affect the application; just log it
                logger.warning("Failed to collect connection pool metrics: %s", e)
            time.sleep(5)  # sample every 5 seconds


    def start_pool_tracker(pool_name: str, pool_obj) -> threading.Thread:
        """Start track_db_pool in a daemon thread so it never blocks shutdown."""
        thread = threading.Thread(
            target=track_db_pool, args=(pool_name, pool_obj), daemon=True
        )
        thread.start()
        return thread


    def track_request(service: str, method: str):
        """Decorator: record request count and latency automatically."""
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                start = time.monotonic()
                status = "success"
                try:
                    return func(*args, **kwargs)
                except Exception:
                    status = "error"
                    raise
                finally:
                    duration = time.monotonic() - start
                    business_request_total.labels(
                        service=service, method=method, status=status
                    ).inc()
                    business_request_duration.labels(
                        service=service, method=method
                    ).observe(duration)
            return wrapper
        return decorator


    app = Flask(__name__)


    @app.route('/metrics')
    def metrics():
        """Endpoint that exposes the metrics to Prometheus."""
        return Response(generate_latest(registry), mimetype=CONTENT_TYPE_LATEST)

3.3 Multi-Level Alerting Rules and Inhibition Strategy
    # Prometheus alerting rules - multi-level classification
    groups:
      # ===== P0 critical alerts: respond within 5 minutes =====
      - name: critical_alerts
        rules:
          - alert: ServiceDown
            expr: up == 0
            for: 2m
            labels:
              severity: critical
              team: sre
            annotations:
              summary: "Service {{ $labels.instance }} is down"
              runbook: "https://wiki.internal/runbook/service-down"
          - alert: DBPoolExhausted
            # Share of open connections that are currently checked out
            expr: db_pool_active_connections / (db_pool_active_connections + db_pool_idle_connections) > 0.95
            for: 1m
            labels:
              severity: critical
              team: dba
            annotations:
              summary: "Database connection pool {{ $labels.pool_name }} is about to be exhausted"
              runbook: "https://wiki.internal/runbook/db-pool-exhausted"
      # ===== P1 warning alerts: respond within 30 minutes =====
      - name: warning_alerts
        rules:
          - alert: HighMemoryUsage
            expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.85
            for: 5m
            labels:
              severity: warning
              team: sre
            annotations:
              summary: "Memory usage on node {{ $labels.instance }} is above 85%"
          - alert: PodRestartLoop
            expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
            for: 5m
            labels:
              severity: warning
              team: dev
            annotations:
              summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarted more than 5 times in 1 hour"
      # ===== P2 informational alerts: handle during working hours =====
      - name: info_alerts
        rules:
          - alert: DiskSpaceLow
            expr: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) > 0.80
            for: 30m
            labels:
              severity: info
              team: sre
            annotations:
              summary: "Disk usage on node {{ $labels.instance }} is above 80%"

The corresponding Alertmanager configuration:

    # Alertmanager configuration - routing, inhibition, and silencing
    global:
      resolve_timeout: 5m
      http_config:
        tls_config:
          insecure_skip_verify: false
    route:
      receiver: 'default'
      group_by: ['alertname', 'cluster', 'namespace']
      group_wait: 30s       # wait 30 seconds to batch alerts of the same group
      group_interval: 5m    # interval between notifications for the same group
      repeat_interval: 4h   # interval before repeating an alert
      routes:
        # P0 alerts: page immediately
        - match:
            severity: critical
          receiver: 'critical-pager'
          group_wait: 10s
          repeat_interval: 30m
        # P1 alerts: WeCom notification
        - match:
            severity: warning
          receiver: 'warning-wechat'
          group_wait: 1m
          repeat_interval: 2h
        # P2 alerts: email only
        - match:
            severity: info
          receiver: 'info-email'
          group_wait: 5m
          repeat_interval: 24h
    # Inhibition rules: higher-severity alerts suppress lower-severity ones
    inhibit_rules:
      - source_match:
          severity: critical
        target_match:
          severity: warning
        # Same cluster and namespace. A label missing on both alerts counts
        # as equal, so alerts without these labels inhibit each other too.
        equal: ['cluster', 'namespace']
      - source_match:
          alertname: ServiceDown
        target_match:
          alertname: HighMemoryUsage
        equal: ['instance']   # same instance
    receivers:
      - name: 'default'
        webhook_configs:
          - url: 'http://alertmanager-webhook:8080/api/v1/alerts'
      - name: 'critical-pager'
        pagerduty_configs:
          - routing_key: '<routing-key>'
      - name: 'warning-wechat'
        webhook_configs:
          - url: 'http://wechat-webhook:8080/api/v1/send'
      - name: 'info-email'
        email_configs:
          - to: 'sre-team@company.com'
            from: 'alertmanager@company.com'
            smarthost: 'smtp.company.com:587'

3.4 Grafana Dashboard Generation Script
    #!/usr/bin/env python3
    """Grafana dashboard generator: build monitoring dashboards from the service topology."""
    from typing import Dict, List

    import requests


    class DashboardGenerator:
        """Generate Grafana dashboards from service configuration."""

        def __init__(self, grafana_url: str, api_key: str):
            self.grafana_url = grafana_url.rstrip("/")
            self.headers = {
                "Authorization": f"Bearer {api_key}",
                "Content-Type": "application/json",
            }

        def generate_service_dashboard(
            self,
            service_name: str,
            namespace: str,
            metrics: List[str],
            datasource: str = "Prometheus",
        ) -> Dict:
            """Generate the monitoring dashboard for a single service."""
            panels = []
            y_position = 0

            # Row for infrastructure panels
            panels.append(self._create_row_panel(title="Infrastructure metrics", y_pos=y_position))
            y_position += 1

            # CPU usage panel (cores used per pod, shown as a share of one core)
            panels.append(self._create_timeseries_panel(
                title="CPU usage",
                expr=f'sum(rate(container_cpu_usage_seconds_total{{namespace="{namespace}",pod=~"{service_name}-.*"}}[5m])) by (pod)',
                y_pos=y_position,
                height=8,
                unit="percentunit",
                legend_format="{{pod}}",
            ))
            y_position += 8

            # Memory usage panel
            panels.append(self._create_timeseries_panel(
                title="Memory usage",
                expr=f'container_memory_working_set_bytes{{namespace="{namespace}",pod=~"{service_name}-.*"}}',
                y_pos=y_position,
                height=8,
                unit="bytes",
                legend_format="{{pod}}",
            ))
            y_position += 8

            # Row for business metric panels
            if metrics:
                panels.append(self._create_row_panel(title="Business metrics", y_pos=y_position))
                y_position += 1
                for metric in metrics:
                    panels.append(self._create_timeseries_panel(
                        title=metric,
                        expr=metric,
                        y_pos=y_position,
                        height=8,
                        legend_format="{{instance}}",
                    ))
                    y_position += 8

            dashboard = {
                "dashboard": {
                    "title": f"{service_name} - Service Monitoring",
                    "tags": [namespace, "auto-generated"],
                    "timezone": "browser",
                    "panels": panels,
                    "templating": {
                        "list": [
                            # The original declared one variable named "namespace" with
                            # type "datasource"; it is split here into a data source
                            # picker and a constant. Verify the variable schema
                            # against your Grafana version.
                            {
                                "name": "datasource",
                                "type": "datasource",
                                "query": "prometheus",
                                "current": {"text": datasource, "value": datasource},
                            },
                            {"name": "namespace", "type": "constant", "query": namespace},
                        ]
                    },
                    "refresh": "30s",
                    "time": {"from": "now-1h", "to": "now"},
                },
                "overwrite": True,
            }
            return dashboard

        def _create_timeseries_panel(
            self,
            title: str,
            expr: str,
            y_pos: int,
            height: int = 8,
            unit: str = "short",
            legend_format: str = "",
        ) -> Dict:
            """Create a time series panel."""
            return {
                "type": "timeseries",
                "title": title,
                "gridPos": {"h": height, "w": 12, "x": 0, "y": y_pos},
                "fieldConfig": {
                    "defaults": {
                        "unit": unit,
                        "custom": {
                            "drawStyle": "line",
                            "lineInterpolation": "smooth",
                            "fillOpacity": 10,
                        },
                    }
                },
                "targets": [
                    {"expr": expr, "legendFormat": legend_format, "refId": "A"}
                ],
            }

        def _create_row_panel(self, title: str, y_pos: int) -> Dict:
            """Create a row separator panel."""
            return {
                "type": "row",
                "title": title,
                "gridPos": {"h": 1, "w": 24, "x": 0, "y": y_pos},
                "collapsed": False,
            }

        def push_dashboard(self, dashboard: Dict) -> bool:
            """Push the dashboard to Grafana."""
            url = f"{self.grafana_url}/api/dashboards/db"
            resp = requests.post(url, headers=self.headers, json=dashboard, timeout=30)
            if resp.status_code == 200:
                result = resp.json()
                print(f"Dashboard created: {result.get('url', '')}")
                return True
            print(f"Dashboard creation failed: {resp.status_code} {resp.text}")
            return False

4. Architectural Trade-offs and Limits of the Monitoring System
4.1 Pull vs Push
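Prometheus's pull model simplifies service discovery but has inherent limits: services behind NAT cannot be scraped, and a short-lived task may exit before the next scrape arrives. Pushgateway solves the short-task problem at the cost of a single point of failure. The Pushgateway itself needs a highly available deployment, and pushed data never expires on its own, so it has to be cleaned up periodically. For large volumes of short-lived tasks, consider using Prometheus Remote Write to push directly to Thanos Receive.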
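The push-then-clean-up cycle described above can be sketched with nothing but the standard library. This is a minimal illustration, not a reference client: the Pushgateway address, job name, instance name, and metric names are all hypothetical, and label values containing `/` would need the Pushgateway's base64 form, which is not handled here. The `prometheus_client` package used in section 3.2 also ships `push_to_gateway` and `delete_from_gateway` helpers for the same purpose.

```python
"""Sketch: push a batch job's result to a Pushgateway and delete the group later."""
import time
import urllib.request
from urllib.parse import quote

# Hypothetical in-cluster address; replace with your own Pushgateway
PUSHGATEWAY_URL = "http://pushgateway.monitoring.svc:9091"


def group_url(base_url: str, job: str, instance: str) -> str:
    """Build the grouping-key URL: /metrics/job/<job>/instance/<instance>."""
    return (
        f"{base_url.rstrip('/')}/metrics/job/{quote(job, safe='')}"
        f"/instance/{quote(instance, safe='')}"
    )


def render_metrics(duration_seconds: float, finished_at: float) -> bytes:
    """Render two gauges in the Prometheus text exposition format."""
    lines = [
        "# TYPE batch_job_duration_seconds gauge",
        f"batch_job_duration_seconds {duration_seconds}",
        "# TYPE batch_job_last_success_timestamp_seconds gauge",
        f"batch_job_last_success_timestamp_seconds {finished_at}",
    ]
    # The request body has to end with a newline
    return ("\n".join(lines) + "\n").encode("utf-8")


def build_push_request(url: str, body: bytes) -> urllib.request.Request:
    """PUT replaces every metric previously pushed under this grouping key."""
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


def build_delete_request(url: str) -> urllib.request.Request:
    """DELETE removes the whole group, since the Pushgateway never expires it."""
    return urllib.request.Request(url, method="DELETE")


def send(request: urllib.request.Request, timeout: float = 5.0) -> int:
    """Perform the HTTP call and return the status code (not invoked below)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Build the requests without touching the network
url = group_url(PUSHGATEWAY_URL, job="nightly_report", instance="cronjob-1")
push = build_push_request(url, render_metrics(12.5, time.time()))
cleanup = build_delete_request(url)
print(push.get_method(), push.full_url)
print(cleanup.get_method(), cleanup.full_url)
```

A job would call `send(push)` as its last step, and a separate janitor task would call `send(cleanup)` for groups whose last-success timestamp is older than some agreed age.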
4.2 Local Storage vs Remote Storage
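Prometheus's local TSDB is fast to query but has limited scalability. As a rule of thumb, keep a single instance under 10 million time series; beyond that, query latency rises noticeably. Thanos uploads historical data to object storage and has the Store Gateway load it on demand at query time, but queries over historical data are roughly 2 to 5 times slower than local ones. Recommendation: serve hot data from the last 15 days out of local storage, and serve cold data older than 15 days from object storage.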
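Whether a 15-day local window actually fits on disk can be checked with the sizing formula from the Prometheus storage documentation: retention time in seconds, times ingested samples per second, times bytes per sample. The sketch below applies it to an assumed workload. The series count and scrape interval are invented for illustration, 2 bytes per sample is the conservative end of the documented 1 to 2 byte average, and the cap is computed on the assumption that the `80GB` value of `--storage.tsdb.retention.size` is read in powers of 2.

```python
"""Sketch: rough disk estimate for Prometheus local retention."""

SECONDS_PER_DAY = 86_400


def samples_per_second(active_series: int, scrape_interval_s: float) -> float:
    """Each active series yields one sample per scrape."""
    return active_series / scrape_interval_s


def needed_disk_bytes(
    retention_days: float, rate: float, bytes_per_sample: float = 2.0
) -> float:
    """retention_time_seconds * ingested_samples_per_second * bytes_per_sample."""
    return retention_days * SECONDS_PER_DAY * rate * bytes_per_sample


def days_that_fit(
    disk_bytes: float, rate: float, bytes_per_sample: float = 2.0
) -> float:
    """Days of data that fit before a size-based limit starts dropping blocks."""
    return disk_bytes / (SECONDS_PER_DAY * rate * bytes_per_sample)


# Assumed workload: 600k active series scraped every 15 seconds
rate = samples_per_second(600_000, 15)
estimate = needed_disk_bytes(15, rate)
size_cap = 80 * 2**30  # the 80GB cap from section 3.1, assuming powers of 2

print(f"{rate:,.0f} samples/s")
print(f"15 days need about {estimate / 1e9:.1f} GB")
print(f"the size cap holds about {days_that_fit(size_cap, rate):.1f} days")
```

Under these assumptions the size limit would be reached before the 15-day time limit, so the effective local window would be shorter than the configured retention time.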
4.3 The Practical Challenge of Alert Tiering
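Three alert tiers (P0/P1/P2) are clear in theory, but in practice the criteria for a P0 alert are hard to pin down. "Service down" is P0, but is "service responding slowly" P1 or P0? What if it is slow enough to time out? A progressive approach works better: set several thresholds on the same metric and raise the severity the longer the condition persists, which avoids the tiering confusion that a one-size-fits-all rule creates.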
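One way to keep progressive tiers consistent is to generate them from a single table rather than writing each rule by hand. The sketch below builds three rules for p99 latency from the `business_request_duration_seconds` histogram defined in section 3.2. The alert name, thresholds, and durations are illustrative assumptions, and the output is printed as JSON only to keep the example free of third-party dependencies.

```python
"""Sketch: generate tiered alert rules for one metric from a threshold table."""
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Tier:
    severity: str     # value of the severity label used for routing
    threshold: float  # p99 latency in seconds
    hold: str         # Prometheus duration used in the rule's "for" field


# p99 latency per service, from the histogram defined in section 3.2
P99_LATENCY = (
    "histogram_quantile(0.99, sum by (le, service) "
    "(rate(business_request_duration_seconds_bucket[5m])))"
)

# Illustrative values: a higher threshold held for longer maps to a higher severity
TIERS = [
    Tier("info", 0.5, "5m"),
    Tier("warning", 1.0, "10m"),
    Tier("critical", 2.5, "15m"),
]


def build_rules(alert: str, expr: str, tiers: List[Tier]) -> List[Dict]:
    """Return one Prometheus alerting rule per tier, all on the same expression."""
    rules = []
    for tier in tiers:
        rules.append({
            "alert": f"{alert}{tier.severity.capitalize()}",
            "expr": f"{expr} > {tier.threshold}",
            "for": tier.hold,
            "labels": {"severity": tier.severity},
            "annotations": {
                "summary": f"p99 latency above {tier.threshold}s for {tier.hold}"
            },
        })
    return rules


group = {
    "groups": [
        {
            "name": "latency_tiers",
            "rules": build_rules("SlowResponses", P99_LATENCY, TIERS),
        }
    ]
}
print(json.dumps(group, indent=2))
```

Because every tier carries the same `severity` labels that the Alertmanager routes in section 3.3 match on, the generated rules plug into the existing routing without further changes.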
4.4 When Not to Use It
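The Prometheus stack is a poor fit in the following cases. First, very high-cardinality metrics (for example, metrics with a per-user dimension) bloat the TSDB; use a columnar store such as ClickHouse instead. Second, millisecond-level collection precision: Prometheus is designed around scrape intervals measured in seconds, and about one second is the practical floor, so finer granularity calls for a dedicated APM tool. Third, real-time global aggregation across clusters: Thanos Query has relatively high latency for cross-instance queries, so Mimir or VictoriaMetrics are worth considering.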
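The cardinality problem is easy to quantify before a label ever ships: the worst-case number of series is the product of the distinct values of each label. The sketch below compares the labels of `business_request_total` from section 3.2 with the same metric after a per-user label is added. All cardinalities are made-up figures for illustration, and the budget is the rule-of-thumb ceiling quoted in section 4.2.

```python
"""Sketch: worst-case series count implied by a metric's label sets."""
from math import prod
from typing import Dict

# Rule-of-thumb ceiling per instance quoted in section 4.2
SERIES_BUDGET = 10_000_000


def worst_case_series(
    label_cardinality: Dict[str, int], series_per_combination: int = 1
) -> int:
    """Upper bound: every label value combined with every other one.

    series_per_combination is 1 for a counter or gauge; a histogram emits
    at least one series per bucket plus _sum and _count.
    """
    return prod(label_cardinality.values()) * series_per_combination


# Hypothetical cardinalities for the labels of business_request_total
request_labels = {"service": 20, "method": 15, "status": 2}
# The same metric with a per-user label added
per_user_labels = {**request_labels, "user_id": 100_000}

for name, labels in (
    ("per request type", request_labels),
    ("per user", per_user_labels),
):
    series = worst_case_series(labels)
    verdict = "within budget" if series <= SERIES_BUDGET else "over budget"
    print(f"{name}: {series:,} series ({verdict})")
```

With these figures, a single per-user label turns 600 series into 60 million, which is several times the whole-instance budget for one metric alone.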
5. Summary
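Building a production-grade Prometheus + Grafana monitoring system comes down to two words: "complete" and "accurate". Complete means metric coverage runs from infrastructure through the application layer to the business layer, with no blind spots. Accurate means alert tiers match business risk, which prevents alert fatigue. The high-availability architecture pairs a Prometheus HA pair with Thanos, balancing local query performance against long-term storage needs. Grafana dashboards should be generated per service automatically to avoid the mess of manual maintenance. The system also has limits: high-cardinality workloads are a poor fit for Prometheus, millisecond precision needs dedicated tooling, and cross-cluster aggregation is better served by a more suitable time series database. A good monitoring system lets operations engineers see the trend before a failure happens, rather than hunting for clues in a flood of alerts.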