Kubernetes Log Management and Analysis
Introduction
Logs are a key data source for troubleshooting, performance monitoring, and security auditing in a Kubernetes cluster. An effective log management strategy helps operations teams locate problems quickly and understand system behavior. This article looks at best practices and analysis techniques for Kubernetes log management.
1. Log Architecture Overview
1.1 Log Hierarchy
```text
Kubernetes log architecture

┌──────────────────────────────────────────────┐
│ Application-layer logs                       │
│   - Containerized application logs           │
│   - Application / business logs              │
└──────────────────────────────────────────────┘
                      │
                      ▼
┌──────────────────────────────────────────────┐
│ Container runtime logs                       │
│   - Docker / containerd logs                 │
│   - Container start/stop events              │
└──────────────────────────────────────────────┘
                      │
                      ▼
┌──────────────────────────────────────────────┐
│ Node system logs                             │
│   - kubelet / kube-proxy logs                │
│   - Operating system logs                    │
└──────────────────────────────────────────────┘
                      │
                      ▼
┌──────────────────────────────────────────────┐
│ Control-plane logs                           │
│   - API Server / etcd logs                   │
│   - Scheduler / Controller Manager logs      │
└──────────────────────────────────────────────┘
```
1.2 Log Type Comparison
| Log type | Source | Content | Importance |
|---|---|---|---|
| Application logs | Application inside the container | Business-logic logs | High |
| Container logs | Container runtime | Container lifecycle events | Medium |
| Node logs | kubelet / operating system | Node status | High |
| Control-plane logs | Cluster components | Cluster management | High |
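For reference, each layer is typically read with a different tool. The commands below are illustrative and assume a systemd-based node and a kubeadm-style control plane whose components run as static Pods in kube-system:

```bash
# Application/container logs (add -c <container> for multi-container Pods)
kubectl logs <pod-name> -n <namespace>
# Logs from the previous container instance, useful after a crash
kubectl logs <pod-name> -n <namespace> --previous
# Node-level component logs on a systemd-based node
journalctl -u kubelet
journalctl -u containerd
# Control-plane logs on a kubeadm cluster (static Pods in kube-system)
kubectl logs -n kube-system kube-apiserver-<node-name>
```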
2. Log Collection Solutions
2.1 Log Collection Architecture
The most common pattern is a node-level agent deployed as a DaemonSet; the example below runs Fluentd on every node and ships logs to Elasticsearch.
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      serviceAccountName: fluentd
      tolerations:
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1.15-debian-elasticsearch7
        env:
        - name: FLUENT_ELASTICSEARCH_HOST
          value: "elasticsearch.logging.svc.cluster.local"
        - name: FLUENT_ELASTICSEARCH_PORT
          value: "9200"
        resources:
          limits:
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 200Mi
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
```
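The DaemonSet references `serviceAccountName: fluentd` but does not define it. A minimal sketch of the RBAC objects it assumes, since the Kubernetes metadata filter needs read access to Pods and Namespaces:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluentd
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluentd
rules:
- apiGroups: [""]
  resources: ["pods", "namespaces"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluentd
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluentd
subjects:
- kind: ServiceAccount
  name: fluentd
  namespace: kube-system
```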
2.2 Loki Log Collection
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: promtail
  namespace: logging
spec:
  selector:
    matchLabels:
      app: promtail
  template:
    metadata:
      labels:
        app: promtail
    spec:
      serviceAccountName: promtail
      containers:
      - name: promtail
        image: grafana/promtail:latest
        args:
        - -config.file=/etc/promtail/config.yaml
        volumeMounts:
        - name: config
          mountPath: /etc/promtail
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      volumes:
      - name: config
        configMap:
          name: promtail-config
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
```
Promtail configuration:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: promtail-config
  namespace: logging
data:
  config.yaml: |
    server:
      http_listen_port: 9080
      grpc_listen_port: 0
    positions:
      filename: /tmp/positions.yaml
    clients:
    - url: http://loki:3100/loki/api/v1/push
    scrape_configs:
    - job_name: kubernetes-pods
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      # Map each discovered Pod to its log files on the node; without a
      # __path__ label Promtail has nothing to tail.
      - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
        separator: /
        target_label: __path__
        replacement: /var/log/pods/*$1/*.log
```
2.3 EFK Stack Configuration
```yaml
# Elasticsearch StatefulSet
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
  namespace: logging
spec:
  serviceName: elasticsearch
  replicas: 3
  selector:
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      containers:
      - name: elasticsearch
        image: docker.elastic.co/elasticsearch/elasticsearch:8.5.0
        resources:
          requests:
            memory: "4Gi"
            cpu: "2"
        env:
        # Note: discovery.type=single-node conflicts with replicas: 3; for a real
        # three-node cluster use discovery.seed_hosts / cluster.initial_master_nodes.
        - name: discovery.type
          value: "single-node"
        - name: ES_JAVA_OPTS
          value: "-Xms2g -Xmx2g"
        ports:
        - containerPort: 9200
          name: http
        volumeMounts:
        - name: data
          mountPath: /usr/share/elasticsearch/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 100Gi
```
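The StatefulSet's `serviceName: elasticsearch` assumes a headless Service of that name; a minimal sketch:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch
  namespace: logging
spec:
  clusterIP: None   # headless: gives each Pod a stable DNS name
  selector:
    app: elasticsearch
  ports:
  - name: http
    port: 9200
```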
3. Log Management Best Practices
3.1 Log Format Standardization
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
data:
  fluent.conf: |
    <match **>
      @type elasticsearch
      host elasticsearch
      port 9200
      logstash_format true
      logstash_prefix kubernetes
      include_tag_key true
      tag_key @log_name
    </match>
```
Structured log output:
{ "timestamp": "2024-01-15T10:30:00Z", "level": "INFO", "logger": "app", "message": "User login successful", "request_id": "abc123", "user_id": "user-456", "response_time": 125, "status_code": 200 }3.2 日志保留策略
3.2 Log Retention Policy
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: elasticsearch-pdb
  namespace: logging
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: elasticsearch
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: curator
  namespace: logging
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: curator
            image: bobrik/curator:5.8
            command:
            - curator
            - --config
            - /config/config.yml
            - /config/action_file.yml
            volumeMounts:
            - name: config
              mountPath: /config
          volumes:
          - name: config
            configMap:
              name: curator-config
          restartPolicy: OnFailure
```
Curator configuration:
```yaml
# config.yml
client:
  hosts:
  - elasticsearch
  port: 9200
  url_prefix:
  use_ssl: False
  certificate:
  client_cert:
  client_key:
  ssl_no_validate: False
  http_auth:
  timeout: 30
  master_only: False
logging:
  loglevel: INFO

# action_file.yml
actions:
  1:
    action: delete_indices
    description: "Delete indices older than 30 days"
    options:
      ignore_empty_list: True
      timeout_override:
      continue_if_exception: False
    filters:
    - filtertype: pattern
      kind: prefix
      value: kubernetes-
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y.%m.%d'
      unit: days
      unit_count: 30
```
3.3 Log Access Control
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: log-reader
  namespace: logging
rules:
- apiGroups: [""]
  resources: ["pods/log"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: log-reader-binding
  namespace: logging
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: log-reader
subjects:
- kind: User
  name: developer@example.com
  apiGroup: rbac.authorization.k8s.io
```
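To verify the binding behaves as intended, `kubectl auth can-i` can check the log subresource on behalf of the user; for example (run with cluster-admin rights):

```bash
kubectl auth can-i get pods --subresource=log -n logging --as developer@example.com
```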
4. Log Analysis and Querying
4.1 Kibana Query Examples
```
# Error logs in the last hour
level: ERROR AND @timestamp:[now-1h TO now]

# Slow requests for a specific application
app: "my-app" AND response_time > 500

# Authentication failures
message: *"authentication failed"*

# Aggregation: error counts per application
GET /_search
{
  "aggs": {
    "errors_by_app": {
      "terms": { "field": "app.keyword", "size": 10 },
      "aggs": {
        "error_count": {
          "filter": { "term": { "level": "ERROR" } }
        }
      }
    }
  }
}
```
4.2 Loki Query Examples
```
# Pod logs containing "error"
{app="my-app", namespace="default"} |= "error"

# Count error lines over time
count_over_time({app="my-app"} |= "ERROR" [5m])

# Parse logfmt and filter by level
{namespace="kube-system"} | logfmt | level="error"

# Regular-expression match
{app="my-app"} |~ "authentication.*failed"
```
4.3 Grafana Log Dashboard
{ "dashboard": { "title": "Kubernetes 日志分析", "panels": [ { "type": "logs", "target": { "expr": "{namespace=~\"$namespace\", app=~\"$app\"}", "refId": "A" }, "title": "实时日志" }, { "type": "graph", "target": "count_over_time({namespace=~\"$namespace\"} |= \"ERROR\" [5m])", "title": "错误率" }, { "type": "stat", "target": "sum(count_over_time({namespace=~\"$namespace\"}[5m]))", "title": "日志总量" } ], "templating": { "list": [ { "name": "namespace", "type": "query", "query": "label_values({__name__=\"namespace\"}, namespace)" }, { "name": "app", "type": "query", "query": "label_values({namespace=~\"$namespace\"}, app)" } ] } } }五、日志监控与告警
5.1 Alert Rule Configuration
```yaml
# Note: the LogQL expressions below must be evaluated by the Loki ruler;
# Prometheus itself cannot evaluate LogQL.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: log-alerts
  namespace: monitoring
spec:
  groups:
  - name: log_rules
    rules:
    - alert: HighErrorRate
      expr: sum(count_over_time({app=~".+"} |= "ERROR" [5m])) by (app) > 10
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High error rate for application {{ $labels.app }}"
        description: "More than 10 error log lines in the last 5 minutes"
    - alert: LogVolumeHigh
      # Requested storage per namespace is only a rough proxy for log volume.
      expr: sum by (namespace) (kube_pod_container_resource_requests_storage_bytes) > 100 * 1024 * 1024 * 1024
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Log storage too high in namespace {{ $labels.namespace }}"
        description: "Log storage has exceeded 100 GB"
    - alert: LogCollectionFailed
      expr: absent_over_time(promtail_scrape_samples_scraped_total[5m])
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Log collection failed"
        description: "Promtail has not collected any logs"
```
5.2 Anomaly Detection
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: log-anomaly-detection
spec:
  groups:
  - name: anomaly_rules
    rules:
    # Ratio of the last minute's log count to the per-minute average over the
    # last hour (LogQL, evaluated by the Loki ruler).
    - record: log_anomaly_score
      expr: |
        sum by (app) (count_over_time({app="my-app"}[1m]))
        /
        (sum by (app) (count_over_time({app="my-app"}[1h])) / 60)
    - alert: LogVolumeSpike
      expr: log_anomaly_score > 3
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: "Log volume spike for {{ $labels.app }}"
        description: "Log volume is more than 3 times the hourly average"
```
6. Log Security and Compliance
6.1 Log Encryption
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: elasticsearch-certificates
  namespace: logging
type: Opaque
data:
  tls.crt: <base64-encoded-cert>
  tls.key: <base64-encoded-key>
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kibana
  namespace: logging
spec:
  selector:
    matchLabels:
      app: kibana
  template:
    metadata:
      labels:
        app: kibana
    spec:
      containers:
      - name: kibana
        image: docker.elastic.co/kibana/kibana:8.5.0
        env:
        - name: ELASTICSEARCH_HOSTS
          value: "https://elasticsearch:9200"
        - name: ELASTICSEARCH_USERNAME
          valueFrom:
            secretKeyRef:
              name: elasticsearch-credentials
              key: username
        - name: ELASTICSEARCH_PASSWORD
          valueFrom:
            secretKeyRef:
              name: elasticsearch-credentials
              key: password
```
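The Deployment assumes a Secret named `elasticsearch-credentials`; it can be created imperatively, for example (the username and password values here are placeholders):

```bash
kubectl -n logging create secret generic elasticsearch-credentials \
  --from-literal=username=elastic \
  --from-literal=password='<strong-password>'
```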
6.2 Access Log Auditing
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-config
data:
  nginx.conf: |
    http {
      log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for" '
                      '$request_time $upstream_response_time';
      access_log /var/log/nginx/access.log main;
    }
```
7. Common Issues and Solutions
7.1 Log Loss
Problem analysis:
- Pod restarts cause container logs to be lost
- The log collector is misconfigured
- Storage capacity is insufficient
Solution:
```yaml
# Back log directories with persistent storage instead of node-local paths
volumes:
- name: varlog
  persistentVolumeClaim:
    claimName: log-storage
```
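The referenced claim must exist beforehand; a minimal sketch of the `log-storage` PVC (size and storage class are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: log-storage
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 50Gi
```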
7.2 Slow Log Queries
Problem analysis:
- Too many indices
- Inefficient query conditions
- Insufficient storage performance
Solution:
```yaml
# Run Elasticsearch via the ECK operator and pair it with an index lifecycle
# management (ILM) policy (see the sketch below) so old indices roll over and
# are deleted automatically.
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  nodeSets:
  - name: default
    count: 3
    config:
      node.store.allow_mmap: false
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 100Gi
        storageClassName: fast
```
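A hedged sketch of such an ILM policy; the policy name, rollover thresholds, and retention period are illustrative and can be applied through Kibana Dev Tools or the `_ilm` REST API:

```
PUT _ilm/policy/kubernetes-logs
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d", "max_primary_shard_size": "50gb" }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
```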
7.3 Sensitive Information Leaked in Logs
Problem analysis:
- Logs contain passwords, tokens, and other sensitive data
- Logs are not masked or redacted
Solution:
```
# Fluentd masking filter: redact password/token values in the message field
<filter **>
  @type record_transformer
  enable_ruby true
  <record>
    message ${record["message"].gsub(/(password|token)=[^&]+/, '\1=***')}
  </record>
</filter>
```
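Effect of the filter on a typical query-string style message (hypothetical input):

```
# before
GET /login?user=alice&password=s3cr3t&token=abc123
# after
GET /login?user=alice&password=***&token=***
```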
Conclusion
Log management is a core part of operating a Kubernetes cluster. With a well-designed collection architecture, standardized log formats, a sound retention strategy, and capable analysis tools, you can build an efficient and reliable logging system. Combined with security and compliance requirements and continuous optimization, it will better support troubleshooting and business analysis.