Building a Pacemaker Cluster and a Highly Available Web Service: A Hands-On Guide

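1. Why Use a Pacemaker Cluster?

Imagine you run an e-commerce site and the server goes down: nobody can place an order until it is back. Outages like this can cost a business millions a year. High-availability clusters exist to keep a service running when an individual machine fails.

Pacemaker is one of the most mature open-source cluster resource managers in the Linux ecosystem. It provides:

Automatic failure detection and recovery: when the active node goes down, a standby node can typically take over within seconds
Policy-based resource placement: services are placed across nodes according to constraints, stickiness, and, where configured, node utilization
No single point of failure: several nodes cooperate so that losing one machine does not take the service down

On a financial project I worked on, adopting Pacemaker raised availability from 99.9% to 99.99%, which cuts annual downtime from roughly 8.8 hours to about 53 minutes. For critical business systems that improvement is worth a great deal.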
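The downtime figures follow directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year (the function name `downtime_minutes` is made up for this illustration):

```shell
#!/usr/bin/env bash
# Annual downtime, in minutes, implied by an availability percentage.
# Assumes a 365-day year (525,600 minutes).
downtime_minutes() {
    awk -v a="$1" 'BEGIN { printf "%.1f\n", (100 - a) / 100 * 365 * 24 * 60 }'
}

downtime_minutes 99.9    # 525.6 minutes, about 8.8 hours
downtime_minutes 99.99   # 52.6 minutes
```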

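2. Environment Preparation and Base Configuration

2.1 Hardware and Network Planning

Use two servers with similar specifications (physical or virtual), and pay particular attention to the network topology:

Heartbeat network: carries inter-node cluster traffic; a dedicated NIC and switch are recommended
Service network: carries the VIP and the actual service traffic
Storage network: needs separate planning if shared storage is used

This walkthrough uses two CentOS 7 virtual machines:

node1: 192.168.139.87
node2: 192.168.139.88
VIP: 192.168.139.118

Tip: for production, a cluster of three or more nodes is strongly recommended. A two-node cluster cannot tell a failed peer from a broken link by quorum alone, so it is exposed to split-brain unless fencing is configured.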
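The reason two nodes are awkward comes down to majority arithmetic: a partition needs more than half of the votes to keep quorum. A small sketch, assuming one vote per node and no special two-node handling (`quorum_votes` is a name chosen for this example):

```shell
#!/usr/bin/env bash
# Votes needed for a strict majority in an N-node cluster (one vote per node).
quorum_votes() {
    echo $(( $1 / 2 + 1 ))
}

for nodes in 2 3 5; do
    need=$(quorum_votes "$nodes")
    echo "nodes=$nodes quorum=$need tolerated_failures=$(( nodes - need ))"
done
```

With two nodes, losing either one loses the majority, which is why a two-node cluster has to rely on fencing (and corosync's two-node mode) rather than on quorum.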

2.2 Base System Configuration

Start by running these steps on every node:

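# Stop and disable the firewall (lab shortcut; in production, open the
# cluster ports instead: firewall-cmd --permanent --add-service=high-availability)
systemctl stop firewalld
systemctl disable firewalld

# Switch SELinux to permissive mode on the running system
setenforce 0
# Disable SELinux permanently (takes effect after a reboot)
sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config

# Install the cluster components
yum install -y fence-agents-all corosync pacemaker pcs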

I have seen quite a few clusters misbehave because SELinux was not fully disabled, so confirm both the running mode and the config file.
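One way to confirm the setting is to compare the running mode with what the config file will apply on the next boot. The sketch below assumes the standard `/etc/selinux/config` location; `selinux_config_mode` is a helper written for this example, not a system command:

```shell
#!/usr/bin/env bash
# Print the SELINUX= value from a config file (default: /etc/selinux/config).
selinux_config_mode() {
    local file="${1:-/etc/selinux/config}"
    [ -r "$file" ] || { echo "unknown"; return 0; }
    awk -F= '/^SELINUX=/ { print $2 }' "$file"
}

# Mode of the running system, when the SELinux tools are installed
if command -v getenforce >/dev/null 2>&1; then
    echo "running mode: $(getenforce)"
fi
# Mode that will apply after the next reboot
echo "configured mode: $(selinux_config_mode)"
```

After `setenforce 0` the running mode reads Permissive; it becomes Disabled only once the node has rebooted with the edited config file.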

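3. Building the Core Cluster Components

3.1 Node Communication

Cluster nodes must be able to resolve each other by hostname. Add both nodes to /etc/hosts: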

# Run on both node1 and node2
cat >> /etc/hosts <<EOF
192.168.139.87 node1
192.168.139.88 node2
EOF

# Verify name resolution
ping -c 3 node2
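Because `cat >>` appends unconditionally, rerunning the step leaves duplicate entries in /etc/hosts. A hedged alternative is to append only when the hostname is missing; `add_host_entry` below is a hypothetical helper, and the demo writes to a temporary file rather than the real /etc/hosts:

```shell
#!/usr/bin/env bash
# Append "IP NAME" to a hosts file unless NAME is already mapped there.
add_host_entry() {
    local ip="$1" name="$2" file="${3:-/etc/hosts}"
    if ! grep -qE "[[:space:]]${name}([[:space:]]|\$)" "$file"; then
        printf '%s %s\n' "$ip" "$name" >> "$file"
    fi
}

# Demo against a scratch file; repeating a call does not add a second line
hosts_demo="$(mktemp)"
add_host_entry 192.168.139.87 node1 "$hosts_demo"
add_host_entry 192.168.139.88 node2 "$hosts_demo"
add_host_entry 192.168.139.88 node2 "$hosts_demo"
cat "$hosts_demo"
```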

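Next, set up passwordless SSH between the nodes:

# Generate a key pair (run on each node)
ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa
# Distribute the public key to both nodes
ssh-copy-id node1
ssh-copy-id node2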

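3.2 Cluster User and Authentication

pcs authenticates between nodes as the hacluster user:

# Set the same password on every node
echo "YourSecurePassword" | passwd --stdin hacluster

# Start pcsd and enable it at boot (every node)
systemctl start pcsd
systemctl enable pcsd

# Authenticate the nodes to each other (run once, on one node)
pcs cluster auth node1 node2 -u hacluster -p YourSecurePassword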

4. Highly Available Web Service in Practice

4.1 Apache Configuration

Install and configure Apache on both nodes:

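yum install -y httpd

# Create a test page that identifies the node
echo "This is node1" > /var/www/html/index.html   # run on node1
echo "This is node2" > /var/www/html/index.html   # run on node2

# Enable the status page that the cluster monitor polls.
# The stock httpd.conf already contains "Listen 80"; a second Listen for the
# same port can make httpd fail to bind, so none is appended here.
cat >> /etc/httpd/conf/httpd.conf <<'EOF'
ServerName www.demo.com
<Location /server-status>
    SetHandler server-status
    Require local
</Location>
EOF

# Note: do not start or enable httpd yourself; Pacemaker will manage it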
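Appending to httpd.conf can leave two Listen directives for the same port, which typically stops httpd from binding. A rough sketch for spotting that before the cluster tries to start Apache; `duplicate_listen_ports` is an illustrative helper and only looks at the one file it is given, not at included files:

```shell
#!/usr/bin/env bash
# Print every port that appears in more than one Listen directive of a file.
# Rough check: distinct addresses on one port are legitimate but also reported.
duplicate_listen_ports() {
    awk '/^[[:space:]]*Listen[[:space:]]/ {
             n = split($2, parts, ":")   # handles "80", "0.0.0.0:80", "[::]:80"
             print parts[n]
         }' "$1" | sort | uniq -d
}

# Demo on a scratch file that mimics the situation
conf_demo="$(mktemp)"
printf 'Listen 80\nListen 0.0.0.0:80\nListen 443\n' > "$conf_demo"
duplicate_listen_ports "$conf_demo"
```

On a real node, point it at /etc/httpd/conf/httpd.conf and follow up with `apachectl configtest` for a syntax check.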

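4.2 Cluster Resource Management

Create and start the cluster:

# Initialize the cluster (run on one node)
pcs cluster setup --name web_cluster node1 node2
pcs cluster start --all
pcs cluster enable --all

# Lab only: with no fencing device defined, Pacemaker will not start
# resources while STONITH is enabled. Configure real fencing in production.
pcs property set stonith-enabled=false

# Add the VIP resource
pcs resource create ClusterVIP ocf:heartbeat:IPaddr2 \
    ip=192.168.139.118 cidr_netmask=24 \
    op monitor interval=30s

# Add the web server resource
pcs resource create WebService ocf:heartbeat:apache \
    configfile=/etc/httpd/conf/httpd.conf \
    statusurl="http://localhost/server-status" \
    op monitor interval=30s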

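Constraints are the key step: they keep the VIP and the web server on the same node and start them in the right order. Use one of the two approaches:

# Option 1: resource group (recommended for beginners);
# members run on the same node and start in the listed order
pcs resource group add WebGroup ClusterVIP WebService

# Option 2: explicit constraints instead of a group
pcs constraint colocation add WebService with ClusterVIP INFINITY
pcs constraint order ClusterVIP then WebService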

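5. Failover Testing and Tuning

5.1 Simulating a Failure by Hand

First check which node currently runs the resources:

pcs status | grep -E "ClusterVIP|WebService"

Then stop the cluster services on the active node:

# On the active node (node1 in this example)
pcs cluster stop node1

# On the standby node, confirm the takeover
pcs status
curl http://192.168.139.118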
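To put a number on the takeover, a client can poll the VIP until it answers again. The sketch below is a generic retry loop; `wait_until` is a helper defined here, and the curl invocation in the comment is one possible probe, not part of the original procedure:

```shell
#!/usr/bin/env bash
# Run a command once per second until it succeeds or TIMEOUT seconds pass.
# On success, print the seconds waited and return 0; on timeout, return 1.
wait_until() {
    local timeout="$1" start now
    shift
    start=$(date +%s)
    until "$@"; do
        now=$(date +%s)
        if [ $(( now - start )) -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
    done
    now=$(date +%s)
    echo $(( now - start ))
}

# On a client machine, right after stopping the active node:
#   wait_until 60 curl -fsS -o /dev/null --max-time 2 http://192.168.139.118/
# Harmless local demonstration: "true" succeeds immediately
wait_until 5 true
```

The printed value is only as precise as the one-second polling step, so treat it as an upper bound on the outage a client would see.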

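5.2 Tuning Recommendations

Adjust the monitoring parameters to suit the workload:

# Check the VIP resource more often
pcs resource update ClusterVIP op monitor interval=10s timeout=20s

# Recovery policy: prefer keeping resources where they are, and
# do not ban a node after a single failed start
pcs resource defaults resource-stickiness=100
pcs property set start-failure-is-fatal=false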

In production I have found this combination to work best:

Monitor interval: 10 to 30 seconds
Timeout: twice the monitor interval
Failure retries: 3
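These rules of thumb can be turned into commands mechanically. The sketch below only prints the commands so they can be reviewed first; mapping "3 retries" to the `migration-threshold` meta attribute is my reading of the list, and `tuning_commands` is a hypothetical helper:

```shell
#!/usr/bin/env bash
# Print pcs commands for a resource: timeout = 2 x interval (seconds), and
# migration-threshold as the number of failures tolerated on a node.
tuning_commands() {
    local resource="$1" interval="$2" retries="${3:-3}"
    echo "pcs resource update $resource op monitor interval=${interval}s timeout=$(( interval * 2 ))s"
    echo "pcs resource meta $resource migration-threshold=$retries"
}

tuning_commands WebService 15
```

Printing instead of executing keeps the helper safe to run anywhere; pipe the output to `sh` on a cluster node once the values look right.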

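6. Production Considerations

A real deployment also needs to cover:

Fencing (STONITH) devices: prevent split-brain
DRBD storage replication: keeps data consistent across nodes
Monitoring integration: hook into Prometheus or a similar monitoring system
Log collection: analyze cluster events in one place

One customer's cluster kept failing over spuriously, and the cause turned out to be network jitter. We fixed it by raising the corosync token timeout: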

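# token and consensus are corosync totem options, not Pacemaker properties,
# so "pcs property set" does not apply. Set them in the totem { } section
# of /etc/corosync/corosync.conf on one node (values in milliseconds):
#     token: 30000
#     consensus: 36000
# Then push the file to every node and restart the cluster
pcs cluster sync
pcs cluster stop --all
pcs cluster start --all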
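When changing these timers, note that corosync expects consensus to be at least 1.2 times token. A small sanity check to run before editing the configuration (`check_totem` is a name invented for this sketch; values are in milliseconds):

```shell
#!/usr/bin/env bash
# Succeed only if consensus >= 1.2 * token (both in milliseconds).
check_totem() {
    local token="$1" consensus="$2"
    # Integer form of: consensus >= 1.2 * token
    if [ $(( consensus * 10 )) -ge $(( token * 12 )) ]; then
        echo "ok: token=${token} consensus=${consensus}"
    else
        echo "too low: consensus must be at least $(( (token * 12 + 9) / 10 ))"
        return 1
    fi
}

check_totem 30000 36000
# After the cluster restarts, the live values can be inspected with:
#   corosync-cmapctl | grep -E 'totem\.(token|consensus)'
```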

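Remember that any high-availability setup needs regular drills. I recommend a full failover test at least once a quarter, including extreme cases such as simulated hardware failure and network partition.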
