IX. CentOS 6.4 + Corosync + Pacemaker: Building a Highly Available Web Cluster
1. Environment
(1). Operating system
CentOS 6.4, x86_64
(2). Software
Corosync 1.4.1
Pacemaker 1.1.8
crmsh 1.2.6
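(3). Topology
2. Installing and configuring Corosync and Pacemaker
I won't repeat the installation and configuration of Corosync and Pacemaker here; please refer to the earlier post "Linux 高可用(HA)集群之Corosync详解" (Linux High Availability (HA) Clusters: Corosync in Detail).
3. Ways to configure resources in Pacemaker
(1). Command-line tools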
crmsh
pcs
(2). Graphical tools
pygui
hawk
LCMC
pcs
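Note: this article focuses on crmsh.
4. A brief introduction to crmsh
Note: below is an excerpt from the Pacemaker 1.1.8 release notes. The most important point is that, starting with Pacemaker 1.1.8, the crm shell has become an independent project and is no longer shipped with Pacemaker. In other words, after installing Pacemaker you will not have the crm command-line resource manager.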
[root@node1 ~]# cd /usr/share/doc/pacemaker-1.1.8/
[root@node1 pacemaker-1.1.8]# ll
total 132
-rw-r--r-- 1 root root 1102 Feb 22 13:05 AUTHORS
-rw-r--r-- 1 root root 109311 Feb 22 13:05 ChangeLog
-rw-r--r-- 1 root root 18046 Feb 22 13:05 COPYING
[root@node1 pacemaker-1.1.8]# vim ChangeLog
* Thu Sep 20 2012 Andrew Beekhof <andrew@beekhof.net> Pacemaker-1.1.8-1
- Update source tarball to revision: 1a5341f
- Statistics:
Changesets: 1019
Diff: 2107 files changed, 117258 insertions(+), 73606 deletions(-)
- All APIs have been cleaned up and reduced to essentials
- Pacemaker now includes a replacement lrmd that supports systemd and upstart agents
- Config and state files (cib.xml, PE inputs and core files) have moved to new locations
- The crm shell has become a separate project and no longer included with Pacemaker
- All daemons/tools now have a unified set of error codes based on errno.h (see crm_error)
[root@node1 ~]# crm
crmadmin crm_diff crm_failcount crm_mon crm_report crm_shadow crm_standby crm_verify
crm_attribute crm_error crm_master crm_node crm_resource crm_simulate crm_ticket
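Note: as you can see, once Pacemaker is installed there is no crm shell command; tab completion on crm lists only the crm_* tools. We have to install crmsh separately, so let's look at how to do that.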
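The same check can be done without relying on tab completion. The following is a minimal sketch, assuming a POSIX shell; the helper name have_cmd is made up for this example and is not part of Pacemaker or crmsh.

```shell
#!/bin/sh
# have_cmd is a hypothetical helper: it succeeds if the named command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Report which of the cluster tools mentioned above are present on this host.
for tool in crm crm_mon crm_verify; do
    if have_cmd "$tool"; then
        echo "$tool: found"
    else
        echo "$tool: not found"
    fi
done
```

On a node with only Pacemaker 1.1.8 installed, one would expect crm to be reported as not found while the crm_* tools are found.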
5. Installing the crmsh resource management tool
(1). crmsh official website
https://savannah.nongnu.org/forum/forum.php?forum_id=7672
(2). crmsh download location
SUSE.org/repositories/network:/ha-clustering:/Stable/
(3). Installing crmsh
[root@node1 ~]# rpm -ivh crmsh-1.2.6-0.rc2.2.1.x86_64.rpm
warning: crmsh-1.2.6-0.rc2.2.1.x86_64.rpm: Header V3 DSA/SHA1 Signature, key ID 7b709911: NOKEY
error: Failed dependencies:
pssh is needed by crmsh-1.2.6-0.rc2.2.1.x86_64
python-dateutil is needed by crmsh-1.2.6-0.rc2.2.1.x86_64
python-lxml is needed by crmsh-1.2.6-0.rc2.2.1.x86_64
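Note: as you can see, dependency packages are missing. We install python-dateutil and python-lxml with yum first. pssh is not installed here, which is why the rpm command below is run with --nodeps.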
[root@node1 ~]# yum install -y python-dateutil python-lxml
[root@node1 ~]# rpm -ivh crmsh-1.2.6-0.rc2.2.1.x86_64.rpm --nodeps
warning: crmsh-1.2.6-0.rc2.2.1.x86_64.rpm: Header V3 DSA/SHA1 Signature, key ID 7b709911: NOKEY
Preparing... ########################################### [100%]
1:crmsh ########################################### [100%]
[root@node1 ~]# crm #after installation, tab completion now lists a crm command, so the install succeeded
crm crm_attribute crm_error crm_master crm_node crm_resource crm_simulate crm_ticket
crmadmin crm_diff crm_failcount crm_mon crm_report crm_shadow crm_standby crm_verify
[root@node1 ~]# crm #run crm to enter resource configuration mode
Cannot change active directory to /var/lib/pacemaker/cores/root: No such file or directory (2)
crm(live)# help #show the help
This is crm shell, a Pacemaker command line interface.
Available commands:
cib manage shadow CIBs
resource resources management
configure CRM cluster configuration
node nodes management
options user preferences
history CRM cluster history
site Geo-cluster support
ra resource agents information center
status show cluster status
help,? show help (help topics for list of topics)
end,cd,up go back one level
quit,bye,exit exit the program
crm(live)#
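The dependency step in (3) above can be scripted. This is only a sketch: missing_deps is a hypothetical helper that pulls package names out of rpm's "Failed dependencies" text, and the sample text is copied from the error shown earlier.

```shell
#!/bin/sh
# missing_deps is a hypothetical helper: it reads rpm error output on stdin
# and prints the first word of every "<pkg> is needed by <other>" line.
missing_deps() {
    awk '/ is needed by / { print $1 }'
}

# Sample input taken from the failed install above.
sample='error: Failed dependencies:
        pssh is needed by crmsh-1.2.6-0.rc2.2.1.x86_64
        python-dateutil is needed by crmsh-1.2.6-0.rc2.2.1.x86_64
        python-lxml is needed by crmsh-1.2.6-0.rc2.2.1.x86_64'

printf '%s\n' "$sample" | missing_deps
```

On a real node the input could come from a trial run such as rpm -ivh --test crmsh-1.2.6-0.rc2.2.1.x86_64.rpm 2>&1, which checks dependencies without installing anything. Whether yum can then provide every listed package depends on the repositories you have enabled.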
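Note: the preparation is now complete. Before we configure the highly available web cluster, here is a brief look at how to use crmsh.
6. Using crmsh
Note: the best way to approach a new command is to read its man page. Get familiar with the basics first, then experiment step by step.
[root@node1 ~]# crm #run crm to enter the crm shell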
Cannot change active directory to /var/lib/pacemaker/cores/root: No such file or directory (2)
crm(live)# help #help lists the available subcommands
This is crm shell, a Pacemaker command line interface.
Available commands:
cib manage shadow CIBs
resource resources management
configure CRM cluster configuration
node nodes management
options user preferences
history CRM cluster history
site Geo-cluster support
ra resource agents information center
status show cluster status
help,? show help (help topics for list of topics)
end,cd,up go back one level
quit,bye,exit exit the program
crm(live)# configure #enter configure to switch to the configure level
crm(live)configure# #press Tab twice to list every command available under configure
? default-timeouts group node rename simulate
bye delete help op_defaults role template
cd edit history order rsc_defaults up
cib end load primitive rsc_template upgrade
cibstatus erase location property rsc_ticket user
clone exit master ptest rsctest verify
collocation fencing_topology modgroup quit save xml
colocation filter monitor ra schema
commit graph ms refresh show
crm(live)configure# help node #help followed by any command shows its usage and examples
The node command describes a cluster node. Nodes in the CIB are
commonly created automatically by the CRM. Hence, you should not
need to deal with nodes unless you also want to define node
attributes. Note that it is also possible to manage node
attributes at the `node` level.
Usage:
...............
node <uname>[:<type>]
[attributes <param>=<value> [<param>=<value>...]]
[utilization <param>=<value> [<param>=<value>...]]
type :: normal | member | ping
...............
Example:
...............
node node1
node big_node attributes memory=64
...............
Note: that's all for the introduction. In short: when you don't know a command, run help on it. Now let's configure the highly available web cluster.
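Besides the interactive shell, crm also accepts a level and a command as arguments on the ordinary command line (crm status is used that way later in this article). Here is a small sketch of that style. The wrapper crm_run and the DRY_RUN variable are inventions for this example; by default it only prints the commands, so it is safe to run on a machine without a cluster.

```shell
#!/bin/sh
# DRY_RUN and crm_run are hypothetical names used only in this sketch.
# With DRY_RUN=1 (the default) commands are printed but not executed.
DRY_RUN="${DRY_RUN:-1}"

crm_run() {
    echo "+ crm $*"
    if [ "$DRY_RUN" = "0" ] && command -v crm >/dev/null 2>&1; then
        crm "$@"
    fi
}

crm_run configure show                 # print the current configuration
crm_run ra classes                     # list resource agent classes
crm_run ra info ocf:heartbeat:IPaddr   # show parameters of one agent
```

Run it with DRY_RUN=0 on a cluster node to actually execute the three commands.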
7. Configuring a highly available web cluster with crmsh
(1). View the default configuration
[root@node1 ~]# crm
Cannot change active directory to /var/lib/pacemaker/cores/root: No such file or directory (2)
crm(live)# configure
crm(live)configure# show
node node1.test.com
node node2.test.com
property $id="cib-bootstrap-options" \
dc-version="1.1.8-7.el6-394e906" \
cluster-infrastructure="classic openais (with plugin)" \
expected-quorum-votes="2"
(2). Check the configuration for errors
crm(live)# configure
crm(live)configure# verify
crm_verify[5202]: 2011/06/14_19:10:38 ERROR: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
crm_verify[5202]: 2011/06/14_19:10:38 ERROR: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
crm_verify[5202]: 2011/06/14_19:10:38 ERROR: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
Note: the errors say that no STONITH resources have been defined. Since we have no STONITH device in this setup, we disable the property for now.
crm(live)configure# property stonith-enabled=false
crm(live)configure# show
node node1.test.com
node node2.test.com
property $id="cib-bootstrap-options" \
dc-version="1.1.8-7.el6-394e906" \
cluster-infrastructure="classic openais (with plugin)" \
expected-quorum-votes="2" \
stonith-enabled="false"
crm(live)configure# verify #no errors are reported now
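If you want to run this check from a script, one option is to count the ERROR lines in the verify output. The following is a sketch only: count_verify_errors is a hypothetical helper, and the sample text is the verify output shown above.

```shell
#!/bin/sh
# count_verify_errors is a hypothetical helper: it reads verify output on
# stdin and prints how many lines contain "ERROR:".
count_verify_errors() {
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -c 'ERROR:' || true
}

# Sample input copied from the verify run before STONITH was disabled.
sample='crm_verify[5202]: 2011/06/14_19:10:38 ERROR: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
crm_verify[5202]: 2011/06/14_19:10:38 ERROR: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
crm_verify[5202]: 2011/06/14_19:10:38 ERROR: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity'

printf '%s\n' "$sample" | count_verify_errors
```

On a live node the input could come from crm_verify -L -V 2>&1, which checks the running configuration. Keep in mind what the third message says: disabling STONITH is acceptable for this test setup, but clusters with shared data need it.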
(3). List the resource agent classes supported by the cluster
crm(live)# ra
crm(live)ra# classes
lsb
ocf / heartbeat pacemaker RedHat
service
stonith
(4). List all resource agents in a given class
crm(live)ra# list lsb
auditd blk-availability corosync corosync-notifyd crond halt
htcacheclean httpd ip6tables iptables killall lvm2-lvmetad
lvm2-monitor messagebus netconsole netfs network nfs
nfslock ntpd ntpdate pacemaker postfix quota_nld
rdisc restorecond rpcbind rpcgssd rpcidmapd rpcsvcgssd
rsyslog sandbox saslauthd single sshd svnserve
udev-post winbind
crm(live)ra# list ocf heartbeat
AoEtarget AudibleAlarm CTDB ClusterMon Delay Dummy
EvmsSCC Evmsd Filesystem ICP IPaddr IPaddr2
IPsrcaddr IPv6addr LVM LinuxSCSI MailTo ManageRAID
ManageVE Pure-FTPd Raid1 Route SAPDatabase SAPInstance
SendArp ServeRAID SphinxSearchDaemon Squid Stateful SysInfo
VIPArip VirtualDomain WAS WAS6 WinPopup Xen
Xinetd anything apache conntrackd db2 drbd
eDir88 ethmonitor exportfs fio iSCSILogicalUnit iSCSITarget
ids iscsi jboss lxc mysql mysql-proxy
nfsserver nginx Oracle oralsnr pgsql pingd
portblock postfix proftpd rsyncd scsi2reservation sfex
symlink syslog-ng tomcat vmware
crm(live)ra# list ocf pacemaker
ClusterMon Dummy HealthCPU HealthSMART Stateful SysInfo SystemHealth controld
o2cb ping pingd
(5). View how to configure a specific resource agent
crm(live)ra# info ocf:heartbeat:IPaddr
Manages virtual IPv4 addresses (portable version) (ocf:heartbeat:IPaddr)
This script manages IP alias IP addresses
It can add an IP alias, or remove one.
Parameters (* denotes required, [] the default):
ip* (string): IPv4 address
The IPv4 address to be configured in dotted quad notation, for example
"192.168.1.1".
nic (string, [eth0]): Network interface
The base network interface on which the IP address will be brought
online.
If left empty, the script will try and determine this from the
routing table.
Do NOT specify an alias interface in the form eth0:1 or anything here;
rather, specify the base interface only.
Prerequisite:
There must be at least one static IP address, which is not managed by
the cluster, assigned to the network interface.
If you can not assign any static IP address on the interface,
:
(6). Create an IP address resource for the web cluster (an IP resource is a primitive resource, so we first look at how a primitive is defined)
crm(live)# configure
crm(live)configure# primitive
usage: primitive <rsc> {[<class>:[<provider>:]]<type>|@<template>}
[params <param>=<value> [<param>=<value>...]]
[meta <attribute>=<value> [<attribute>=<value>...]]
[utilization <attribute>=<value> [<attribute>=<value>...]]
[operations id_spec
[op op_type [<attribute>=<value>...] ...]]
crm(live)configure# primitive vip ocf:heartbeat:IPaddr params ip=192.168.18.200 nic=eth0 cidr_netmask=24 #add a VIP resource
crm(live)configure# show #view the newly added VIP (the "primitive vip" entry)
node node1.test.com
node node2.test.com
primitive vip ocf:heartbeat:IPaddr \
params ip="192.168.18.200" nic="eth0" cidr_netmask="24"
property $id="cib-bootstrap-options" \
dc-version="1.1.8-7.el6-394e906" \
cluster-infrastructure="classic openais (with plugin)" \
expected-quorum-votes="2" \
stonith-enabled="false"
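crm(live)configure# verify #check the configuration for errors
crm(live)configure# commit #commit the configured resource. When configuring from the command line, nothing takes effect until you run commit; once committed, the change is written to the cib.xml configuration file
crm(live)# status #check the resource status: there is one resource, vip, running on node1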
Last updated: Thu Aug 15 14:24:45 2013
Last change: Thu Aug 15 14:21:21 2013 via cibadmin on node1.test.com
Stack: classic openais (with plugin)
Current DC: node1.test.com - partition with quorum
Version: 1.1.8-7.el6-394e906
2 Nodes configured, 2 expected votes
1 Resources configured.
Online: [ node1.test.com node2.test.com ]
vip (ocf::heartbeat:IPaddr): Started node1.test.com
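To make the VIP definition repeatable, the primitive line can be generated from parameters. This is a minimal sketch: vip_primitive is a hypothetical helper, and it only prints the line; nothing is sent to the cluster.

```shell
#!/bin/sh
# vip_primitive is a hypothetical helper that prints a crm "primitive" line
# for a virtual IP managed by ocf:heartbeat:IPaddr.
# Arguments: resource name, IPv4 address, base NIC, CIDR netmask.
vip_primitive() {
    printf 'primitive %s ocf:heartbeat:IPaddr params ip=%s nic=%s cidr_netmask=%s\n' \
        "$1" "$2" "$3" "$4"
}

# Reproduce the definition used in this article.
vip_primitive vip 192.168.18.200 eth0 24
```

The output could be saved to a file and imported with the load command from the configure command list shown earlier, for example crm configure load update vip.crm, where vip.crm is a file name made up for this example. I have not shown that run here, so check the result with crm configure show afterwards.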
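Check the IP addresses on node1: the VIP is active (see eth0:0 below). After that, from node2, we stop the corosync service on node1 with the command shown further down and check the status again.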
[root@node1 ~]# ifconfig
eth0 Link encap:Ethernet HWaddr 00:0C:29:91:45:90
inet addr:192.168.18.201 Bcast:192.168.18.255 Mask:255.255.255.0
inet6 addr: fe80::20c:29ff:fe91:4590/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:375197 errors:0 dropped:0 overruns:0 frame:0
TX packets:291575 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:55551264 (52.9 MiB) TX bytes:52697225 (50.2 MiB)
eth0:0 Link encap:Ethernet HWaddr 00:0C:29:91:45:90
inet addr:192.168.18.200 Bcast:192.168.18.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:6473 errors:0 dropped:0 overruns:0 frame:0
TX packets:6473 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:875395 (854.8 KiB) TX bytes:875395 (854.8 KiB)
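Instead of reading the ifconfig output by eye, a script can look for the address. The following is a sketch under the assumption that the input is either ifconfig output in the format shown above or ip -4 addr show output; has_addr is a hypothetical helper.

```shell
#!/bin/sh
# has_addr is a hypothetical helper: it succeeds if the IPv4 address in $1
# appears in interface listing text on stdin. It matches the ifconfig form
# ("inet addr:A.B.C.D ") and the ip form ("inet A.B.C.D/").
has_addr() {
    grep -F -e "inet addr:$1 " -e "inet $1/" >/dev/null
}

# Sample line copied from the eth0:0 entry above.
sample='inet addr:192.168.18.200 Bcast:192.168.18.255 Mask:255.255.255.0'

if printf '%s\n' "$sample" | has_addr 192.168.18.200; then
    echo "VIP 192.168.18.200 is configured here"
else
    echo "VIP 192.168.18.200 is not configured here"
fi
```

On a node this would be used as ifconfig | has_addr 192.168.18.200. The trailing space and slash in the patterns keep 192.168.18.20 from matching 192.168.18.200.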
Test: stop corosync on node1. The status output then shows node1 as offline.
[root@node2 ~]# ssh node1 "service corosync stop"
Signaling Corosync Cluster Engine (corosync) to terminate: [OK]
Waiting for corosync services to unload:..[OK]
[root@node2 ~]# crm status
Cannot change active directory to /var/lib/pacemaker/cores/root: No such file or directory (2)
Last updated: Thu Aug 15 14:29:04 2013
Last change: Thu Aug 15 14:21:21 2013 via cibadmin on node1.test.com
Stack: classic openais (with plugin)
Current DC: node2.test.com - partition WITHOUT quorum
Version: 1.1.8-7.el6-394e906
2 Nodes configured, 2 expected votes
1 Resources configured.
Online: [ node2.test.com ]
OFFLINE: [ node1.test.com ]
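Important: the output above shows that node1.test.com is offline, yet the vip resource did not start on node2.test.com. This is because the cluster state is now "WITHOUT quorum" (see the Current DC line): quorum has been lost, so the cluster no longer meets the conditions for running resources normally. For a cluster with only two nodes this behavior is unreasonable, because losing either node always means losing quorum. We can therefore tell the cluster to ignore the loss of quorum with the following command: property no-quorum-policy=ignore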
crm(live)# configure
crm(live)configure# property no-quorum-policy=ignore
crm(live)configure# show
node node1.test.com
node node2.test.com
primitive vip ocf:heartbeat:IPaddr \
params ip="192.168.18.200" nic="eth0" cidr_netmask="24"
property $id="cib-bootstrap-options" \
dc-version="1.1.8-7.el6-394e906" \
cluster-infrastructure="classic openais (with plugin)" \
expected-quorum-votes="2" \
stonith-enabled="false" \
no-quorum-policy="ignore"
crm(live)configure# verify
crm(live)configure# commit
Shortly afterwards, the cluster starts the resource on node2, the node that is still running, as shown below:
[root@node2 ~]# crm status
Cannot change active directory to /var/lib/pacemaker/cores/root: No such file or directory (2)
Last updated: Thu Aug 15 14:38:23 2013
Last change: Thu Aug 15 14:37:08 2013 via cibadmin on node2.test.com
Stack: classic openais (with plugin)
Current DC: node2.test.com - partition WITHOUT quorum
Version: 1.1.8-7.el6-394e906
2 Nodes configured, 2 expected votes
1 Resources configured.
Online: [ node2.test.com ]
OFFLINE: [ node1.test.com ]
vip (ocf::heartbeat:IPaddr): Started node2.test.com
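Why a lone node in a two-node cluster is "WITHOUT quorum" comes down to simple arithmetic. The sketch below assumes a plain majority rule (more than half of the expected votes); the function names are made up for this example, and it does not query the cluster.

```shell
#!/bin/sh
# Votes needed for quorum under a simple-majority rule: floor(n/2) + 1.
# Both function names are hypothetical and only used in this sketch.
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

# $1 = expected votes, $2 = votes currently present.
has_quorum() {
    [ "$2" -ge "$(quorum_needed "$1")" ]
}

for case in "2 1" "2 2" "3 2"; do
    set -- $case
    if has_quorum "$1" "$2"; then
        echo "expected=$1 present=$2: with quorum"
    else
        echo "expected=$1 present=$2: WITHOUT quorum"
    fi
done
```

With 2 expected votes the threshold is 2, so a single surviving node can never reach it. With 3 expected votes, 2 nodes are enough, which is why no-quorum-policy=ignore is mainly a two-node workaround.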
With the verification done, we start node1.test.com again.
[root@node2 ~]# ssh node1 "service corosync start"
Starting Corosync Cluster Engine (corosync): [OK]
[root@node2 ~]# crm status
Cannot change active directory to /var/lib/pacemaker/cores/root: No such file or directory (2)
Last updated: Thu Aug 15 14:39:45 2013
Last change: Thu Aug 15 14:37:08 2013 via cibadmin on node2.test.com
Stack: classic openais (with plugin)
Current DC: node2.test.com - partition with quorum
Version: 1.1.8-7.el6-394e906
2 Nodes configured, 2 expected votes
1 Resources configured.
Online: [ node1.test.com node2.test.com ]
vip (ocf::heartbeat:IPaddr): Started node2.test.com
[root@node2 ~]#
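When repeating this failover test, it helps to extract the node a resource is running on from the status output. This is a sketch only, assuming the status line format shown above; resource_node is a hypothetical helper.

```shell
#!/bin/sh
# resource_node is a hypothetical helper: $1 is a resource name, stdin is
# status output. It prints the node from a line of the form
# "<name> (<agent>): Started <node>".
resource_node() {
    awk -v rsc="$1" 'NF >= 3 && $1 == rsc && $(NF-1) == "Started" { print $NF }'
}

# Sample input copied from the status output above.
sample='Online: [ node1.test.com node2.test.com ]
vip (ocf::heartbeat:IPaddr): Started node2.test.com'

printf '%s\n' "$sample" | resource_node vip
```

On a cluster node the input would come from crm status or crm_mon -1. An empty result means the resource is not reported as started anywhere.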
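After node1.test.com is started again, the vip resource may well move back from node2.test.com to node1.test.com, or it may stay where it is. Every such move between nodes leaves the resource unreachable for a short time. So sometimes, after a resource has failed over to another node because of a node failure, we want to prevent it from moving back even once the original node has recovered. This is done by defining resource stickiness, which can be specified either when the resource is created or afterwards. Next, let's briefly review resource stickiness.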