Troubleshooting

There are any number of reasons, voluntary or involuntary, why the active node in an HA cluster can stop working properly. The node may lose network connectivity to its peers, the heartbeat process may be stopped, an environmental mishap may occur, and so on. To fail the active node deliberately, you can ask it to halt or place it in standby mode with the hb_gui command (a clean shutdown). If you want to test the robustness of the environment, though, you may want something more aggressive.
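If a graceful hb_gui shutdown is not drastic enough, you can attack the cluster stack or its communication path on the active node directly. The following is a minimal sketch, assuming the standard heartbeat init script location and that eth1 carries the heartbeat traffic, as it does on the litsha nodes in the listings below:

# Abrupt but clean: stop the heartbeat daemon on the active node
/etc/init.d/heartbeat stop

# Harsher: drop the interface carrying heartbeat traffic (eth1 on these
# nodes) so the surviving members declare this node dead
ifconfig eth1 down

Either action causes the remaining members to notice the failure, re-evaluate quorum, and re-place the floating resource, which is exactly the sequence the log excerpts below illustrate.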

Indicators and failover

Two log-file indicators are available to the system administrator of a Linux HA heartbeat system. Which messages appear depends on whether the node is the recipient of the floating resource IP address. Log output on a cluster member that did not receive the floating resource IP address looks like this:

Listing 10. Log output from the node that did not win the resource

litsha21:~ # cat /var/log/messages
Jan 16 12:00:20 litsha21 heartbeat: [3057]: WARN: node litsha23: is dead
Jan 16 12:00:21 litsha21 cib: [3065]: info: mem_handle_event: Got an event OC_EV_MS_NOT_PRIMARY from ccm
Jan 16 12:00:21 litsha21 cib: [3065]: info: mem_handle_event: instance=13, nodes=3, new=1, lost=0, n_idx=0, new_idx=3, old_idx=6
Jan 16 12:00:21 litsha21 crmd: [3069]: info: mem_handle_event: Got an event OC_EV_MS_NOT_PRIMARY from ccm
Jan 16 12:00:21 litsha21 crmd: [3069]: info: mem_handle_event: instance=13, nodes=3, new=1, lost=0, n_idx=0, new_idx=3, old_idx=6
Jan 16 12:00:21 litsha21 crmd: [3069]: info: crmd_ccm_msg_callback:callbacks.c Quorum lost after event=NOT PRIMARY (id=13)
Jan 16 12:00:21 litsha21 heartbeat: [3057]: info: Link litsha23:eth1 dead.
Jan 16 12:00:38 litsha21 ccm: [3064]: debug: quorum plugin: majority
Jan 16 12:00:38 litsha21 ccm: [3064]: debug: cluster:linux-ha, member_count=2, member_quorum_votes=200
Jan 16 12:00:38 litsha21 ccm: [3064]: debug: total_node_count=3, total_quorum_votes=300
............... Truncated For Brevity ...............
Jan 16 12:00:40 litsha21 crmd: [3069]: info: update_dc:utils.c Set DC to litsha21 (1.0.6)
Jan 16 12:00:41 litsha21 crmd: [3069]: info: do_state_transition:fsa.c litsha21: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
Jan 16 12:00:41 litsha21 crmd: [3069]: info: do_state_transition:fsa.c All 2 cluster nodes responded to the join offer.
Jan 16 12:00:41 litsha21 crmd: [3069]: info: update_attrd:join_dc.c Connecting to attrd...
Jan 16 12:00:41 litsha21 cib: [3065]: info: sync_our_cib:messages.c Syncing CIB to all peers
Jan 16 12:00:41 litsha21 attrd: [3068]: info: attrd_local_callback:attrd.c Sending full refresh
............... Truncated For Brevity ...............
Jan 16 12:00:43 litsha21 pengine: [3112]: info: unpack_nodes:unpack.c Node litsha21 is in standby-mode
Jan 16 12:00:43 litsha21 pengine: [3112]: info: determine_online_status:unpack.c Node litsha21 is online
Jan 16 12:00:43 litsha21 pengine: [3112]: info: determine_online_status:unpack.c Node litsha22 is online
Jan 16 12:00:43 litsha21 pengine: [3112]: info: IPaddr_1   (heartbeat::ocf:IPaddr):   Stopped
Jan 16 12:00:43 litsha21 pengine: [3112]: notice: StartRsc:native.c litsha22   Start IPaddr_1
Jan 16 12:00:43 litsha21 pengine: [3112]: notice: Recurring:native.c litsha22   IPaddr_1_monitor_5000
Jan 16 12:00:43 litsha21 pengine: [3112]: notice: stage8:stages.c Created transition graph 0.
............... Truncated For Brevity ...............
Jan 16 12:00:46 litsha21 mgmtd: [3070]: debug: update cib finished
Jan 16 12:00:46 litsha21 crmd: [3069]: info: do_state_transition:fsa.c litsha21: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_IPC_MESSAGE origin=do_msg_route ]
Jan 16 12:00:46 litsha21 cib: [3118]: info: write_cib_contents:io.c Wrote version 0.53.593 of the CIB to disk (digest: 83b00c386e8b67c42d033a4141aaef90)

As Listing 10 shows, a roll call is taken and enough members remain to constitute a quorum. A vote is held, and since this node does not win the floating resource, no further action is required for it to keep operating normally.
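If you only want the quorum and membership decisions rather than the full log, a quick filter over /var/log/messages is enough. Here is a small sketch; the patterns match the message strings shown in the listings in this article, so adjust them if your heartbeat version logs differently:

# Show recent quorum transitions and dead-node warnings on this member
grep -E "Quorum (lost|\(re\)attained)|node .*: is dead" /var/log/messages | tail -n 20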

In contrast, the log output on the cluster member that did receive the floating resource IP address looks like this:

Listing 11. Log file of the resource holder

litsha22:~ # cat /var/log/messages
Jan 16 12:00:06 litsha22 syslog-ng[1276]: STATS: dropped 0
Jan 16 12:01:51 litsha22 heartbeat: [3892]: WARN: node litsha23: is dead
Jan 16 12:01:51 litsha22 heartbeat: [3892]: info: Link litsha23:eth1 dead.
Jan 16 12:01:51 litsha22 cib: [3900]: info: mem_handle_event: Got an event OC_EV_MS_NOT_PRIMARY from ccm
Jan 16 12:01:51 litsha22 cib: [3900]: info: mem_handle_event: instance=13, nodes=3, new=3, lost=0, n_idx=0, new_idx=0, old_idx=6
Jan 16 12:01:51 litsha22 crmd: [3904]: info: mem_handle_event: Got an event OC_EV_MS_NOT_PRIMARY from ccm
Jan 16 12:01:51 litsha22 crmd: [3904]: info: mem_handle_event: instance=13, nodes=3, new=3, lost=0, n_idx=0, new_idx=0, old_idx=6
Jan 16 12:01:51 litsha22 crmd: [3904]: info: crmd_ccm_msg_callback:callbacks.c Quorum lost after event=NOT PRIMARY (id=13)
Jan 16 12:02:09 litsha22 ccm: [3899]: debug: quorum plugin: majority
Jan 16 12:02:09 litsha22 crmd: [3904]: info: do_election_count_vote:election.c Election check: vote from litsha21
Jan 16 12:02:09 litsha22 ccm: [3899]: debug: cluster:linux-ha, member_count=2, member_quorum_votes=200
Jan 16 12:02:09 litsha22 ccm: [3899]: debug: total_node_count=3, total_quorum_votes=300
Jan 16 12:02:09 litsha22 cib: [3900]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
Jan 16 12:02:09 litsha22 cib: [3900]: info: mem_handle_event: no mbr_track info
Jan 16 12:02:09 litsha22 cib: [3900]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
Jan 16 12:02:09 litsha22 cib: [3900]: info: mem_handle_event: instance=14, nodes=2, new=0, lost=1, n_idx=0, new_idx=2, old_idx=5
Jan 16 12:02:09 litsha22 cib: [3900]: info: cib_ccm_msg_callback:callbacks.c LOST: litsha23
Jan 16 12:02:09 litsha22 cib: [3900]: info: cib_ccm_msg_callback:callbacks.c PEER: litsha21
Jan 16 12:02:09 litsha22 cib: [3900]: info: cib_ccm_msg_callback:callbacks.c PEER: litsha22
............... Truncated For Brevity ...............
Jan 16 12:02:12 litsha22 crmd: [3904]: info: update_dc:utils.c Set DC to litsha21 (1.0.6)
Jan 16 12:02:12 litsha22 crmd: [3904]: info: do_state_transition:fsa.c litsha22: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Jan 16 12:02:12 litsha22 cib: [3900]: info: cib_diff_notify:notify.c Update (client: 3069, call:25): 0.52.585 -> 0.52.586 (ok)
............... Truncated For Brevity ...............
Jan 16 12:02:14 litsha22 IPaddr[3998]: INFO: /sbin/ifconfig eth0:0 192.168.71.205 netmask 255.255.255.0 broadcast 192.168.71.255
Jan 16 12:02:14 litsha22 IPaddr[3998]: INFO: Sending Gratuitous Arp for 192.168.71.205 on eth0:0 [eth0]
Jan 16 12:02:14 litsha22 IPaddr[3998]: INFO: /usr/lib64/heartbeat/send_arp -i 500 -r 10 -p /var/run/heartbeat/rsctmp/send_arp/send_arp-192.168.71.205 eth0 192.168.71.205 auto 192.168.71.205 ffffffffffff
Jan 16 12:02:14 litsha22 crmd: [3904]: info: process_lrm_event:lrm.c LRM operation (3) start_0 on IPaddr_1 complete
Jan 16 12:02:14 litsha22 kernel: send_arp uses obsolete (PF_INET,SOCK_PACKET)
Jan 16 12:02:14 litsha22 kernel: klogd 1.4.1, ---------- state change ----------
Jan 16 12:02:14 litsha22 kernel: NET: Registered protocol family 17
Jan 16 12:02:15 litsha22 crmd: [3904]: info: do_lrm_rsc_op:lrm.c Performing op monitor on IPaddr_1 (interval=5000ms, key=0:f9d962f0-4ed6-462d-a28d-e27b6532884c)
Jan 16 12:02:15 litsha22 cib: [3900]: info: cib_diff_notify:notify.c Update (client: 3904, call:18): 0.53.591 -> 0.53.592 (ok)
Jan 16 12:02:15 litsha22 mgmtd: [3905]: debug: update cib finished

As Listing 11 shows, the /var/log/messages file on this node records that it has acquired the floating resource. The roll call and quorum vote happen just as before, but this time the node wins: the ifconfig line shows the eth0:0 alias device being created dynamically to claim the floating resource IP address and keep the service reachable.
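You can confirm from a shell on the resource holder that the alias interface is up and carrying the service address. A quick check using the same ifconfig tool the IPaddr resource agent invokes (the ip command gives an equivalent view on newer systems):

# The alias created by the IPaddr resource agent
ifconfig eth0:0

# Equivalent view: the floating address shows up as a secondary address on eth0
ip addr show dev eth0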

Another way to tell when a failover has happened is to log in to any cluster member and run the hb_gui command; you can then see for yourself which system currently owns the floating resource.
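If you prefer the command line to the GUI, the crm_mon utility shipped with heartbeat version 2 prints the same placement information; treat this as an optional alternative to hb_gui rather than part of the procedure described here:

# One-shot status snapshot: lists online nodes and where IPaddr_1 is running
crm_mon -1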

Finally, it would be remiss not to show a sample log from a no-quorum situation. If a single node cannot communicate with any of its peers, it has lost quorum (since 2 out of 3 is the majority in a three-member cluster). In that case the node recognizes the loss of quorum and invokes the no-quorum-policy handler. Listing 12 shows the log from such an event: the entries report that quorum has been lost, so the node writing them concludes that it must not hold the floating resource, and the ifconfig down statement releases it.

Listing 12. Log entries showing the loss of quorum

litsha22:~ # cat /var/log/messages
....................
Jan 16 12:06:12 litsha22 ccm: [3899]: debug: quorum plugin: majority
Jan 16 12:06:12 litsha22 ccm: [3899]: debug: cluster:linux-ha, member_count=1, member_quorum_votes=100
Jan 16 12:06:12 litsha22 ccm: [3899]: debug: total_node_count=3, total_quorum_votes=300
............... Truncated For Brevity ...............
Jan 16 12:06:12 litsha22 crmd: [3904]: info: crmd_ccm_msg_callback:callbacks.c Quorum lost after event=INVALID (id=15)
Jan 16 12:06:12 litsha22 crmd: [3904]: WARN: check_dead_member:ccm.c Our DC node (litsha21) left the cluster
............... Truncated For Brevity ...............
Jan 16 12:06:14 litsha22 IPaddr[5145]: INFO: /sbin/route -n del -host 192.168.71.205
Jan 16 12:06:15 litsha22 lrmd: [1619]: info: RA output: (IPaddr_1:stop:stderr) SIOCDELRT: No such process
Jan 16 12:06:15 litsha22 IPaddr[5145]: INFO: /sbin/ifconfig eth0:0 192.168.71.205 down
Jan 16 12:06:15 litsha22 IPaddr[5145]: INFO: IP Address 192.168.71.205 released
Jan 16 12:06:15 litsha22 crmd: [3904]: info: process_lrm_event:lrm.c LRM operation (6) stop_0 on IPaddr_1 complete
Jan 16 12:06:15 litsha22 cib: [3900]: info: cib_diff_notify:notify.c Update (client: 3904, call:32): 0.54.599 -> 0.54.600 (ok)
Jan 16 12:06:15 litsha22 mgmtd: [3905]: debug: update cib finished

As Listing 12 shows, when a node loses quorum it gives up its resources, because the no-quorum policy configured here tells it to. Whether to adopt such a no-quorum policy is entirely your choice.
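To see which no-quorum behavior your cluster is actually configured with, you can dump the cluster-wide options from the CIB and look for the policy attribute. A sketch, assuming the cibadmin tool from heartbeat version 2; the exact attribute name varies between releases, so check it against your own CIB:

# Dump cluster-wide options and filter for the quorum-policy setting
cibadmin -Q -o crm_config | grep -i quorum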

Fail-back actions and messages

One of the more pleasant consequences of a correctly configured Linux HA system is that you do not have to do anything special to re-instantiate a cluster member. Simply bringing the Linux instance back up lets the node rejoin its peers automatically. If you configured a preferred node (one that should acquire the floating resource in preference to the others), it will reclaim the floating resource automatically; non-preferred systems simply join the pool of eligible nodes and operate normally.
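Rejoining a recovered node is therefore just a matter of bringing its cluster stack back up and watching the peers react. A minimal sketch, run on the node that was taken down:

# Start the cluster stack again; the node announces itself and rejoins
/etc/init.d/heartbeat start

# Follow the rejoin, quorum (re)attainment, and resource placement as they happen
tail -f /var/log/messages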

Adding a Linux instance back into the pool notifies every node and, if possible, re-establishes quorum. Once quorum is re-established, the floating resource is re-established on one of the nodes.

Listing 13. Re-establishing quorum

litsha22:~ # tail -f /var/log/messages
Jan 16 12:09:02 litsha22 heartbeat: [3892]: info: Heartbeat restart on node litsha21
Jan 16 12:09:02 litsha22 heartbeat: [3892]: info: Link litsha21:eth1 up.
Jan 16 12:09:02 litsha22 heartbeat: [3892]: info: Status update for node litsha21: status init
Jan 16 12:09:02 litsha22 heartbeat: [3892]: info: Status update for node litsha21: status up
Jan 16 12:09:22 litsha22 heartbeat: [3892]: debug: get_delnodelist: delnodelist=
Jan 16 12:09:22 litsha22 heartbeat: [3892]: info: Status update for node litsha21: status active
Jan 16 12:09:22 litsha22 cib: [3900]: info: cib_client_status_callback:callbacks.c Status update: Client litsha21/cib now has status [join]
Jan 16 12:09:23 litsha22 heartbeat: [3892]: WARN: 1 lost packet(s) for [litsha21] [36:38]
Jan 16 12:09:23 litsha22 heartbeat: [3892]: info: No pkts missing from litsha21!
Jan 16 12:09:23 litsha22 crmd: [3904]: notice: crmd_client_status_callback:callbacks.c Status update: Client litsha21/crmd now has status [online]
....................
Jan 16 12:09:31 litsha22 crmd: [3904]: info: crmd_ccm_msg_callback:callbacks.c Quorum (re)attained after event=NEW MEMBERSHIP (id=16)
Jan 16 12:09:31 litsha22 crmd: [3904]: info: ccm_event_detail:ccm.c NEW MEMBERSHIP: trans=16, nodes=2, new=1, lost=0 n_idx=0, new_idx=2, old_idx=5
Jan 16 12:09:31 litsha22 crmd: [3904]: info: ccm_event_detail:ccm.c CURRENT: litsha22 [nodeid=1, born=13]
Jan 16 12:09:31 litsha22 crmd: [3904]: info: ccm_event_detail:ccm.c CURRENT: litsha21 [nodeid=0, born=16]
Jan 16 12:09:31 litsha22 crmd: [3904]: info: ccm_event_detail:ccm.c NEW: litsha21 [nodeid=0, born=16]
Jan 16 12:09:31 litsha22 cib: [3900]: info: cib_diff_notify:notify.c Local-only Change (client:3904, call: 35): 0.54.600 (ok)
Jan 16 12:09:31 litsha22 mgmtd: [3905]: debug: update cib finished
....................
Jan 16 12:09:34 litsha22 crmd: [3904]: info: update_dc:utils.c Set DC to litsha22 (1.0.6)
Jan 16 12:09:35 litsha22 cib: [3900]: info: sync_our_cib:messages.c Syncing CIB to litsha21
Jan 16 12:09:35 litsha22 crmd: [3904]: info: do_state_transition:fsa.c litsha22: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
Jan 16 12:09:35 litsha22 crmd: [3904]: info: do_state_transition:fsa.c All 2 cluster nodes responded to the join offer.
Jan 16 12:09:35 litsha22 attrd: [3903]: info: attrd_local_callback:attrd.c Sending full refresh
Jan 16 12:09:35 litsha22 cib: [3900]: info: sync_our_cib:messages.c Syncing CIB to all peers
....................
Jan 16 12:09:37 litsha22 tengine: [5119]: info: send_rsc_command:actions.c Initiating action 4: IPaddr_1_start_0 on litsha22
Jan 16 12:09:37 litsha22 tengine: [5119]: info: send_rsc_command:actions.c Initiating action 2: probe_complete on litsha21
Jan 16 12:09:37 litsha22 crmd: [3904]: info: do_lrm_rsc_op:lrm.c Performing op start on IPaddr_1 (interval=0ms, key=2:c5131d14-a9d9-400c-a4b1-60d8f5fbbcce)
Jan 16 12:09:37 litsha22 pengine: [5120]: info: process_pe_message:pengine.c Transition 2: PEngine Input stored in: /var/lib/heartbeat/pengine/pe-input-72.bz2
Jan 16 12:09:37 litsha22 IPaddr[5196]: INFO: /sbin/ifconfig eth0:0 192.168.71.205 netmask 255.255.255.0 broadcast 192.168.71.255
Jan 16 12:09:37 litsha22 IPaddr[5196]: INFO: Sending Gratuitous Arp for 192.168.71.205 on eth0:0 [eth0]
Jan 16 12:09:37 litsha22 IPaddr[5196]: INFO: /usr/lib64/heartbeat/send_arp -i 500 -r 10 -p /var/run/heartbeat/rsctmp/send_arp/send_arp-192.168.71.205 eth0 192.168.71.205 auto 192.168.71.205 ffffffffffff
Jan 16 12:09:37 litsha22 crmd: [3904]: info: process_lrm_event:lrm.c LRM operation (7) start_0 on IPaddr_1 complete
Jan 16 12:09:37 litsha22 cib: [3900]: info: cib_diff_notify:notify.c Update (client: 3904, call:46): 0.55.607 -> 0.55.608 (ok)
Jan 16 12:09:37 litsha22 mgmtd: [3905]: debug: update cib finished
Jan 16 12:09:37 litsha22 tengine: [5119]: info: te_update_diff:callbacks.c Processing diff (cib_update): 0.55.607 -> 0.55.608
Jan 16 12:09:37 litsha22 tengine: [5119]: info: match_graph_event:events.c Action IPaddr_1_start_0 (4) confirmed
Jan 16 12:09:37 litsha22 tengine: [5119]: info: send_rsc_command:actions.c Initiating action 5: IPaddr_1_monitor_5000 on litsha22
Jan 16 12:09:37 litsha22 crmd: [3904]: info: do_lrm_rsc_op:lrm.c Performing op monitor on IPaddr_1 (interval=5000ms, key=2:c5131d14-a9d9-400c-a4b1-60d8f5fbbcce)
Jan 16 12:09:37 litsha22 cib: [5268]: info: write_cib_contents:io.c Wrote version 0.55.608 of the CIB to disk (digest: 98cb6685c25d14131c49a998dbbd0c35)
Jan 16 12:09:37 litsha22 crmd: [3904]: info: process_lrm_event:lrm.c LRM operation (8) monitor_5000 on IPaddr_1 complete
Jan 16 12:09:38 litsha22 cib: [3900]: info: cib_diff_notify:notify.c Update (client: 3904, call:47): 0.55.608 -> 0.55.609 (ok)
Jan 16 12:09:38 litsha22 mgmtd: [3905]: debug: update cib finished

In Listing 13 you can see that quorum has been re-established. Once quorum is regained, a vote is held, and litsha22 becomes the active node holding the floating resource.




Conclusion


High availability is best approached as a series of challenges, and the solution described in this article is only the first step. From here there are many ways to carry your development environment forward: you might install redundant networking, add a cluster file system to back the realservers, or install more advanced middleware that supports clustering directly.