■Before that, a preliminary step.
Add a setting that ignores the quorum policy (quorum is only meaningful with three or more nodes).
$ cat quorum-policy.txt
configure
property no-quorum-policy=ignore
$ sudo crm < quorum-policy.txt
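To confirm the property actually took effect, the live configuration can be dumped from the crm shell (a quick check in the same crm syntax used elsewhere in this post):

```shell
# Confirm no-quorum-policy is now present in the live configuration
sudo crm configure show | grep no-quorum-policy
```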
■Checking the available STONITH plugins
It seems external/ssh is meant for testing only and is not supposed to be used in production.
I want to test failover under the condition that the heartbeat process has gone wrong.
$ sudo crm ra list stonith
apcmaster apcmastersnmp apcsmart
baytech bladehpi cyclades
drac3 external/drac5 external/dracmc-telnet
external/hetzner external/hmchttp external/ibmrsa
external/ibmrsa-telnet external/ipmi external/ippower9258
external/kdumpcheck external/libvirt external/nut
external/rackpdu external/riloe external/sbd
external/ssh external/vcenter external/vmware
external/xen0 external/xen0-ha fence_legacy
fence_pcmk ibmhmc ipmilan
meatware null nw_rpc100s
rcd_serial rps10 ssh
suicide wti_mpc wti_nps
■Re-checking fencing against the configuration documentation.
$ lv -s /usr/share/doc/pacemaker/crm_fencing.txt.gz | grep "clone fencing"
clone fencing st-null
clone fencing st-ssh
$ sudo stonith -L | grep ssh
external/ssh
ssh
$ lv -s /usr/share/doc/pacemaker/crm_fencing.txt.gz | grep -A 5 "external/ssh config"
external/ssh configuration:
primitive st-ssh stonith:external/ssh \
params hostlist="node1 node2"
clone fencing st-ssh
■Applying this as-is does not work: the running configuration cannot be changed this way.
$ cp sample.txt external-ssh.txt
$ cat external-ssh.txt
configure
primitive st-ssh stonith:external/ssh \
params hostlist="xen-debian1 xen-debian2"
clone fencing st-ssh
commit
$ sudo crm < external-ssh.txt
ERROR: 3: st-ssh: id is already in use
ERROR: 4: fencing: id is already in use
INFO: 5: apparently there is nothing to commit
INFO: 5: try changing something first
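The id collisions come from the st-null primitive and fencing clone created earlier. One way out (an assumption on my part; below I instead edit the XML directly with cibadmin) would be to stop and delete the old resources from the crm shell before re-adding:

```shell
# Hypothetical cleanup: stop the old clone, then delete both colliding ids
sudo crm resource stop fencing
sudo crm configure delete fencing st-null
```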
■When crmd (heartbeat) is stopped, crm cannot connect.
The child processes started by the hacluster user are the following.
$ pstree -u hacluster | grep -v "^\$"
attrd
ccm
cib
crmd
■Run pstree on the PPID of the processes running under hacluster's numeric user ID (field 3 of /etc/passwd, which is what the command below extracts).
The parent process is heartbeat.
$ ps -ef | grep heartbeat | \
grep "^`grep hacluster /etc/passwd | awk -F\: '{print $3}'`" | \
awk '{print $3}' | \
sort -u | \
pstree `xargs`
heartbeat─┬─attrd
├─ccm
├─cib
├─crmd
├─3*[heartbeat]
├─lrmd
└─stonithd
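Incidentally, the `grep hacluster /etc/passwd | awk -F: '{print $3}'` step above pulls the third passwd field, the numeric UID. An exact-match awk on the user name is a little more robust (sketched here with root, whose UID is always 0, since hacluster does not exist on every machine):

```shell
# Look up a user's numeric UID from /etc/passwd by exact name match
uid=$(awk -F: '$1 == "root" {print $3}' /etc/passwd)
echo "$uid"
```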
■That was a digression.
The behavior of the running cluster can be changed by editing the XML with cibadmin and replacing it.
Two STONITH resources already exist from the steps above, so deleting one of them is enough.
If you skipped those steps, rewrite st-null directly instead.
$ sudo cibadmin --cib_query > tmp.xml
---edit---
primitive class="stonith" id="st-ssh" type="external/ssh"
---edit---
$ sudo cibadmin --cib_replace --xml-file tmp.xml
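The manual edit can also be scripted. A minimal sketch of the same text substitution (operating on a stand-in file here, not a real CIB dump; the id/type strings match the ones edited above):

```shell
# Stand-in for the dumped CIB: the primitive line before editing
printf '<primitive class="stonith" id="st-null" type="null"/>\n' > tmp.xml
# Rewrite it into the external/ssh primitive, as in the manual edit
sed -i 's|id="st-null" type="null"|id="st-ssh" type="external/ssh"|' tmp.xml
cat tmp.xml
```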
$ sudo crm_mon -1
============
Last updated: Mon Jul 22 16:40:43 2013
Last change: Mon Jul 22 16:40:38 2013 via cibadmin on xen-debian1
Stack: Heartbeat
Current DC: xen-debian1 (e0322e2c-d119-4b46-9be9-68cf992ce4d7) - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, unknown expected votes
2 Resources configured.
============
Online: [ xen-debian2 xen-debian1 ]
st-null:1 (stonith:null): ORPHANED Started xen-debian1
st-ssh (stonith:external/ssh): ORPHANED Started xen-debian2
st-null:0 (stonith:null): ORPHANED Started xen-debian2
$ sudo crm_mon -1
============
Last updated: Mon Jul 22 16:44:01 2013
Last change: Mon Jul 22 16:40:38 2013 via cibadmin on xen-debian1
Stack: Heartbeat
Current DC: xen-debian1 (e0322e2c-d119-4b46-9be9-68cf992ce4d7) - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, unknown expected votes
2 Resources configured.
============
Online: [ xen-debian2 xen-debian1 ]
Clone Set: fencing [st-ssh]
Started: [ xen-debian2 xen-debian1 ]
■Verification
$ cat check.txt
configure
show
$ sudo crm < check.txt
node $id="e0322e2c-d119-4b46-9be9-68cf992ce4d7" xen-debian1
node $id="fbe03c12-747e-45d7-8a25-469bc9c28f2d" xen-debian2
primitive st-ssh stonith:external/ssh \
params hostlist="xen-debian1 xen-debian2"
clone fencing st-ssh \
meta is-managed="true"
property $id="cib-bootstrap-options" \
dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
cluster-infrastructure="Heartbeat" \
no-quorum-policy="ignore"
■Dumping the XML to verify
$ sudo cibadmin --cib_query > after.xml
$ grep -B 1 -A 10 "clone id=.fencing" after.xml
<resources>
<clone id="fencing">
<meta_attributes id="fencing-meta_attributes">
<nvpair id="fencing-meta_attributes-is-managed" name="is-managed" value="true"/>
</meta_attributes>
<primitive class="stonith" id="st-ssh" type="external/ssh">
<instance_attributes id="st-ssh-instance_attributes">
<nvpair id="st-ssh-instance_attributes-hostlist" name="hostlist" value="xen-debian1 xen-debian2"/>
</instance_attributes>
</primitive>
</clone>
</resources>
■Looking a little further.
Of the actions the plugin could offer, only host reboot is supported.
$ sudo crm ra info stonith:external/ssh > stonith_ssh_parm.txt
$ head -5 stonith_ssh_parm.txt
ssh STONITH device (stonith:external/ssh)
ssh-based host reset
Fine for testing, but not suitable for production!
Only reboot action supported, no poweroff, and, surprisingly enough, no poweron.
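Before pulling the plug on heartbeat, the plugin can also be exercised by hand with the stonith(8) command-line tool (flags from my reading of stonith(8), so treat this as a sketch; note it really reboots the target, so only on a test box):

```shell
# Manually fire the external/ssh plugin at one node (this REBOOTS it)
sudo stonith -t external/ssh -p "hostlist=xen-debian1 xen-debian2" -T reset xen-debian2
```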
■atd is running with the daemon user's privileges (relevant because external/ssh schedules the reboot on the target node via at)
$ ps -ef | grep '/atd' | grep -v grep
daemon 2546 1 0 18:35 ? 00:00:00 /usr/sbin/atd
$ pstree -s 2546
init───atd
■Generate an ssh key for root with an empty passphrase, and confirm that root can log in to the other node without any prompt.
Check /etc/ssh/sshd_config as well (root login must be permitted).
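A sketch of that key setup, assuming each node needs non-interactive root ssh to its peer (paths and the PermitRootLogin directive are stock OpenSSH; run on each node, adjusting the peer name):

```shell
# Generate a passphrase-less key for root (-N "" means no passphrase prompt)
sudo ssh-keygen -t rsa -N "" -f /root/.ssh/id_rsa
# In /etc/ssh/sshd_config on the peer, allow key-based root login:
#   PermitRootLogin without-password
# Then copy the public key over and confirm no prompt appears:
sudo ssh-copy-id root@xen-debian2
sudo ssh root@xen-debian2 true
```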
■Stopping heartbeat
$ sudo pkill heartbeat
■It looks like st-ssh:1 was invoked but failed.
$ sudo crm_mon -A -1
============
Last updated: Mon Jul 22 20:18:41 2013
Last change: Mon Jul 22 18:46:38 2013 via cibadmin on xen-debian2
Stack: Heartbeat
Current DC: xen-debian1 (e0322e2c-d119-4b46-9be9-68cf992ce4d7) - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, unknown expected votes
2 Resources configured.
============
Node xen-debian1 (e0322e2c-d119-4b46-9be9-68cf992ce4d7): UNCLEAN (online)
Online: [ xen-debian2 ]
Clone Set: fencing [st-ssh]
st-ssh:1 (stonith:external/ssh): Started xen-debian1 FAILED
Started: [ xen-debian2 ]
Node Attributes:
* Node xen-debian2:
Failed actions:
st-ssh:1_stop_0 (node=xen-debian1, call=4, rc=1, status=complete): unknown error
■The CIB record shows that the stop operation failed.
$ sudo cibadmin --cib_query > new.xml
$ grep 'call-id=.4' new.xml | sed s/"\" "/"&\n"/g | grep -v "^ \$"
<lrm_rsc_op id="st-ssh:1_last_failure_0"
operation_key="st-ssh:1_stop_0"
operation="stop"
crm-debug-origin="do_update_resource"
crm_feature_set="3.0.6"
transition-key="5:31:0:8b121599-418c-4782-9071-6e7c03136c9e"
transition-magic="0:1;5:31:0:8b121599-418c-4782-9071-6e7c03136c9e"
call-id="4"
rc-code="1"
op-status="0"
interval="0"
last-run="1374491599"
last-rc-change="1374491599"
exec-time="20"
queue-time="0"
op-digest="9f77447a60d93cdf5b720166621885d0"/>
■The cause of the failure above: I got nervous and restarted heartbeat over and over before the system reboot finished.
This time it succeeded.
============
Last updated: Mon Jul 22 20:42:37 2013
Last change: Mon Jul 22 20:27:23 2013 via cibadmin on xen-debian2
Stack: Heartbeat
Current DC: xen-debian1 (e0322e2c-d119-4b46-9be9-68cf992ce4d7) - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, unknown expected votes
2 Resources configured.
============
Online: [ xen-debian2 xen-debian1 ]
Clone Set: fencing [st-ssh]
Started: [ xen-debian2 xen-debian1 ]
Node Attributes:
* Node xen-debian2:
* Node xen-debian1:
■The above is hard to follow, so here is the log.
Entries from cib and the like are excluded.
$ sudo grep -A 10000 "^Jul 22 20:50:56.*node .*is dead" /var/log/ha-log | \
sed s/".* `uname -n` "//g
heartbeat: [4200]: WARN: node xen-debian1: is dead
heartbeat: [4200]: info: Link xen-debian1:eth0 dead.
crmd: [4214]: notice: crmd_ha_status_callback: Status update: Node xen-debian1 now has status [dead]
crmd: [4214]: info: crm_update_peer_proc: xen-debian1.ais is now offline
...
pengine: [4359]: notice: unpack_config: On loss of CCM Quorum: Ignore
pengine: [4359]: WARN: pe_fence_node: Node xen-debian1 will be fenced because it is un-expectedly down
pengine: [4359]: WARN: determine_online_status: Node xen-debian1 is unclean
pengine: [4359]: WARN: custom_action: Action st-ssh:1_stop_0 on xen-debian1 is unrunnable (offline)
pengine: [4359]: WARN: custom_action: Marking node xen-debian1 unclean
pengine: [4359]: WARN: stage6: Scheduling Node xen-debian1 for STONITH
pengine: [4359]: notice: LogActions: Stop st-ssh:1 (xen-debian1)
...
crmd: [4214]: info: do_te_invoke: Processing graph 0 (ref=pe_calc-dc-1374493864-12) derived from /var/lib/pengine/pe-warn-2.bz2
crmd: [4214]: notice: te_fence_node: Executing reboot fencing operation (13) on xen-debian1 (timeout=60000)
stonith-ng: [4212]: info: initiate_remote_stonith_op: Initiating remote operation reboot for xen-debian1: d6c00c42-aff3-4819-92d9-86f0f24ddfd0
stonith-ng: [4212]: info: can_fence_host_with_device: Refreshing port list for st-ssh:0
stonith-ng: [4212]: info: can_fence_host_with_device: st-ssh:0 can fence xen-debian1: dynamic-list
...
perform op reboot xen-debian1
stonith-ng: [4212]: info: can_fence_host_with_device: st-ssh:0 can fence xen-debian1: dynamic-list
stonith-ng: [4212]: info: stonith_fence: Found 1 matching devices for 'xen-debian1'
stonith-ng: [4212]: info: stonith_command: Processed st_fence from xen-debian2: rc=-1
heartbeat: [4200]: info: Heartbeat restart on node xen-debian1
heartbeat: [4200]: info: Link xen-debian1:eth0 up.
heartbeat: [4200]: info: Status update for node xen-debian1: status init
heartbeat: [4200]: info: Status update for node xen-debian1: status up
crmd: [4214]: notice: crmd_ha_status_callback: Status update: Node xen-debian1 now has status [init]
crmd: [4214]: info: crm_update_peer_proc: xen-debian1.ais is now online
crmd: [4214]: notice: crmd_ha_status_callback: Status update: Node xen-debian1 now has status [up]
heartbeat: [4200]: info: Status update for node xen-debian1: status active
crmd: [4214]: notice: crmd_ha_status_callback: Status update: Node xen-debian1 now has status [active]
heartbeat: [4200]: info: No pkts missing from xen-debian1!
crmd: [4214]: notice: crmd_client_status_callback: Status update: Client xen-debian1/crmd now has status [online] (DC=true)