labunix's blog

labunix's Lab Unix

Trying an automatic system reboot with the STONITH plugin "external/ssh".

■ Before we start.
 Add a setting that ignores the quorum policy, since quorum is only meaningful with three or more nodes.

$ cat quorum-policy.txt 
	configure
	property no-quorum-policy=ignore
	
$ sudo crm < quorum-policy.txt

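■ Checking the STONITH plugins
 external/ssh is apparently meant for testing and is not supposed to be used in production.
 I want to test failover under the condition that the heartbeat process has gone wrong.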

$ sudo crm ra list stonith
apcmaster                 apcmastersnmp             apcsmart
baytech                   bladehpi                  cyclades
drac3                     external/drac5            external/dracmc-telnet
external/hetzner          external/hmchttp          external/ibmrsa
external/ibmrsa-telnet    external/ipmi             external/ippower9258
external/kdumpcheck       external/libvirt          external/nut
external/rackpdu          external/riloe            external/sbd
external/ssh              external/vcenter          external/vmware
external/xen0             external/xen0-ha          fence_legacy
fence_pcmk                ibmhmc                    ipmilan
meatware                  null                      nw_rpc100s
rcd_serial                rps10                     ssh
suicide                   wti_mpc                   wti_nps

■ Rechecking against the fencing configuration document.

$ lv -s /usr/share/doc/pacemaker/crm_fencing.txt.gz | grep "clone fencing"
        clone fencing st-null
        clone fencing st-ssh

$ sudo stonith -L | grep ssh
external/ssh
ssh

$ lv -s /usr/share/doc/pacemaker/crm_fencing.txt.gz | grep -A 5 "external/ssh config"
external/ssh configuration:

        primitive st-ssh stonith:external/ssh \
                params hostlist="node1 node2"
        clone fencing st-ssh

■ Loading it as-is fails: the IDs are already in use, so the running configuration is not changed.

$ cp sample.txt external-ssh.txt
$ cat external-ssh.txt
        configure
        primitive st-ssh stonith:external/ssh \
                params hostlist="xen-debian1 xen-debian2"
        clone fencing st-ssh
        commit
$ sudo crm < external-ssh.txt
ERROR: 3: st-ssh: id is already in use
ERROR: 4: fencing: id is already in use
INFO: 5: apparently there is nothing to commit
INFO: 5: try changing something first
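
The "id is already in use" errors can be anticipated by checking the current configuration first. A minimal sketch: the helper name id_in_use and the sample configuration are made up for illustration, and the pattern only covers primitive, clone, group and ms lines.

```shell
# id_in_use ID
# Read "crm configure show" style text on stdin and succeed if a primitive,
# clone, group or ms resource with exactly that ID is already defined.
id_in_use() {
    grep -Eq "^(primitive|clone|group|ms)[[:space:]]+$1([[:space:]]|\$)"
}

# Made-up sample configuration, only for the demo below.
sample_config() {
cat <<'EOF'
primitive st-ssh stonith:external/ssh \
	params hostlist="node1 node2"
clone fencing st-ssh
EOF
}

for id in st-ssh fencing st-new; do
    if sample_config | id_in_use "$id"; then
        echo "$id: already in use"
    else
        echo "$id: free"
    fi
done
```

On a live cluster the input would come from "sudo crm configure show" instead of sample_config.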

■ If crmd (heartbeat) is stopped, crm cannot connect.
 The child processes running as the hacluster user are the following.

$ pstree -u hacluster | grep -v "^\$"
attrd
ccm
cib
crmd

■ Run pstree on the PPIDs of the processes owned by hacluster's UID (the third field of /etc/passwd).
 The parent process is heartbeat.

$ ps -ef | grep heartbeat | \
  grep "^`grep hacluster /etc/passwd | awk -F\: '{print $3}'`" | \
  awk '{print $3}' | \
  sort -u | \
  pstree `xargs`
heartbeat─┬─attrd
          ├─ccm
          ├─cib
          ├─crmd
          ├─3*[heartbeat]
          ├─lrmd
          └─stonithd
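
The UID lookup used in the pipeline above can also be written as a small helper that matches the user name exactly. A minimal sketch: the function name uid_of, its optional file argument and the sample entries are assumptions for illustration.

```shell
# uid_of NAME [PASSWD_FILE]
# Print the numeric UID (third colon-separated field) of the user whose
# name is exactly NAME. PASSWD_FILE defaults to /etc/passwd; it is a
# parameter only so the helper can be tried against a sample file.
uid_of() {
    awk -F: -v name="$1" '$1 == name { print $3 }' "${2:-/etc/passwd}"
}

# Demo on a sample passwd-style file (made-up entries).
sample=$(mktemp)
printf '%s\n' \
    'root:x:0:0:root:/root:/bin/bash' \
    'hacluster:x:105:111:Heartbeat user:/var/lib/heartbeat:/bin/false' \
    'hacluster2:x:106:111::/nonexistent:/bin/false' > "$sample"
uid_of hacluster "$sample"
rm -f "$sample"
```

Unlike "grep hacluster /etc/passwd", this does not pick up other accounts whose name or comment merely contains the string.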

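■ I digressed.
 The running configuration can be changed by dumping the CIB with cibadmin, editing the XML, and replacing it.
 Since two STONITH resources already exist at this point, deleting one of them is enough.
 If you skipped the steps above, rewrite st-null directly instead.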

$ sudo cibadmin --cib_query > tmp.xml

--- edit ---
 primitive class="stonith" id="st-ssh" type="external/ssh"
--- edit ---

$ sudo cibadmin --cib_replace --xml-file tmp.xml

$ sudo crm_mon -1
============
Last updated: Mon Jul 22 16:40:43 2013
Last change: Mon Jul 22 16:40:38 2013 via cibadmin on xen-debian1
Stack: Heartbeat
Current DC: xen-debian1 (e0322e2c-d119-4b46-9be9-68cf992ce4d7) - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, unknown expected votes
2 Resources configured.
============

Online: [ xen-debian2 xen-debian1 ]

 st-null:1      (stonith:null):  ORPHANED Started xen-debian1
 st-ssh (stonith:external/ssh):  ORPHANED Started xen-debian2
 st-null:0      (stonith:null):  ORPHANED Started xen-debian2

$ sudo crm_mon -1
============
Last updated: Mon Jul 22 16:44:01 2013
Last change: Mon Jul 22 16:40:38 2013 via cibadmin on xen-debian1
Stack: Heartbeat
Current DC: xen-debian1 (e0322e2c-d119-4b46-9be9-68cf992ce4d7) - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, unknown expected votes
2 Resources configured.
============

Online: [ xen-debian2 xen-debian1 ]

 Clone Set: fencing [st-ssh]
     Started: [ xen-debian2 xen-debian1 ]

■ Verification

$ cat check.txt 
	configure
	show

$ sudo crm < check.txt 
node $id="e0322e2c-d119-4b46-9be9-68cf992ce4d7" xen-debian1
node $id="fbe03c12-747e-45d7-8a25-469bc9c28f2d" xen-debian2
primitive st-ssh stonith:external/ssh \
	params hostlist="xen-debian1 xen-debian2"
clone fencing st-ssh \
	meta is-managed="true"
property $id="cib-bootstrap-options" \
	dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
	cluster-infrastructure="Heartbeat" \
	no-quorum-policy="ignore"

■ Dump it as XML and check

$ sudo cibadmin --cib_query > after.xml
$ grep -B 1 -A 10 "clone id=.fencing" after.xml 
    <resources>
      <clone id="fencing">
        <meta_attributes id="fencing-meta_attributes">
          <nvpair id="fencing-meta_attributes-is-managed" name="is-managed" value="true"/>
        </meta_attributes>
        <primitive class="stonith" id="st-ssh" type="external/ssh">
          <instance_attributes id="st-ssh-instance_attributes">
            <nvpair id="st-ssh-instance_attributes-hostlist" name="hostlist" value="xen-debian1 xen-debian2"/>
          </instance_attributes>
        </primitive>
      </clone>
    </resources>

■ Looking a little further.
 Of the things it could do, the only supported action is rebooting the host.

$ sudo crm ra info stonith:external/ssh > stonith_ssh_parm.txt
$ head -5 stonith_ssh_parm.txt 
ssh STONITH device (stonith:external/ssh)

ssh-based host reset
Fine for testing, but not suitable for production!
Only reboot action supported, no poweroff, and, surprisingly enough, no poweron.

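■ atd is running with the privileges of the daemon user
 The external/ssh plugin appears to schedule the reboot on the target through at, so atd needs to be running.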

$ ps -ef | grep '/atd' | grep -v grep 
daemon    2546     1  0 18:35 ?        00:00:00 /usr/sbin/atd

$ pstree -s 2546
init───atd

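■ Generate an ssh key for root with an empty passphrase, and make sure login works without any input.
 (PermitEmptyPasswords concerns accounts with empty passwords, not key passphrases, so the sed step below is probably not needed for key-based login.)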

# ssh-keygen -t rsa
# sed -i s/"ermitEmptyPasswords no"/"ermitEmptyPasswords yes"/ \
  /etc/ssh/sshd_config 

# scp .ssh/id_rsa.pub xen-debian2:/root/.ssh/authorized_keys
# /etc/init.d/sshd restart

# ssh xen-debian1
#
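
Whether the login really needs no input can also be checked without opening an interactive session. A minimal sketch: the helper name can_ssh_unattended and the SSH_CMD override are hypothetical, and the dry run below replaces ssh with true/false instead of contacting a host.

```shell
# can_ssh_unattended HOST
# Succeed only if HOST accepts a login without any prompt.
# BatchMode=yes makes ssh fail instead of asking for a password or passphrase.
# SSH_CMD is a hypothetical override (default: ssh) so the helper can be
# dry-run without a real host.
can_ssh_unattended() {
    "${SSH_CMD:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Dry run: "true" stands in for a host that accepts the login,
# "false" for one that refuses it.
SSH_CMD=true  can_ssh_unattended xen-debian2 && echo "xen-debian2: ok"
SSH_CMD=false can_ssh_unattended xen-debian2 || echo "xen-debian2: would prompt or fail"
unset SSH_CMD
```

Run as root on each node against the other one, with SSH_CMD unset, this exercises the same kind of unattended login that the fencing agent depends on.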

■ Stopping heartbeat

$ sudo pkill heartbeat

■ "st-ssh:1" seems to have been called, but it failed.

$ sudo crm_mon -A -1
============
Last updated: Mon Jul 22 20:18:41 2013
Last change: Mon Jul 22 18:46:38 2013 via cibadmin on xen-debian2
Stack: Heartbeat
Current DC: xen-debian1 (e0322e2c-d119-4b46-9be9-68cf992ce4d7) - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, unknown expected votes
2 Resources configured.
============

Node xen-debian1 (e0322e2c-d119-4b46-9be9-68cf992ce4d7): UNCLEAN (online)
Online: [ xen-debian2 ]

 Clone Set: fencing [st-ssh]
     st-ssh:1	(stonith:external/ssh):	Started xen-debian1 FAILED
     Started: [ xen-debian2 ]

Node Attributes:
* Node xen-debian2:

Failed actions:
    st-ssh:1_stop_0 (node=xen-debian1, call=4, rc=1, status=complete): unknown error

■ This shows that the stop operation failed.

$ sudo cibadmin --cib_query > new.xml
$ grep 'call-id=.4' new.xml | sed s/"\" "/"&\n"/g | grep -v "^ \$"
            <lrm_rsc_op id="st-ssh:1_last_failure_0" 
operation_key="st-ssh:1_stop_0" 
operation="stop" 
crm-debug-origin="do_update_resource" 
crm_feature_set="3.0.6" 
transition-key="5:31:0:8b121599-418c-4782-9071-6e7c03136c9e" 
transition-magic="0:1;5:31:0:8b121599-418c-4782-9071-6e7c03136c9e" 
call-id="4" 
rc-code="1" 
op-status="0" 
interval="0" 
last-run="1374491599" 
last-rc-change="1374491599" 
exec-time="20" 
queue-time="0" 
op-digest="9f77447a60d93cdf5b720166621885d0"/>
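
To read a single field instead of the whole element, the attribute can be pulled out by name. A minimal sketch with sed: the helper name xml_attr is made up, and it assumes one element per line, as in the cibadmin output above.

```shell
# xml_attr NAME
# Read XML text on stdin and print the value of attribute NAME for every
# line that carries it. NAME must be preceded by whitespace, so asking
# for "id" does not match "call-id".
xml_attr() {
    sed -n "s/.*[[:space:]]$1=\"\([^\"]*\)\".*/\1/p"
}

# Demo on a shortened copy of the lrm_rsc_op element shown above.
op='<lrm_rsc_op id="st-ssh:1_last_failure_0" operation="stop" call-id="4" rc-code="1" op-status="0"/>'
printf '%s\n' "$op" | xml_attr operation
printf '%s\n' "$op" | xml_attr rc-code
```

Against the real dump this would be used as: grep 'call-id=.4' new.xml | xml_attr rc-code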

■ The failure above happened because I got nervous while waiting for the reboot and restarted heartbeat several times.
 This time it succeeded.

# crm_mon -A -1
============
Last updated: Mon Jul 22 20:42:37 2013
Last change: Mon Jul 22 20:27:23 2013 via cibadmin on xen-debian2
Stack: Heartbeat
Current DC: xen-debian1 (e0322e2c-d119-4b46-9be9-68cf992ce4d7) - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, unknown expected votes
2 Resources configured.
============

Online: [ xen-debian2 xen-debian1 ]

 Clone Set: fencing [st-ssh]
     Started: [ xen-debian2 xen-debian1 ]

Node Attributes:
* Node xen-debian2:
* Node xen-debian1:

■ That is hard to follow, so here is the log.
 Entries from cib and the like are omitted.

$ sudo grep -A 10000 "^Jul 22 20:50:56.*node .*is dead" /var/log/ha-log | \
  sed s/".* `uname -n` "//g
heartbeat: [4200]: WARN: node xen-debian1: is dead
heartbeat: [4200]: info: Link xen-debian1:eth0 dead.
crmd: [4214]: notice: crmd_ha_status_callback: Status update: Node xen-debian1 now has status [dead]
crmd: [4214]: info: crm_update_peer_proc: xen-debian1.ais is now offline
...
pengine: [4359]: notice: unpack_config: On loss of CCM Quorum: Ignore
pengine: [4359]: WARN: pe_fence_node: Node xen-debian1 will be fenced because it is un-expectedly down
pengine: [4359]: WARN: determine_online_status: Node xen-debian1 is unclean
pengine: [4359]: WARN: custom_action: Action st-ssh:1_stop_0 on xen-debian1 is unrunnable (offline)
pengine: [4359]: WARN: custom_action: Marking node xen-debian1 unclean
pengine: [4359]: WARN: stage6: Scheduling Node xen-debian1 for STONITH
pengine: [4359]: notice: LogActions: Stop    st-ssh:1   (xen-debian1)
...
crmd: [4214]: info: do_te_invoke: Processing graph 0 (ref=pe_calc-dc-1374493864-12) derived from /var/lib/pengine/pe-warn-2.bz2
crmd: [4214]: notice: te_fence_node: Executing reboot fencing operation (13) on xen-debian1 (timeout=60000)
stonith-ng: [4212]: info: initiate_remote_stonith_op: Initiating remote operation reboot for xen-debian1: d6c00c42-aff3-4819-92d9-86f0f24ddfd0
stonith-ng: [4212]: info: can_fence_host_with_device: Refreshing port list for st-ssh:0
stonith-ng: [4212]: info: can_fence_host_with_device: st-ssh:0 can fence xen-debian1: dynamic-list
...
perform op reboot xen-debian1
stonith-ng: [4212]: info: can_fence_host_with_device: st-ssh:0 can fence xen-debian1: dynamic-list
stonith-ng: [4212]: info: stonith_fence: Found 1 matching devices for 'xen-debian1'
stonith-ng: [4212]: info: stonith_command: Processed st_fence from xen-debian2: rc=-1
heartbeat: [4200]: info: Heartbeat restart on node xen-debian1
heartbeat: [4200]: info: Link xen-debian1:eth0 up.
heartbeat: [4200]: info: Status update for node xen-debian1: status init
heartbeat: [4200]: info: Status update for node xen-debian1: status up
crmd: [4214]: notice: crmd_ha_status_callback: Status update: Node xen-debian1 now has status [init]
crmd: [4214]: info: crm_update_peer_proc: xen-debian1.ais is now online
crmd: [4214]: notice: crmd_ha_status_callback: Status update: Node xen-debian1 now has status [up]
heartbeat: [4200]: info: Status update for node xen-debian1: status active
crmd: [4214]: notice: crmd_ha_status_callback: Status update: Node xen-debian1 now has status [active]
heartbeat: [4200]: info: No pkts missing from xen-debian1!
crmd: [4214]: notice: crmd_client_status_callback: Status update: Client xen-debian1/crmd now has status [online] (DC=true)
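
The log is easier to follow when reduced to the lines that mark each step of the fencing run. A minimal sketch: the helper name fence_events and the choice of patterns are my own, picked from the messages that appear in the excerpt above.

```shell
# fence_events
# Keep only the lines on stdin that mark a step of the fencing run:
# node declared dead, fencing decided and scheduled, reboot issued,
# heartbeat restarted on the node, node active again.
fence_events() {
    grep -E 'is dead|will be fenced|Scheduling Node|fencing operation|Heartbeat restart|now has status \[active\]'
}

# Demo on a few lines copied from the log above.
fence_events <<'EOF'
heartbeat: [4200]: WARN: node xen-debian1: is dead
heartbeat: [4200]: info: Link xen-debian1:eth0 dead.
pengine: [4359]: WARN: stage6: Scheduling Node xen-debian1 for STONITH
crmd: [4214]: notice: te_fence_node: Executing reboot fencing operation (13) on xen-debian1 (timeout=60000)
heartbeat: [4200]: info: Heartbeat restart on node xen-debian1
EOF
```

The "Link ... dead." line is dropped on purpose, because only "is dead" marks the node itself being declared dead.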