当主 DB 服务器和连接管理器/ICM 同时终止，然后 ICM 重新联机时，Informix HDR 故障转移不起作用

Question 1

做了一个实验室设置，我会发布一些设置和日志，以便您比较。

我有 Informix 服务器ifx_a作为主服务器machine1，并且有 Informix 服务器ifx_b作为 HDR 辅助服务器machine2。

在 Informix 服务器上，ifx_a我具有以下复制参数。 HA_ALIAS等于DBSERVERNAME，以保持简单）。 Informix 服务器ifx_b参数相同，但HA_ALIAS和除外DBSERVERNAME。

DRAUTO                  3
DRINTERVAL              0
HDR_TXN_SCOPE           ASYNC
DRTIMEOUT               30
HA_ALIAS                ifx_a
HA_FOC_ORDER            SDS,HDR,RSS
DRLOSTFOUND             $INFORMIXDIR/etc/dr.lostfound
DRIDXAUTO               0
LOG_INDEX_BUILDS        1
SDS_ENABLE
SDS_TIMEOUT             20
SDS_TEMPDBS
SDS_PAGING
SDS_LOGCHECK            10
SDS_ALTERNATE           NONE
SDS_FLOW_CONTROL        0
UPDATABLE_SECONDARY     0
FAILOVER_CALLBACK
FAILOVER_TX_TIMEOUT     0
TEMPTAB_NOLOG           1
DELAY_APPLY             0
STOP_APPLY              0
LOG_STAGING_DIR         $INFORMIXDIR/tmp
RSS_FLOW_CONTROL        0
SMX_NUMPIPES            1
ENABLE_SNAPSHOT_COPY    0
SMX_COMPRESS            0
SMX_PING_INTERVAL       10
SMX_PING_RETRY          6
CLUSTER_TXN_SCOPE       SERVER

sqlhosts两个 Informix 服务器的文件是：

ifx_a    onsoctcp    machine1.local    15010    k=1
ifx_b    onsoctcp    machine2.local    15020    k=1

复制已设置并运行，在我运行的辅助服务器上ifx_b：

$ onstat -g cluster

IBM Informix Dynamic Server Version 12.10.FC12 -- Read-Only (Sec) -- Up 00:05:08 -- 156276 Kbytes

Primary Server:ifx_a
Index page logging status: Enabled
Index page logging was enabled at: 2018/07/31 23:44:14


Server ACKed Log    Supports     Status
       (log, page)  Updates
ifx_b  16,6         No           ASYNC(HDR),Connected,On


$ onstat -g dri

IBM Informix Dynamic Server Version 12.10.FC12 -- Read-Only (Sec) -- Up 00:12:23 -- 156276 Kbytes

Data Replication at 0x45ac6028:
  Type           State        Paired server        Last DR CKPT (id/pg)    Supports Proxy Writes
  HDR Secondary  on           ifx_a                        16 / 5          N

  DRINTERVAL   0
  DRTIMEOUT    30
  DRAUTO       3
  DRLOSTFOUND  /opt/IBM/informix/12.10/etc/dr.lostfound
  DRIDXAUTO    0
  ENCRYPT_HDR  0
  Backlog      0
  Last Send    2018/08/01 00:12:12
  Last Receive 2018/08/01 00:12:12
  Last Ping    2018/08/01 00:12:03
  Last log page applied(log id,page): 0,0

在主服务器上ifx_a我运行：

$ onstat -g cluster

IBM Informix Dynamic Server Version 12.10.FC12 -- On-Line (Prim) -- Up 00:21:30 -- 156276 Kbytes

Primary Server:ifx_a
Current Log Page:16,6
Index page logging status: Enabled
Index page logging was enabled at: 2018/07/31 23:44:14


Server ACKed Log    Applied Log  Supports     Status
       (log, page)  (log, page)  Updates
ifx_b  16,6         16,6         No           ASYNC(HDR),Connected,On


$ onstat -g dri

IBM Informix Dynamic Server Version 12.10.FC12 -- On-Line (Prim) -- Up 00:27:49 -- 156276 Kbytes

Data Replication at 0x45ac6028:
  Type           State        Paired server        Last DR CKPT (id/pg)    Supports Proxy Writes
  primary        on           ifx_b                        16 / 5          NA

  DRINTERVAL   0
  DRTIMEOUT    30
  DRAUTO       3
  DRLOSTFOUND  /opt/IBM/informix/12.10/etc/dr.lostfound
  DRIDXAUTO    0
  ENCRYPT_HDR  0
  Backlog      0
  Last Send    2018/08/01 00:11:53
  Last Receive 2018/08/01 00:11:53
  Last Ping    2018/08/01 00:11:32
  Last log page applied(log id,page): 16,6

因此复制有效并且服务器同步。

在第三台机器上，machine3我使用以下参数设置了连接管理器：

NAME            icm_1
LOG             1
LOGFILE         ${INFORMIXDIR}/tmp/icm_1.log
DEBUG           1

CM_TIMEOUT 60
EVENT_TIMEOUT 60
SECONDARY_EVENT_TIMEOUT 60
SQLHOSTS LOCAL

CLUSTER ifx_abc
{
    INFORMIXSERVER servers_abc

    SLA sla_ifx_abc    DBSERVERS=PRI
    FOC ORDER=ENABLED \
       PRIORITY=1
}

sqlhosts连接管理器的文件：

servers_abc  group       -                 -        i=10,c=1,e=ifx_b
ifx_a        onsoctcp    machine1.local    15010    k=1,g=servers_abc
ifx_b        onsoctcp    machine2.local    15020    k=1,g=servers_abc

sla_ifx_abc  onsoctcp    machine3.local    15030    k=1

现在我强行关闭了machine1Informix 服务器，并在其在线日志上得到了以下信息ifx_b：

08/02/18 01:00:26  Maximum server connections 1
08/02/18 01:00:26  Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 38, Llog used 0

08/02/18 01:06:39  The SMX connection between high availability servers was closed because the
 peer server was unresponsive for the timeout period (60 seconds times the
 number of retries).
08/02/18 01:06:39  The SMX connection between high availability servers was closed because the
 peer server was unresponsive for the timeout period (60 seconds times the
 number of retries).
08/02/18 01:07:28  DR: ping timeout
08/02/18 01:07:28  DR: Receive error
08/02/18 01:07:28  dr_secrcv thread : asfcode = -25582: oserr = 4: errstr = : Network connection is broken.
 System error = 4.
08/02/18 01:07:28  DR_ERR set to -1
08/02/18 01:07:28  DR: Receive Btree error
08/02/18 01:07:29  DR: Turned off on secondary server
08/02/18 01:07:36  SCHAPI: Issued Task() or Admin() command "task( 'ha make primary force', 'ifx_b' )".
08/02/18 01:07:37  Skipping failover callback.
08/02/18 01:07:48  Logical Recovery has reached the transaction cleanup phase.
08/02/18 01:07:48  Checkpoint Completed:  duration was 0 seconds.
08/02/18 01:07:48  Thu Aug  2 - loguniq 19, logpos 0xcb4018, timestamp: 0xf287b Interval: 148

08/02/18 01:07:48  Maximum server connections 2
08/02/18 01:07:48  Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 72, Llog used 1

08/02/18 01:07:48  Logical Recovery Complete.
          9001 Committed, 0 Rolled Back, 0 Open, 0 Bad Locks

08/02/18 01:07:48  Logical Recovery Complete.
08/02/18 01:07:48  Performance Advisory: Based on the current workload, the physical log might be too small to
accommodate the time it takes to flush the buffer pool.
08/02/18 01:07:48   Results: The server might block transactions during checkpoints.
08/02/18 01:07:48   Action: If transactions are blocked during the checkpoint, increase the size of the
 physical log to at least 716800 KB.
08/02/18 01:07:48  Performance Advisory: Based on the current workload, the logical log space might be too small to
accommodate the time it takes to flush the buffer pool.
08/02/18 01:07:48   Results: The server might block transactions during checkpoints.
08/02/18 01:07:48   Action: If transactions are blocked during the checkpoint, increase the size of the
 logical log space to at least 89600 KB.
08/02/18 01:07:48  Performance Advisory: The physical log is too small for automatic checkpoints.
08/02/18 01:07:48   Results: Automatic checkpoints are disabled.
08/02/18 01:07:48   Action: To enable automatic checkpoints, increase the physical log to at least 716800 KB.
08/02/18 01:07:48  Quiescent Mode
08/02/18 01:07:48  Checkpoint Completed:  duration was 0 seconds.
08/02/18 01:07:48  Thu Aug  2 - loguniq 19, logpos 0xcb6018, timestamp: 0xf288d Interval: 149

08/02/18 01:07:48  Maximum server connections 2
08/02/18 01:07:48  Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 4, Llog used 2

08/02/18 01:07:48  B-tree scanners enabled.
08/02/18 01:07:48  DR: Reservation of the last logical log for log backup turned on
08/02/18 01:07:48  DR: new type = primary, secondary server name = ifx_a
08/02/18 01:07:48  DR: Trying to connect to secondary server = ifx_a
08/02/18 01:07:48  Starting BldNotification
08/02/18 01:07:49  On-Line Mode
08/02/18 01:07:50  SCHAPI: Started dbScheduler thread.
08/02/18 01:07:50  Auto Registration is synced
08/02/18 01:07:50  SCHAPI: Started 2 dbWorker threads.
08/02/18 01:07:51  DR: Cannot connect to secondary server
08/02/18 01:07:51  DR: Turned off on primary server
08/02/18 01:07:51  DDR Log staging: Using the directory /opt/IBM/informix/12.10/tmp/ifmxddrlog_2.
08/02/18 01:07:51  SCHAPI: dbutil threads is already running.
08/02/18 01:07:51  SCHAPI: dbScheduler threads is already running.
08/02/18 01:07:52  Defragmenter cleaner thread now running
08/02/18 01:07:52  Defragmenter cleaner thread cleaned:0 partitions
08/02/18 01:08:12  DR: Cannot connect to secondary server
08/02/18 01:08:12  DR: Turned off on primary server
08/02/18 01:09:14  DR: Cannot connect to secondary server
08/02/18 01:09:14  DR: Turned off on primary server
08/02/18 01:10:14  DR: Cannot connect to secondary server
08/02/18 01:10:14  DR: Turned off on primary server
08/02/18 01:11:14  DR: Cannot connect to secondary server
08/02/18 01:11:14  DR: Turned off on primary server
08/02/18 01:12:14  DR: Cannot connect to secondary server
08/02/18 01:12:14  DR: Turned off on primary server
08/02/18 01:12:53  Checkpoint Completed:  duration was 0 seconds.
08/02/18 01:12:53  Thu Aug  2 - loguniq 19, logpos 0xcf0704, timestamp: 0xf3ab0 Interval: 150

08/02/18 01:12:53  Maximum server connections 2
08/02/18 01:12:53  Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 39, Llog used 58

08/02/18 01:13:14  DR: Cannot connect to secondary server
08/02/18 01:13:14  DR: Turned off on primary server

在连接管理器日志中我得到：

Thu Aug  2 00:57:42 2018
00:57:42 IBM Informix Connection Manager
00:57:42 IBM Informix CSDK Version 4.10, IBM Informix-ESQL Version 4.10.FC12
00:57:42 Build Number:  N037
00:57:42 Build Host:    hans
00:57:42 Build OS:      Linux 2.6.18-128.el5
00:57:42 Build Date:    Tue Jun 26 08:55:45 CDT 2018
00:57:42 GLS Version:   glslib-6.00.FC13
00:57:42
00:57:42 set CM_TIMEOUT 60
00:57:42 set global EVENT_TIMEOUT 60
00:57:42 set global SECONDARY_EVENT_TIMEOUT 60
00:57:42 DEBUG[TID139959615362880]:add type PRI to SLA sla_ifx_abc [cmsm_sla.c:parse_sla:2412]
00:57:42 SLA sla_ifx_abc listener mode REDIRECT
00:57:42 Connection Manager name is icm_1
00:57:42 Starting Connection Manager...
00:57:42 DEBUG[TID139959615362880]:Password File /opt/IBM/informix/4.10/etc/passwd_file failed error:No such file or directory[2] [cmsm_main.c:cmsm_pw_init:1782]
00:57:42 Warning: Password Manager failed; working in trusted node mode
00:57:42 set the maximum number of file descriptors 32767 failed
00:57:42 DEBUG[TID139959615362880]:setrlimit(RLIMIT_NOFILE, 32767) = -1, error:Operation not permitted[1] [cmsm_main.c:cmsm_run:4901]
00:57:42 the current maximum number of file descriptors is 1024
00:57:42 DEBUG[TID139959615362880]:new daemon pid is 11761 [cmsm_main.c:cmsm_daemonize:4653]
00:57:42 DEBUG[TID139959615362880]:dbservername = servers_abc [cmsm_server.ec:cmsm_primary:2825]
00:57:42 DEBUG[TID139959615362880]:nettype =  [cmsm_server.ec:cmsm_primary:2826]
00:57:42 DEBUG[TID139959615362880]:hostname = - [cmsm_server.ec:cmsm_primary:2827]
00:57:42 DEBUG[TID139959615362880]:servicename = - [cmsm_server.ec:cmsm_primary:2828]
00:57:42 DEBUG[TID139959615362880]:options = i=10,c=1,e=ifx_b [cmsm_server.ec:cmsm_primary:2829]
00:57:42 DEBUG[TID139959615362880]:connect to servers_abc for CLUSTER ifx_abc [cmsm_server.ec:cmsm_primary:2845]
00:57:42 DEBUG[TID139959615362880]:connect to group servers_abc server ifx_a for CLUSTER ifx_abc [cmsm_server.ec:cmsm_primary:2866]
00:57:42 DEBUG[TID139959615362880]:create new thread -509528320 for ifx_a [cmsm_sla.c:cmsm_update_server_ex:996]
00:57:42 DEBUG[TID139959615362880]:connect to group servers_abc server ifx_b for CLUSTER ifx_abc [cmsm_server.ec:cmsm_primary:2866]
00:57:42 DEBUG[TID139959615362880]:create new thread -511629568 for ifx_b [cmsm_sla.c:cmsm_update_server_ex:996]
00:57:42 listener sla_ifx_abc initializing
00:57:42 DEBUG[TID139959585543936]:listenter sla_ifx_abc host=machine3.local service=15030 nettype=soctcp engtypeon [cmsm_main.c:cmsm_listener_thread:1166]
00:57:42 DEBUG[TID139959589746432]:CONNECT to @ifx_a|onsoctcp|machine1.local|15010 AS ifx_a1   SQLCODE = (0,0,) [cmsm_server.ec:cmsm_connect_byurl_opt:136]
00:57:42 DEBUG[TID139959589746432]:database sysmaster @ifx_a|onsoctcp|machine1.local|15010 AS ifx_a1   SQLCODE = (0,0,) [cmsm_server.ec:cmsm_connect_byurl_opt:145]
00:57:42 DEBUG[TID139959589746432]:ifx_a protocols = 1 [cmsm_server.ec:cmsm_add_dbaliases:2433]
00:57:42 DEBUG[TID139959589746432]:Cluster ifx_abc Arbitrator setting primary name = ifx_a [cmsm_arb.c:arb_set_primary:793]
00:57:42 Cluster ifx_abc Arbitrator FOC ORDER=ENABLED        PRIORITY=1
00:57:42 Connection Manager successfully connected to ifx_a
00:57:42 DEBUG[TID139959587645184]:CONNECT to @ifx_b|onsoctcp|machine2.local|15020 AS ifx_b1   SQLCODE = (0,0,) [cmsm_server.ec:cmsm_connect_byurl_opt:136]
00:57:42 DEBUG[TID139959587645184]:database sysmaster @ifx_b|onsoctcp|machine2.local|15020 AS ifx_b1   SQLCODE = (0,0,) [cmsm_server.ec:cmsm_connect_byurl_opt:145]
00:57:42 Listener sla_ifx_abc DBSERVERS=PRI is active with 4 worker threads
00:57:42 DEBUG[TID139959587645184]:ifx_b protocols = 1 [cmsm_server.ec:cmsm_add_dbaliases:2433]
00:57:42 Connection Manager successfully connected to ifx_b
00:57:42 DEBUG[TID139959589746432]:server ifx_a version is 12.10 [cmsm_er.ec:cmsm_er_monitor:2994]
00:57:42 DEBUG[TID139959589746432]:ACTIVE MONITOR ifx_a ifx_a 192.168.56.101 49464 [cmsm_er.ec:cmsm_er_monitor:3036]
00:57:42 DEBUG[TID139959589746432]:register CM successfully icm_1 with token 1782701540 [cmsm_server.ec:cmsm_cm_register:2336]
00:57:42 DEBUG[TID139959587645184]:server ifx_b version is 12.10 [cmsm_er.ec:cmsm_er_monitor:2994]
00:57:42 DEBUG[TID139959587645184]:ACTIVE MONITOR ifx_b ifx_b 192.168.56.101 44376 [cmsm_er.ec:cmsm_er_monitor:3036]
00:57:42 DEBUG[TID139959589746432]:Register server ifx_a for ifx_a count = 1 [cmsm_er.ec:cmsm_er_event_process:1281]
00:57:42 DEBUG[TID139959587645184]:register CM successfully icm_1 with token 1120237392 [cmsm_server.ec:cmsm_cm_register:2336]
00:57:42 DEBUG[TID139959587645184]:Register server ifx_b for ifx_b count = 1 [cmsm_er.ec:cmsm_er_event_process:1281]
00:57:43 DEBUG[TID139959615362880]:fcntl(/opt/IBM/informix/4.10/tmp/cmsm.pid.icm_1) success error:Success[0] [cmsm_main.c:cmsm_pidfile:2193]
00:57:43 Connection Manager started successfully
00:57:43 DEBUG[TID139959589746432]:SQL get ifx_a event SRV_ADM 16:5 SDS,HDR,RSS [cmsm_er.ec:cmsm_er_event_process:1883]
00:57:43 DEBUG[TID139959589746432]:Cluster ifx_abc Arbitrator reinitialized CM names [cmsm_arb.c:arb_clear_cms_list:655]
00:57:43 DEBUG[TID139959589746432]:Cluster ifx_abc Arbitrator added CM name = icm_1 [cmsm_arb.c:arb_add_cm:584]
00:57:43 CM icm_1 arbitrator for ifx_abc is active
00:57:43 Cluster ifx_abc Arbitrator FOC ORDER=SDS,HDR,RSS PRIORITY=1 TIMEOUT=1
00:57:43 DEBUG[TID139959589746432]:Cluster ifx_abc Arbitrator reinitialized CM names [cmsm_arb.c:arb_clear_cms_list:655]
00:57:43 DEBUG[TID139959589746432]:Cluster ifx_abc Arbitrator added CM name = icm_1 [cmsm_arb.c:arb_add_cm:584]
00:57:43 DEBUG[TID139959589746432]:Cluster ifx_abc Arbitrator reinitialized CM names [cmsm_arb.c:arb_clear_cms_list:655]
00:57:43 DEBUG[TID139959589746432]:Cluster ifx_abc Arbitrator added CM name = icm_1 [cmsm_arb.c:arb_add_cm:584]
00:57:43 DEBUG[TID139959589746432]:Cluster ifx_abc Arbitrator reinitialized CM names [cmsm_arb.c:arb_clear_cms_list:655]
00:57:43 DEBUG[TID139959589746432]:Cluster ifx_abc Arbitrator added CM name = icm_1 [cmsm_arb.c:arb_add_cm:584]
01:01:25 DEBUG[TID139959583442688]:SLA sla_ifx_abc 3 ifx_a time=1 latency= 0.00 readyQ=19 session=0 adjustSession=0 [cmsm_sla.c:select_server_from_sla:2947]
01:01:25 SLA sla_ifx_abc redirect SQLI client from 192.168.56.101 to ifx_a 192.168.56.102.15010
01:01:39 DEBUG[TID139959581341440]:SLA sla_ifx_abc 3 ifx_a time=0 latency= 0.00 readyQ=18 session=0 adjustSession=0 [cmsm_sla.c:select_server_from_sla:2947]
01:01:39 SLA sla_ifx_abc redirect SQLI client from 192.168.56.101 to ifx_a 192.168.56.102.15010
01:01:39 DEBUG[TID139959579240192]:SLA sla_ifx_abc 3 ifx_a time=0 latency= 0.00 readyQ=18 session=0 adjustSession=1 [cmsm_sla.c:select_server_from_sla:2947]
01:01:39 SLA sla_ifx_abc redirect SQLI client from 192.168.56.101 to ifx_a 192.168.56.102.15010
01:02:09 DEBUG[TID139959577138944]:SLA sla_ifx_abc 3 ifx_a time=6 latency= 0.00 readyQ=16 session=0 adjustSession=0 [cmsm_sla.c:select_server_from_sla:2947]
01:02:09 SLA sla_ifx_abc redirect SQLI client from 192.168.56.101 to ifx_a 192.168.56.102.15010
01:06:16 Server ifx_a no response over EVENT_TIMEOUT 60 seconds
01:06:16 Connection Manager disconnected from ifx_a
01:06:16 DEBUG[TID139959589746432]:start failover in sqlbreakcallback for ifx_a [cmsm_server.ec:cmsm_event_callback_common:1274]
01:06:16 DEBUG[TID139959575037696]:starting the failover process [cmsm_server.ec:cmsm_failover_incallback:1184]
01:06:17 Cluster ifx_abc Arbitrator detected primary server is down.
01:06:17 Cluster ifx_abc Arbitrator detected majority of secondarys are still connected.
01:06:17 Cluster ifx_abc Arbitrator will not perform failover.
01:06:17 DEBUG[TID139959575037696]:failover process returns -1 [cmsm_server.ec:cmsm_failover_incallback:1207]
01:06:39 DEBUG[TID139959587645184]:create new thread -524237056 for ifx_a [cmsm_sla.c:cmsm_update_server_ex:996]
01:06:39 DEBUG[TID139959587645184]:create new thread -526338304 for ifx_a [cmsm_sla.c:cmsm_update_server_ex:996]
01:06:39 DEBUG[TID139959575037696]:Exit server ifx_a monitor machine3.local thread status=8 [cmsm_er.ec:cmsm_er_monitor:2608]
01:07:19 DEBUG[TID139959589746432]:fetch sysrepstats_cursor SQLCODE = (-213,0,) [cmsm_er.ec:cmsm_er_event_process:1711]
01:07:19 DEBUG[TID139959589746432]:unregister event failed, sqlcode = (-1811,0,) [cmsm_server.ec:cmsm_cm_unregister:2383]
01:07:19 DEBUG[TID139959589746432]:Unregister server ifx_a for ifx_a count = 0 [cmsm_er.ec:cmsm_er_event_process:2092]
01:07:19 Connection Manager disconnected from ifx_a
01:07:19 ALARM 3002 detected lost connection to Informix server ifx_a from machine3.local
01:07:19 DEBUG[TID139959589746432]:Connection Manager disconnected from ifx_a sqlcode=-1803 [cmsm_er.ec:cmsm_er_monitor:3090]
01:07:19 Cluster ifx_abc Arbitrator detected primary server is down.
01:07:19 Cluster ifx_abc Arbitrator detected majority of secondarys are still connected.
01:07:19 Cluster ifx_abc Arbitrator will not perform failover.
01:07:19 Cluster ifx_abc Arbitrator detected primary server is down.
01:07:19 Cluster ifx_abc Arbitrator detected majority of secondarys are still connected.
01:07:19 Cluster ifx_abc Arbitrator will not perform failover.
01:07:20 Cluster ifx_abc Arbitrator detected primary server is down.
01:07:20 Cluster ifx_abc Arbitrator detected majority of secondarys are still connected.
01:07:20 Cluster ifx_abc Arbitrator will not perform failover.
01:07:21 DEBUG[TID139959572936448]:CONNECT to @ifx_a|onsoctcp|machine1.local|15010 AS ifx_a2   SQLCODE = (-908,107,ifx_a) [cmsm_server.ec:cmsm_connect_byurl_opt:136]
01:07:21 ALARM 3001 unable to connect to Informix server ifx_a from machine3.local error -908
01:07:21 Cluster ifx_abc Arbitrator detected primary server is down.
01:07:21 Cluster ifx_abc Arbitrator detected majority of secondarys are still connected.
01:07:21 Cluster ifx_abc Arbitrator will not perform failover.
01:07:22 DEBUG[TID139959572936448]:Exit server ifx_a monitor machine3.local thread status=8 [cmsm_er.ec:cmsm_er_monitor:2608]
01:07:24 DEBUG[TID139959589746432]:CONNECT to @ifx_a|onsoctcp|machine1.local|15010 AS ifx_a3   SQLCODE = (-908,107,ifx_a) [cmsm_server.ec:cmsm_connect_byurl_opt:136]
01:07:24 Cluster ifx_abc Arbitrator detected primary server is down.
01:07:24 Cluster ifx_abc Arbitrator detected majority of secondarys are still connected.
01:07:24 Cluster ifx_abc Arbitrator will not perform failover.
01:07:27 DEBUG[TID139959589746432]:CONNECT to @ifx_a|onsoctcp|machine1.local|15010 AS ifx_a4   SQLCODE = (-908,107,ifx_a) [cmsm_server.ec:cmsm_connect_byurl_opt:136]
01:07:27 Cluster ifx_abc Arbitrator detected primary server is down.
01:07:27 Cluster ifx_abc Arbitrator detected majority of secondarys are still connected.
01:07:27 Cluster ifx_abc Arbitrator will not perform failover.
01:07:30 DEBUG[TID139959587645184]:get ifx_b CLUST_CHG event 1:6[DROFF] ifx_a [cmsm_er.ec:cmsm_er_event_process:1734]
01:07:30 DEBUG[TID139959589746432]:CONNECT to @ifx_a|onsoctcp|machine1.local|15010 AS ifx_a5   SQLCODE = (-908,107,ifx_a) [cmsm_server.ec:cmsm_connect_byurl_opt:136]
01:07:30 DEBUG[TID139959589746432]:Cluster ifx_abc Arbitrator start failover logic now... [cmsm_arb.c:arb_server_down:1564]
01:07:30 ALARM 2001 failover arbitrator automated failover in progress.
01:07:30 DEBUG[TID139959589746432]:Cluster ifx_abc Arbitrator try failover on generic SDS [cmsm_arb.c:arb_failover_by_foc:1174]
01:07:30 DEBUG[TID139959589746432]:Cluster ifx_abc Arbitrator try failover on generic HDR [cmsm_arb.c:arb_failover_by_foc:1200]
01:07:30 DEBUG[TID139959589746432]:CLuster ifx_abc Arbitrator failover to ifx_b, waiting 150 seconds for confirmation [cmsm_arb.c:arb_post_wait:1764]
01:07:30 DEBUG[TID139959589746432]:Arbitrator found failover node = ifx_b [cmsm_server_arb.ec:arb_failover_to:303]
01:07:37 DEBUG[TID139959589746432]:CONNECT to @ifx_b|onsoctcp|machine2.local|15020 AS ifx_b2   SQLCODE = (0,0,) [cmsm_server.ec:cmsm_connect_byurl_opt:136]
01:07:37 DEBUG[TID139959589746432]:database sysmaster @ifx_b|onsoctcp|machine2.local|15020 AS ifx_b2   SQLCODE = (0,0,) [cmsm_server.ec:cmsm_connect_byurl_opt:145]
01:07:37 DEBUG[TID139959589746432]:Arbitrator connected to failover node = ifx_b [cmsm_server_arb.ec:arb_failover_to:316]
01:07:37 DEBUG[TID139959589746432]:Arbitrator get HA_ALIAS on ifx_b, HA_ALIAS [ifx_b] [cmsm_server_arb.ec:arb_failover_to:326]
01:07:50 DEBUG[TID139959587645184]:SQL get ifx_b event SRV_ADM 16:3 1 [cmsm_er.ec:cmsm_er_event_process:1883]
01:07:50 Server ifx_b is in quiescent mode.
01:07:50 DEBUG[TID139959587645184]:get ifx_b CLUST_CHG event 1:6[DROFF] ifx_a [cmsm_er.ec:cmsm_er_event_process:1734]
01:07:50 DEBUG[TID139959587645184]:SQL get ifx_b event SRV_ADM 16:3 5 [cmsm_er.ec:cmsm_er_event_process:1883]
01:07:50 The server type of cluster ifx_abc server ifx_b is Primary.
01:07:50 DEBUG[TID139959587645184]:Cluster ifx_abc Arbitrator setting primary name = ifx_b [cmsm_arb.c:arb_set_primary:793]
01:07:50 Cluster ifx_abc Arbitrator FOC ORDER=SDS,HDR,RSS PRIORITY=1 TIMEOUT=1
01:07:50 Server ifx_b is in on-line mode.
01:07:50 DEBUG[TID139959587645184]:get ifx_b CLUST_CHG event 1:5[NEWPRI] ifx_b [cmsm_er.ec:cmsm_er_event_process:1734]
01:07:50 Arbitrator make primary on node = ifx_b successful
01:07:50 ALARM 2002 failover arbitrator automated failover completed
01:07:50 DEBUG[TID139959589746432]:Monitor ifx_a exit by arbitrator  [cmsm_er.ec:cmsm_er_monitor:2789]
01:07:50 DEBUG[TID139959589746432]:all monitors exited, server ifx_a register count = 0 [cmsm_er.ec:cmsm_er_monitor:3156]
01:07:50 DEBUG[TID139959589746432]:enqueue a cleanup task for ifx_a [cmsm_er.ec:cmsm_enqueu_cleanup:2378]
01:07:50 DEBUG[TID139959587645184]:server monitor ifx_b dequeue a cleanup task  [cmsm_er.ec:cmsm_er_event_process:2060]
01:07:50 DEBUG[TID139959587645184]:SQL get ifx_b event SRV_ADM 16:5 SDS,HDR,RSS [cmsm_er.ec:cmsm_er_event_process:1883]
01:07:50 DEBUG[TID139959587645184]:Cluster ifx_abc Arbitrator reinitialized CM names [cmsm_arb.c:arb_clear_cms_list:655]
01:07:50 DEBUG[TID139959587645184]:Cluster ifx_abc Arbitrator added CM name = icm_1 [cmsm_arb.c:arb_add_cm:584]
01:07:57 DEBUG[TID139959587645184]:get ifx_b CLUST_CHG event 1:6[DROFF] ifx_a [cmsm_er.ec:cmsm_er_event_process:1734]
01:08:13 DEBUG[TID139959587645184]:get ifx_b CLUST_CHG event 1:6[DROFF] ifx_a [cmsm_er.ec:cmsm_er_event_process:1734]
01:09:19 DEBUG[TID139959587645184]:get ifx_b CLUST_CHG event 1:6[DROFF] ifx_a [cmsm_er.ec:cmsm_er_event_process:1734]
01:10:17 DEBUG[TID139959587645184]:get ifx_b CLUST_CHG event 1:6[DROFF] ifx_a [cmsm_er.ec:cmsm_er_event_process:1734]
01:11:22 DEBUG[TID139959587645184]:get ifx_b CLUST_CHG event 1:6[DROFF] ifx_a [cmsm_er.ec:cmsm_er_event_process:1734]
01:12:20 DEBUG[TID139959587645184]:get ifx_b CLUST_CHG event 1:6[DROFF] ifx_a [cmsm_er.ec:cmsm_er_event_process:1734]
01:13:17 DEBUG[TID139959587645184]:get ifx_b CLUST_CHG event 1:6[DROFF] ifx_a [cmsm_er.ec:cmsm_er_event_process:1734]

因此，存在差异，因为看起来您在提升辅助权限期间遇到了 AF（断言失败）。

你能发布更多日志吗？从关闭主服务器之前到辅助服务器陷入恢复状态。

Answer