红色警报 ORACLE RAC 11.2.0.4 FOR SOLARIS 10 ASM 和DB因集群心跳丢失重启

    ​  ORACLE RAC 11.2.0.4 FOR SOLARIS 10 ASM 和DB因集群心跳丢失重启。该问题是BUG 10194190 18740837

导致的,修复该问题需要打patch 25142535。有一客户的solaris oracle 11.2.0.4 rac所有节点的DB已经打上补丁,但是

主机运行248天后,跑在上边的Oracle 11.2.0.4依然因集群心跳丢失而重启。根据ORACLE MOS官方回复,修复该BUG

的一个补丁包 patch 18740837是需要同时在GI软件上应用。

    ​  下边是相关客户案例的问题分析及解决处理总结。

    ​   1、问题节点DB告警日志报错提示

红色警报 ORACLE RAC 11.2.0.4 FOR SOLARIS 10 ASM 和DB因集群心跳丢失重启

    ​2、问题节点ASM告警日志报错提示

红色警报 ORACLE RAC 11.2.0.4 FOR SOLARIS 10 ASM 和DB因集群心跳丢失重启

    ​  3、+ASM2_lmhb_4609_i78497.trc文件内容

红色警报 ORACLE RAC 11.2.0.4 FOR SOLARIS 10 ASM 和DB因集群心跳丢失重启

    ​  4、OCSSD日志提示心跳超时

红色警报 ORACLE RAC 11.2.0.4 FOR SOLARIS 10 ASM 和DB因集群心跳丢失重启

    ​  5、+ASM2_lmhb_4609_i78497.trc文件局部信息

===[ Session State Object ]===

  ----------------------------------------

  SO: 0x3ffdb38d8, type: 4, owner: 0x400b0c258, flag: INIT/-/-/0x00 if: 0x3 c: 0x3

   proc=0x400b0c258, name=session, file=ksu.h LINE:12729 ID:, pg=0

  (session) sid: 145 ser: 1 trans: 0x0, creator: 0x400b0c258

            flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/-

            flags2: (0x409) -/-/INC

            DID: , short-term DID:

            txn branch: 0x0

            edition#: 0            oct: 0, prv: 0, sql: 0x0, psql: 0x0, user: 0/SYS

  ksuxds FALSE at location: 0

  service name: SYS$BACKGROUND

  Current Wait Stack:

   0: waiting for 'rdbms ipc message'

      timeout=0xa, =0x0, =0x0

      wait_id=432476951 seq_num=11521 snap_id=1

      wait times: snap=2 min 46 sec, exc=2 min 46 sec, total=2 min 46 sec

      wait times: max=0.100000 sec, heur=2 min 46 sec

      wait counts: calls=1 os=1

      in_wait=1 iflags=0x5a8

  Wait State:

    fixed_waits=0 flags=0x22 boundary=0x0/-1

  Session Wait History:

      elapsed time of 0.000015 sec since current wait

   0: waited for 'CGS wait for IPC msg'

      =0x0, =0x0, =0x0

      wait_id=432476950 seq_num=11520 snap_id=1

      wait times: snap=0.000027 sec, exc=0.000027 sec, total=0.000027 sec

      wait times: max=0.000000 sec

      wait counts: calls=1 os=1

      occurred after 0.000138 sec of elapsed time

   1: waited for 'rdbms ipc message'

      timeout=0xa, =0x0, =0x0

      wait_id=432476949 seq_num=11519 snap_id=1

      wait times: snap=0.102094 sec, exc=0.102094 sec, total=0.102094 sec

      wait times: max=0.100000 sec

      wait counts: calls=1 os=1

      occurred after 0.000015 sec of elapsed time

   2: waited for 'CGS wait for IPC msg'

      =0x0, =0x0, =0x0

      wait_id=432476948 seq_num=11518 snap_id=1

      wait times: snap=0.000022 sec, exc=0.000022 sec, total=0.000022 sec

      wait times: max=0.000000 sec

      wait counts: calls=1 os=1

      occurred after 0.000133 sec of elapsed time

   3: waited for 'rdbms ipc message'

      timeout=0xa, =0x0, =0x0

      wait_id=432476947 seq_num=11517 snap_id=1

      wait times: snap=0.102074 sec, exc=0.102074 sec, total=0.102074 sec

      wait times: max=0.100000 sec

      wait counts: calls=1 os=1

      occurred after 0.000013 sec of elapsed time

   4: waited for 'CGS wait for IPC msg'

      =0x0, =0x0, =0x0

      wait_id=432476946 seq_num=11516 snap_id=1

      wait times: snap=0.000023 sec, exc=0.000023 sec, total=0.000023 sec

      wait times: max=0.000000 sec

      wait counts: calls=1 os=1

      occurred after 0.000119 sec of elapsed time

   5: waited for 'rdbms ipc message'

      timeout=0xa, =0x0, =0x0

      wait_id=432476945 seq_num=11515 snap_id=1

      wait times: snap=0.103086 sec, exc=0.103086 sec, total=0.103086 sec

      wait times: max=0.100000 sec

      wait counts: calls=1 os=1

      occurred after 0.000012 sec of elapsed time

   6: waited for 'CGS wait for IPC msg'

      =0x0, =0x0, =0x0

      wait_id=432476944 seq_num=11514 snap_id=1

      wait times: snap=0.000025 sec, exc=0.000025 sec, total=0.000025 sec

      wait times: max=0.000000 sec

      wait counts: calls=1 os=1

      occurred after 0.000152 sec of elapsed time

   7: waited for 'rdbms ipc message'

      timeout=0xa, =0x0, =0x0

      wait_id=432476943 seq_num=11513 snap_id=1

      wait times: snap=0.103090 sec, exc=0.103090 sec, total=0.103090 sec

      wait times: max=0.100000 sec

      wait counts: calls=1 os=1

      occurred after 0.000016 sec of elapsed time

   8: waited for 'CGS wait for IPC msg'

      =0x0, =0x0, =0x0

      wait_id=432476942 seq_num=11512 snap_id=1

      wait times: snap=0.000025 sec, exc=0.000025 sec, total=0.000025 sec

      wait times: max=0.000000 sec

      wait counts: calls=1 os=1

      occurred after 0.000272 sec of elapsed time

   9: waited for 'rdbms ipc message'

      timeout=0xa, =0x0, =0x0

      wait_id=432476941 seq_num=11511 snap_id=1

      wait times: snap=0.102084 sec, exc=0.102084 sec, total=0.102084 sec

      wait times: max=0.100000 sec

      wait counts: calls=1 os=1

      occurred after 0.000013 sec of elapsed time

  Sampled Session History of session 145 serial 1

  ---------------------------------------------------

  The sampled session history is constructed by sampling

  the target session every 1 second. The sampling process

  captures at each sample if the session is in a non-idle wait,

  an idle wait, or not in a wait. If the session is in a

  non-idle wait then one interval is shown for all the samples

  the session was in the same non-idle wait. If the

  session is in an idle wait or not in a wait for

  consecutive samples then one interval is shown for all

  the consecutive samples. Though we display these consecutive

  samples  in a single interval the session may NOT be continuously

  idle or not in a wait (the sampling process does not know).

  The history is displayed in reverse chronological order.

  sample interval: 1 sec, max history 120 sec

  ---------------------------------------------------

    [121 samples,                                          21:08:20 - 21:10:20]

      idle wait at each sample

  temporary object counter: 0

    ----------------------------------------

    Virtual Thread:

    kgskvt: 3fc434ce8, sess: 3ffdb38d8 sid: 145 ser: 1

    vc: 0, proc: 400b0c258, id: 145

    consumer group cur:  (upd? 0), mapped: _ORACLE_BACKGROUND_GROUP_, orig:

    vt_state: 0x100, vt_flags: 0x4030, blkrun: 0, numa: 1

    inwait: 0, short wait event: 0 posted_run: 0

    location where insched last set: kgskthrrun

    location where insched last cleared: kgskthrrun1

    location where inwait last set: NULL

    location where inwait last cleared: NULL

    is_assigned: 0, in_sched: 0 (0)

    qcls: 0, qlink: FALSE

    vt_active: 0 (pending: 0)

    vt_pq_active: 0, dop: 0

    used quanta (usecs):

    stmt: 0, accum: 0, mapped: 0, tot: 0

    cpu start time: 0

    idle time: 0, active time: 0 (cg: 0)

    cpu yields:

    stmt: 0, accum: 0, mapped: 0, tot: 0

    cpu waits:

    stmt: 0, accum: 0, mapped: 0, tot: 0

    cpu wait time (usecs):

    stmt: 0, accum: 0, mapped: 0, tot: 0

    io waits:

    stmt: 0, accum: 0, mapped: 0, tot: 0

    io wait time:

    stmt: 0, accum: 0, mapped: 0, tot: 0

    ASL queued time outs: 0, time: 0 (cur 0, cg 0)

    PQQ queued time outs: 0, time: 0 (cur 0, cg 0)

    Queue timeout violation: 0

    calls aborted: 0, num est exec limit hit: 0

    undo current: 0k max: 0k

    I/O credits acquired:small=0 large=0

    I/O credits waiting for:small=0 large=0

    KTU Session Commit Cache Dump for IDLs:

    KTU Session Commit Cache Dump for Non-IDLs:

    ----------------------------------------

    KKS-UOL used : 0 locks(used=0, free=0)

    KGX Atomic Operation Log 3eae04a58

     Mutex 0(0, 0) idn 0 oper NONE(0)

     FSO mutex uid 145 efd 0 whr 0 slp 0

    KGX Atomic Operation Log 3eae04aa8

     Mutex 0(0, 0) idn 0 oper NONE(0)

     FSO mutex uid 145 efd 0 whr 0 slp 0

    KGX Atomic Operation Log 3eae04af8

     Mutex 0(0, 0) idn 0 oper NONE(0)

     FSO mutex uid 145 efd 0 whr 0 slp 0

    KGX Atomic Operation Log 3eae04b48

     Mutex 0(0, 0) idn 0 oper NONE(0)

     FSO mutex uid 145 efd 0 whr 0 slp 0

    ----------------------------------------

    KGL-UOL SO Cache(total=0, free=0)

    KGX Atomic Operation Log 3eae047a0

     Mutex 0(0, 0) idn 0 oper NONE(0)

     Library Cache uid 145 efd 0 whr 0 slp 0

     oper=0 pt1=0 pt2=0 pt3=0

     pt4=0 pt5=0 ub4=0

    KGX Atomic Operation Log 3eae047f8

     Mutex 0(0, 0) idn 0 oper NONE(0)

     Library Cache uid 145 efd 0 whr 0 slp 0

     oper=0 pt1=0 pt2=0 pt3=0

     pt4=0 pt5=0 ub4=0

    KGX Atomic Operation Log 3eae04850

     Mutex 0(0, 0) idn 0 oper NONE(0)

     Library Cache uid 145 efd 0 whr 0 slp 0

     oper=0 pt1=0 pt2=0 pt3=0

     pt4=0 pt5=0 ub4=0

    KGX Atomic Operation Log 3eae048a8

     Mutex 0(0, 0) idn 0 oper NONE(0)

     Library Cache uid 145 efd 0 whr 0 slp 0

     oper=0 pt1=0 pt2=0 pt3=0

     pt4=0 pt5=0 ub4=0

    KGX Atomic Operation Log 3eae04900

     Mutex 0(0, 0) idn 0 oper NONE(0)

     Library Cache uid 145 efd 0 whr 0 slp 0

     oper=0 pt1=0 pt2=0 pt3=0

     pt4=0 pt5=0 ub4=0

    KGX Atomic Operation Log 3eae04958

     Mutex 0(0, 0) idn 0 oper NONE(0)

     Library Cache uid 145 efd 0 whr 0 slp 0

     oper=0 pt1=0 pt2=0 pt3=0

     pt4=0 pt5=0 ub4=0

    KGX Atomic Operation Log 3eae049b0

     Mutex 0(0, 0) idn 0 oper NONE(0)

     Library Cache uid 145 efd 0 whr 0 slp 0

     oper=0 pt1=0 pt2=0 pt3=0

     pt4=0 pt5=0 ub4=0

    ----------------------------------------

    SO: 0x3f464e838, type: 57, owner: 0x3ffdb38d8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3

     proc=0x400b0c258, name=dummy, file=ktccts.h LINE:415 ID:, pg=0

    (dummy) nxc=0, nlb=0  

===[ Callstack ]===

*** 2019-05-21 21:10:20.275

Process diagnostic dump for [email protected] (LMON), OS id=4604,

pid: 9, proc_ser: 1, sid: 145, sess_ser: 1

-------------------------------------------------------------------------------

os thread scheduling delay history: (sampling every 1.000000 secs)

  0.000000 secs at [ 21:10:19 ]

    NOTE: scheduling delay has not been sampled for 0.505539 secs  0.000000 secs from [ 21:10:15 - 21:10:20 ], 5 sec avg

  0.000000 secs from [ 21:09:20 - 21:10:20 ], 1 min avg

  0.000000 secs from [ 21:05:21 - 21:10:20 ], 5 min avg

*** 2019-05-21 21:10:21.374

loadavg : 20.48 16.00 13.13

swap info: free_mem = 194124.91M rsv = 258509.04M

           alloc = 239225.71M avail = 235374.84M swap_free = 254658.16M

 F S      UID   PID  PPID   C PRI NI     ADDR     SZ    WCHAN    STIME TTY         TIME CMD

 0 O     grid  4604  4596   1  79 20        ? 168184            Sep 15 ?         781:33 asm_lmon_+ASM2

Short stack dump:

ksedsts()+380<-ksdxfstk()+52<-ksdxcb()+3592<-sspuser()+140<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-_syscall6()+32<-thr_sigsetmask()+512<-sslssalck()+152<-sskgxp_alarm_set()+44<-skgxp_twait()+376<-sslsshandler()+712<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-__pollsys()+8<-_pollsys()+232<-_poll()+140<-ssskgxp_poll()+36<-sskgxp_selectex()+224<-skgxpiwait()+6456<-skgxpwaiti()+5044<-skgxpwait()+984<-ksxpwait()+3360<-ksliwat()+10676<-kslwait()+240<-ksarcv()+212<-kjclwmg()+36<-kjfcln()+4284<-ksbrdp()+1720<-opirip()+1680<-opidrv()+748<-sou2o()+88<-opimai_real()+512<-ssthrdmain()+324<-main()+316<-_start()+380

*** 2019-05-21 21:10:22.421

==============================

LMON (ospid: 4604) has not moved for 176 sec (1558444222.1558444046)

kjzduptcctx: Notifying DIAG for crash event

----- Abridged Call Stack Trace -----

ksedsts()+1320<-kjzdicrshnfy()+388<-ksuitm()+1084<-kjgcr_KillInstance()+148<-kjgcr_Main()+7564<-ksbrdp()+1720<-opirip()+1680<-opidrv()+748<-sou2o()+88<-opimai_real()+512<-ssthrdmain()+324<-main()+316<-_start()+380

----- End of Abridged Call Stack Trace -----

    6、主机启动时间

[email protected]# hostname

ora10

[email protected]# uname -a

SunOS ora10 5.11 11.3 sun4v sparc sun4v

[email protected]# uptime

 11:28pm  up 248 day(s), 15:48,  8 users,  load average: 6.21, 7.27, 8.09

    ​6、根据ORA-15046和ORA-29770查询和TRC 堆栈调用信息提示:

ksedsts()+380<-ksdxfstk()+52<-ksdxcb()+3592<-sspuser()+140<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-_syscall6()+32<-thr_sigsetmask()+512<-sslssalck()+152<-sskgxp_alarm_set()+44<-skgxp_twait()+376<-sslsshandler()+712<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-__pollsys()+8<-_pollsys()+232<-_poll()+140<-ssskgxp_poll()+36<-sskgxp_selectex()+224<-skgxpiwait()+6456<-skgxpwaiti()+5044<-skgxpwait()+984<-ksxpwait()+3360<-ksliwat()+10676<-kslwait()+240<-ksarcv()+212<-kjclwmg()+36<-kjfcln()+4284<-ksbrdp()+1720<-opirip()+1680<-opidrv()+748<-sou2o()+88<-opimai_real()+512<-ssthrdmain()+324<-main()+316<-_start()+380ORACLE MOS官网,发现完全匹配上:(Doc ID 2159643.1)

Solaris: Process spins/ASM and DB Crash if RAC Instance Is Up For > 248 Days by LMHB with ORA-29770 

    ​7、后续查看问题系统补丁安装情况,发现rac所有节点的DB已经应用上Doc ID 2159643.1提到的补丁25142535

[email protected]$opatch lsinventory|grep 25142535

Patch  25142535     : applied on Tue Jul 31 16:26:30 GMT+08:00 2018

[email protected]$

[email protected]$opatch lsinventory|grep 25142535

Patch  25142535     : applied on Tue Jul 31 16:26:30 GMT+08:00 2018

[email protected]$

    ​  但是GRID GI没有应用补丁 25142535

[email protected]$ opatch lsinventory|grep  25142535

[email protected]

[email protected]$ opatch lsinventory|grep  25142535

[email protected]

    ​  8、就此问题咨询MOS,ORACLE MOS给出的官方回复也没明确提示补丁25142535确实是否需要在GI上应用

红色警报 ORACLE RAC 11.2.0.4 FOR SOLARIS 10 ASM 和DB因集群心跳丢失重启

    ​  9、补丁25142535是补丁18740837和 10194190的合集,其中18740837小补丁的readme中有提示RAC环境需要在GI应用

红色警报 ORACLE RAC 11.2.0.4 FOR SOLARIS 10 ASM 和DB因集群心跳丢失重启

    ​10、问题分析结论:

    ​由目前的实际问题情况及8和9的信息汇总,可以确定补丁25142535需要在oracle 11.2.0.4 rac for solaris的所有节点的

GI集群软件上应用。可以肯定的是:当前的oracle 11.2.0.4 rac for solaris asm and db crash and restart确实是rac在

solaris上运行248天导致的,在集群所有节点DB软件上打补丁25142535不能解决,官方也没有明确说在GI软件上同时应

用补丁25142535后到底能否解决:

Solaris: Process spins/ASM and DB Crash if RAC Instance Is Up For > 248 Days by LMHB with ORA-29770 

    ​

    ​如果有类似问题的客户看到博文,如果有通过GI打补丁25142535解决此问题,或者通过其他途径解决此问题欢迎:

评论留言,感谢!