红色警报 ORACLE RAC 11.2.0.4 FOR SOLARIS 10 ASM 和DB因集群心跳丢失重启
ORACLE RAC 11.2.0.4 FOR SOLARIS 10 ASM 和DB因集群心跳丢失重启。该问题是BUG 10194190 18740837
导致的,修复该问题需要打patch 25142535。有一客户的solaris oracle 11.2.0.4 rac所有节点的DB已经打上补丁,但是
主机运行248天后,跑在上边的Oracle 11.2.0.4依然因集群心跳丢失而重启。根据ORACLE MOS官方回复,修复该BUG
的一个补丁包 patch 18740837是需要同时在GI软件上应用。
下边是相关客户案例的问题分析及解决处理总结。
1、问题节点DB告警日志报错提示
2、问题节点ASM告警日志报错提示
3、+ASM2_lmhb_4609_i78497.trc文件内容
4、OCSSD日志提示心跳超时
5、+ASM2_lmhb_4609_i78497.trc文件局部信息
===[ Session State Object ]===
----------------------------------------
SO: 0x3ffdb38d8, type: 4, owner: 0x400b0c258, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0x400b0c258, name=session, file=ksu.h LINE:12729 ID:, pg=0
(session) sid: 145 ser: 1 trans: 0x0, creator: 0x400b0c258
flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
flags2: (0x409) -/-/INC
DID: , short-term DID:
txn branch: 0x0
edition#: 0 oct: 0, prv: 0, sql: 0x0, psql: 0x0, user: 0/SYS
ksuxds FALSE at location: 0
service name: SYS$BACKGROUND
Current Wait Stack:
0: waiting for 'rdbms ipc message'
timeout=0xa, =0x0, =0x0
wait_id=432476951 seq_num=11521 snap_id=1
wait times: snap=2 min 46 sec, exc=2 min 46 sec, total=2 min 46 sec
wait times: max=0.100000 sec, heur=2 min 46 sec
wait counts: calls=1 os=1
in_wait=1 iflags=0x5a8
Wait State:
fixed_waits=0 flags=0x22 boundary=0x0/-1
Session Wait History:
elapsed time of 0.000015 sec since current wait
0: waited for 'CGS wait for IPC msg'
=0x0, =0x0, =0x0
wait_id=432476950 seq_num=11520 snap_id=1
wait times: snap=0.000027 sec, exc=0.000027 sec, total=0.000027 sec
wait times: max=0.000000 sec
wait counts: calls=1 os=1
occurred after 0.000138 sec of elapsed time
1: waited for 'rdbms ipc message'
timeout=0xa, =0x0, =0x0
wait_id=432476949 seq_num=11519 snap_id=1
wait times: snap=0.102094 sec, exc=0.102094 sec, total=0.102094 sec
wait times: max=0.100000 sec
wait counts: calls=1 os=1
occurred after 0.000015 sec of elapsed time
2: waited for 'CGS wait for IPC msg'
=0x0, =0x0, =0x0
wait_id=432476948 seq_num=11518 snap_id=1
wait times: snap=0.000022 sec, exc=0.000022 sec, total=0.000022 sec
wait times: max=0.000000 sec
wait counts: calls=1 os=1
occurred after 0.000133 sec of elapsed time
3: waited for 'rdbms ipc message'
timeout=0xa, =0x0, =0x0
wait_id=432476947 seq_num=11517 snap_id=1
wait times: snap=0.102074 sec, exc=0.102074 sec, total=0.102074 sec
wait times: max=0.100000 sec
wait counts: calls=1 os=1
occurred after 0.000013 sec of elapsed time
4: waited for 'CGS wait for IPC msg'
=0x0, =0x0, =0x0
wait_id=432476946 seq_num=11516 snap_id=1
wait times: snap=0.000023 sec, exc=0.000023 sec, total=0.000023 sec
wait times: max=0.000000 sec
wait counts: calls=1 os=1
occurred after 0.000119 sec of elapsed time
5: waited for 'rdbms ipc message'
timeout=0xa, =0x0, =0x0
wait_id=432476945 seq_num=11515 snap_id=1
wait times: snap=0.103086 sec, exc=0.103086 sec, total=0.103086 sec
wait times: max=0.100000 sec
wait counts: calls=1 os=1
occurred after 0.000012 sec of elapsed time
6: waited for 'CGS wait for IPC msg'
=0x0, =0x0, =0x0
wait_id=432476944 seq_num=11514 snap_id=1
wait times: snap=0.000025 sec, exc=0.000025 sec, total=0.000025 sec
wait times: max=0.000000 sec
wait counts: calls=1 os=1
occurred after 0.000152 sec of elapsed time
7: waited for 'rdbms ipc message'
timeout=0xa, =0x0, =0x0
wait_id=432476943 seq_num=11513 snap_id=1
wait times: snap=0.103090 sec, exc=0.103090 sec, total=0.103090 sec
wait times: max=0.100000 sec
wait counts: calls=1 os=1
occurred after 0.000016 sec of elapsed time
8: waited for 'CGS wait for IPC msg'
=0x0, =0x0, =0x0
wait_id=432476942 seq_num=11512 snap_id=1
wait times: snap=0.000025 sec, exc=0.000025 sec, total=0.000025 sec
wait times: max=0.000000 sec
wait counts: calls=1 os=1
occurred after 0.000272 sec of elapsed time
9: waited for 'rdbms ipc message'
timeout=0xa, =0x0, =0x0
wait_id=432476941 seq_num=11511 snap_id=1
wait times: snap=0.102084 sec, exc=0.102084 sec, total=0.102084 sec
wait times: max=0.100000 sec
wait counts: calls=1 os=1
occurred after 0.000013 sec of elapsed time
Sampled Session History of session 145 serial 1
---------------------------------------------------
The sampled session history is constructed by sampling
the target session every 1 second. The sampling process
captures at each sample if the session is in a non-idle wait,
an idle wait, or not in a wait. If the session is in a
non-idle wait then one interval is shown for all the samples
the session was in the same non-idle wait. If the
session is in an idle wait or not in a wait for
consecutive samples then one interval is shown for all
the consecutive samples. Though we display these consecutive
samples in a single interval the session may NOT be continuously
idle or not in a wait (the sampling process does not know).
The history is displayed in reverse chronological order.
sample interval: 1 sec, max history 120 sec
---------------------------------------------------
[121 samples, 21:08:20 - 21:10:20]
idle wait at each sample
temporary object counter: 0
----------------------------------------
Virtual Thread:
kgskvt: 3fc434ce8, sess: 3ffdb38d8 sid: 145 ser: 1
vc: 0, proc: 400b0c258, id: 145
consumer group cur: (upd? 0), mapped: _ORACLE_BACKGROUND_GROUP_, orig:
vt_state: 0x100, vt_flags: 0x4030, blkrun: 0, numa: 1
inwait: 0, short wait event: 0 posted_run: 0
location where insched last set: kgskthrrun
location where insched last cleared: kgskthrrun1
location where inwait last set: NULL
location where inwait last cleared: NULL
is_assigned: 0, in_sched: 0 (0)
qcls: 0, qlink: FALSE
vt_active: 0 (pending: 0)
vt_pq_active: 0, dop: 0
used quanta (usecs):
stmt: 0, accum: 0, mapped: 0, tot: 0
cpu start time: 0
idle time: 0, active time: 0 (cg: 0)
cpu yields:
stmt: 0, accum: 0, mapped: 0, tot: 0
cpu waits:
stmt: 0, accum: 0, mapped: 0, tot: 0
cpu wait time (usecs):
stmt: 0, accum: 0, mapped: 0, tot: 0
io waits:
stmt: 0, accum: 0, mapped: 0, tot: 0
io wait time:
stmt: 0, accum: 0, mapped: 0, tot: 0
ASL queued time outs: 0, time: 0 (cur 0, cg 0)
PQQ queued time outs: 0, time: 0 (cur 0, cg 0)
Queue timeout violation: 0
calls aborted: 0, num est exec limit hit: 0
undo current: 0k max: 0k
I/O credits acquired:small=0 large=0
I/O credits waiting for:small=0 large=0
KTU Session Commit Cache Dump for IDLs:
KTU Session Commit Cache Dump for Non-IDLs:
----------------------------------------
KKS-UOL used : 0 locks(used=0, free=0)
KGX Atomic Operation Log 3eae04a58
Mutex 0(0, 0) idn 0 oper NONE(0)
FSO mutex uid 145 efd 0 whr 0 slp 0
KGX Atomic Operation Log 3eae04aa8
Mutex 0(0, 0) idn 0 oper NONE(0)
FSO mutex uid 145 efd 0 whr 0 slp 0
KGX Atomic Operation Log 3eae04af8
Mutex 0(0, 0) idn 0 oper NONE(0)
FSO mutex uid 145 efd 0 whr 0 slp 0
KGX Atomic Operation Log 3eae04b48
Mutex 0(0, 0) idn 0 oper NONE(0)
FSO mutex uid 145 efd 0 whr 0 slp 0
----------------------------------------
KGL-UOL SO Cache(total=0, free=0)
KGX Atomic Operation Log 3eae047a0
Mutex 0(0, 0) idn 0 oper NONE(0)
Library Cache uid 145 efd 0 whr 0 slp 0
oper=0 pt1=0 pt2=0 pt3=0
pt4=0 pt5=0 ub4=0
KGX Atomic Operation Log 3eae047f8
Mutex 0(0, 0) idn 0 oper NONE(0)
Library Cache uid 145 efd 0 whr 0 slp 0
oper=0 pt1=0 pt2=0 pt3=0
pt4=0 pt5=0 ub4=0
KGX Atomic Operation Log 3eae04850
Mutex 0(0, 0) idn 0 oper NONE(0)
Library Cache uid 145 efd 0 whr 0 slp 0
oper=0 pt1=0 pt2=0 pt3=0
pt4=0 pt5=0 ub4=0
KGX Atomic Operation Log 3eae048a8
Mutex 0(0, 0) idn 0 oper NONE(0)
Library Cache uid 145 efd 0 whr 0 slp 0
oper=0 pt1=0 pt2=0 pt3=0
pt4=0 pt5=0 ub4=0
KGX Atomic Operation Log 3eae04900
Mutex 0(0, 0) idn 0 oper NONE(0)
Library Cache uid 145 efd 0 whr 0 slp 0
oper=0 pt1=0 pt2=0 pt3=0
pt4=0 pt5=0 ub4=0
KGX Atomic Operation Log 3eae04958
Mutex 0(0, 0) idn 0 oper NONE(0)
Library Cache uid 145 efd 0 whr 0 slp 0
oper=0 pt1=0 pt2=0 pt3=0
pt4=0 pt5=0 ub4=0
KGX Atomic Operation Log 3eae049b0
Mutex 0(0, 0) idn 0 oper NONE(0)
Library Cache uid 145 efd 0 whr 0 slp 0
oper=0 pt1=0 pt2=0 pt3=0
pt4=0 pt5=0 ub4=0
----------------------------------------
SO: 0x3f464e838, type: 57, owner: 0x3ffdb38d8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0x400b0c258, name=dummy, file=ktccts.h LINE:415 ID:, pg=0
(dummy) nxc=0, nlb=0
===[ Callstack ]===
*** 2019-05-21 21:10:20.275
Process diagnostic dump for [email protected] (LMON), OS id=4604,
pid: 9, proc_ser: 1, sid: 145, sess_ser: 1
-------------------------------------------------------------------------------
os thread scheduling delay history: (sampling every 1.000000 secs)
0.000000 secs at [ 21:10:19 ]
NOTE: scheduling delay has not been sampled for 0.505539 secs 0.000000 secs from [ 21:10:15 - 21:10:20 ], 5 sec avg
0.000000 secs from [ 21:09:20 - 21:10:20 ], 1 min avg
0.000000 secs from [ 21:05:21 - 21:10:20 ], 5 min avg
*** 2019-05-21 21:10:21.374
loadavg : 20.48 16.00 13.13
swap info: free_mem = 194124.91M rsv = 258509.04M
alloc = 239225.71M avail = 235374.84M swap_free = 254658.16M
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
0 O grid 4604 4596 1 79 20 ? 168184 Sep 15 ? 781:33 asm_lmon_+ASM2
Short stack dump:
ksedsts()+380<-ksdxfstk()+52<-ksdxcb()+3592<-sspuser()+140<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-_syscall6()+32<-thr_sigsetmask()+512<-sslssalck()+152<-sskgxp_alarm_set()+44<-skgxp_twait()+376<-sslsshandler()+712<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-__pollsys()+8<-_pollsys()+232<-_poll()+140<-ssskgxp_poll()+36<-sskgxp_selectex()+224<-skgxpiwait()+6456<-skgxpwaiti()+5044<-skgxpwait()+984<-ksxpwait()+3360<-ksliwat()+10676<-kslwait()+240<-ksarcv()+212<-kjclwmg()+36<-kjfcln()+4284<-ksbrdp()+1720<-opirip()+1680<-opidrv()+748<-sou2o()+88<-opimai_real()+512<-ssthrdmain()+324<-main()+316<-_start()+380
*** 2019-05-21 21:10:22.421
==============================
LMON (ospid: 4604) has not moved for 176 sec (1558444222.1558444046)
kjzduptcctx: Notifying DIAG for crash event
----- Abridged Call Stack Trace -----
ksedsts()+1320<-kjzdicrshnfy()+388<-ksuitm()+1084<-kjgcr_KillInstance()+148<-kjgcr_Main()+7564<-ksbrdp()+1720<-opirip()+1680<-opidrv()+748<-sou2o()+88<-opimai_real()+512<-ssthrdmain()+324<-main()+316<-_start()+380
----- End of Abridged Call Stack Trace -----
6、主机启动时间
[email protected]# hostname
ora10
[email protected]# uname -a
SunOS ora10 5.11 11.3 sun4v sparc sun4v
[email protected]# uptime
11:28pm up 248 day(s), 15:48, 8 users, load average: 6.21, 7.27, 8.09
6、根据ORA-15046和ORA-29770查询和TRC 堆栈调用信息提示:
ksedsts()+380<-ksdxfstk()+52<-ksdxcb()+3592<-sspuser()+140<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-_syscall6()+32<-thr_sigsetmask()+512<-sslssalck()+152<-sskgxp_alarm_set()+44<-skgxp_twait()+376<-sslsshandler()+712<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-__pollsys()+8<-_pollsys()+232<-_poll()+140<-ssskgxp_poll()+36<-sskgxp_selectex()+224<-skgxpiwait()+6456<-skgxpwaiti()+5044<-skgxpwait()+984<-ksxpwait()+3360<-ksliwat()+10676<-kslwait()+240<-ksarcv()+212<-kjclwmg()+36<-kjfcln()+4284<-ksbrdp()+1720<-opirip()+1680<-opidrv()+748<-sou2o()+88<-opimai_real()+512<-ssthrdmain()+324<-main()+316<-_start()+380ORACLE MOS官网,发现完全匹配上:(Doc ID 2159643.1)
Solaris: Process spins/ASM and DB Crash if RAC Instance Is Up For > 248 Days by LMHB with ORA-29770
7、后续查看问题系统补丁安装情况,发现rac所有节点的DB已经应用上Doc ID 2159643.1提到的补丁25142535
[email protected]$opatch lsinventory|grep 25142535
Patch 25142535 : applied on Tue Jul 31 16:26:30 GMT+08:00 2018
[email protected]$opatch lsinventory|grep 25142535
Patch 25142535 : applied on Tue Jul 31 16:26:30 GMT+08:00 2018
但是GRID GI没有应用补丁 25142535
[email protected]$ opatch lsinventory|grep 25142535
[email protected]$ opatch lsinventory|grep 25142535
8、就此问题咨询MOS,ORACLE MOS给出的官方回复也没明确提示补丁25142535确实是否需要在GI上应用
9、补丁25142535是补丁18740837和 10194190的合集,其中18740837小补丁的readme中有提示RAC环境需要在GI应用
10、问题分析结论:
由目前的实际问题情况及8和9的信息汇总,可以确定补丁25142535需要在oracle 11.2.0.4 rac for solaris的所有节点的
GI集群软件上应用。可以肯定的是:当前的oracle 11.2.0.4 rac for solaris asm and db crash and restart确实是rac在
solaris上运行248天导致的,在集群所有节点DB软件上打补丁25142535不能解决,官方也没有明确说在GI软件上同时应
用补丁25142535后到底能否解决:
Solaris: Process spins/ASM and DB Crash if RAC Instance Is Up For > 248 Days by LMHB with ORA-29770
如果有类似问题的客户看到博文,如果有通过GI打补丁25142535解决此问题,或者通过其他途径解决此问题欢迎:
评论留言,感谢!