A Chain of Failures Triggered by Stopping the NFS Service

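1. df -h and ls -l hang

The cause: RAC node 1 (racj1) had mounted an NFS export from another machine, 232.100. The NFS service on that server had to be shut down for security hardening, and it was stopped directly without first unmounting the NFS directory on racj1. That left commands on RAC node 1 hanging.

To investigate, open another session on racj1 and check the NFS mounts with the mount command: the server's NFS directory was still mounted on racj1.

I then forced the issue with fuser -ck /mnt, after which every connection to racj1 was dropped. After about 5 seconds I could reconnect, and df -h and similar commands ran normally. (The safe approach would have been to restart the NFS service on the server first, so that the client racj1 recovered, and only then unmount the directory; that would have avoided the chain of failures that followed. I did not expect fuser -ck to cause this much trouble.)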
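A minimal sketch for finding the NFS mounts on a client without touching the unreachable server. It reads the kernel mount table instead of calling df, which has to query every mounted file system. The function name list_nfs_mounts and its optional file argument are illustrative assumptions, not part of the original procedure.

```shell
#!/bin/sh
# List NFS mounts from a mount table in /proc/mounts format
# (fields: device mountpoint fstype options dump pass).
# list_nfs_mounts is a hypothetical helper name.
list_nfs_mounts() {
    # Match fstype "nfs" or "nfs4"; print "server:/export mountpoint".
    awk '$3 ~ /^nfs4?$/ { print $1, $2 }' "${1:-/proc/mounts}"
}
```

Called with no argument on racj1 it reads /proc/mounts; each output line is the remote export followed by the local mount point, which is the list to unmount before the server's NFS service is stopped.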

 

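At this point I found that the RAC cluster services on node 1 (racj1) had gone down, and all of the database ora_ processes had disappeared.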

 

As root, I started CRS manually; the cluster was back to normal after 1 minute:

/oracle/app/11.2.0/grid/bin/crsctl start crs

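I then found that /ogg was not mounted on node 1, while it was still mounted on node 2. OGG uses the ACFS shared cluster file system.

Every attempt to start it failed.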

[[email protected] ~]$ asmcmd

ASMCMD> volinfo -a

Diskgroup Name: OGGDG

 

         Volume Name: OGGVOL

         Volume Device: /dev/asm/oggvol-141

         State: ENABLED

         Size (MB): 409600

         Resize Unit (MB): 32

         Redundancy: UNPROT

         Stripe Columns: 4

         Stripe Width (K): 128

         Usage: ACFS

         Mountpath: /ogg

Ran /oracle/app/11.2.0/grid/bin/srvctl stop filesystem -d /dev/asm/oggvol-141

After this, /ogg on node 2 was unmounted as well.

Tried to start it again, but it reported a failure.

[[email protected] ~]# /oracle/app/11.2.0/grid/bin/srvctl start filesystem -d /dev/asm/oggvol-141

PRCR-1079 : Failed to start resource ora.oggdg.oggvol.acfs

CRS-5016: Process "/oracle/app/11.2.0/grid/bin/acfssinglefsmount" spawned by agent "/oracle/app/11.2.0/grid/bin/orarootagent.bin" for action "start" failed: details at "(:CLSN00010:)" in "/oracle/app/11.2.0/grid/log/racj1/agent/crsd/orarootagent_root//orarootagent_root.log"

CRS-5016: Process "/sbin/acfsutil" spawned by agent "/oracle/app/11.2.0/grid/bin/orarootagent.bin" for action "start" failed: details at "(:CLSN00010:)" in "/oracle/app/11.2.0/grid/log/racj1/agent/crsd/orarootagent_root//orarootagent_root.log"

CRS-2674: Start of 'ora.oggdg.oggvol.acfs' on 'racj1' failed

[[email protected] ~]# /oracle/app/11.2.0/grid/bin/srvctl stop filesystem -d /dev/asm/oggvol-141

[[email protected] ~]# /oracle/app/11.2.0/grid/bin/srvctl start filesystem -d /dev/asm/oggvol-141

PRCR-1079 : Failed to start resource ora.oggdg.oggvol.acfs

CRS-5016: Process "/oracle/app/11.2.0/grid/bin/acfssinglefsmount" spawned by agent "/oracle/app/11.2.0/grid/bin/orarootagent.bin" for action "start" failed: details at "(:CLSN00010:)" in "/oracle/app/11.2.0/grid/log/racj1/agent/crsd/orarootagent_root//orarootagent_root.log"

CRS-5016: Process "/oracle/app/11.2.0/grid/bin/acfssinglefsmount" spawned by agent "/oracle/app/11.2.0/grid/bin/orarootagent.bin" for action "start" failed: details at "(:CLSN00010:)" in "/oracle/app/11.2.0/grid/log/racj2/agent/crsd/orarootagent_root//orarootagent_root.log"

CRS-5016: Process "/sbin/acfsutil" spawned by agent "/oracle/app/11.2.0/grid/bin/orarootagent.bin" for action "start" failed: details at "(:CLSN00010:)" in "/oracle/app/11.2.0/grid/log/racj1/agent/crsd/orarootagent_root//orarootagent_root.log"

CRS-2674: Start of 'ora.oggdg.oggvol.acfs' on 'racj1' failed

CRS-5016: Process "/sbin/acfsutil" spawned by agent "/oracle/app/11.2.0/grid/bin/orarootagent.bin" for action "start" failed: details at "(:CLSN00010:)" in "/oracle/app/11.2.0/grid/log/racj2/agent/crsd/orarootagent_root//orarootagent_root.log"

CRS-2674: Start of 'ora.oggdg.oggvol.acfs' on 'racj2' failed

[[email protected] ~]# more

 

The log did in fact contain the key error; I just did not notice it at the time:

 

[[email protected] ~]$ srvctl stop filesystem -d /dev/asm/oggvol-141

 [[email protected] ~]$ acfsutil registry -f -a /dev/asm/oggvol-141 /ogg

acfsutil registry: CLSU-00100: Operating System function: open64 failed with error data: 13

acfsutil registry: CLSU-00101: Operating System error message: Permission denied

acfsutil registry: CLSU-00103: error location: OOF_1

acfsutil registry: CLSU-00104: additional error information: open64 (/dev/asm/oggvol-141)

acfsutil registry: ACFS-03141: unable to open device /dev/asm/oggvol-141

At this point I suspected a permissions problem, and comparing nodes 1 and 2 confirmed it:

The permissions on /dev/asm/oggvol-141 on racj1 were wrong.

[[email protected] orarootagent_root]# ls -l /dev/asm/oggvol-141

brw------- 1 root root 251, 72193 Apr 25 09:18 /dev/asm/oggvol-141

[[email protected] orarootagent_root]# chown root:asmadmin /dev/asm/oggvol-141

[[email protected] orarootagent_root]# ls -l /dev/asm/oggvol-141

brw------- 1 root asmadmin 251, 72193 Apr 25 09:18 /dev/asm/oggvol-141

[[email protected] orarootagent_root]# chmod 770 /dev/asm/oggvol-141

[[email protected] orarootagent_root]# ls -l /dev/asm/oggvol-141

brwxrwx--- 1 root asmadmin 251, 72193 Apr 25 09:18 /dev/asm/oggvol-141
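Comparing the device node across the two nodes is easier with one line of output per node. The sketch below assumes GNU stat (the usual stat on Linux); the function name dev_perms is hypothetical.

```shell
#!/bin/sh
# Print "owner:group octal-mode" for a path, e.g. an ADVM volume device.
# Assumes GNU coreutils stat; dev_perms is an illustrative name.
dev_perms() {
    stat -c '%U:%G %a' "$1"
}
```

Running dev_perms /dev/asm/oggvol-141 on racj1 and on racj2 and comparing the two lines shows at a glance whether the owner, group, or mode differs between the nodes.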

crsctl status resource -t showed that ora.oggdg.oggvol.acfs was offline.

Trying to start it failed:

Restarting the ACFS resource also failed:

crsctl stop resource ora.oggdg.oggvol.acfs

crsctl start resource ora.oggdg.oggvol.acfs

Mounting it manually as root reported an error:

mount -t acfs -rw /dev/asm/oggvol-141 /ogg

[[email protected] orarootagent_root]# mount -t acfs -rw /dev/asm/oggvol-141 /ogg

mount: wrong fs type, bad option, bad superblock on /dev/asm/oggvol-141,

       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try

       dmesg | tail  or so

 

mount.acfs: CLSU-00100: Operating System function: mount failed with error data: 22

mount.acfs: CLSU-00101: Operating System error message: Invalid argument

mount.acfs: CLSU-00103: error location: MOUNT_3

mount.acfs: ACFS-02126: Volume /dev/asm/oggvol-141 cannot be mounted.

 

Checking dmesg turned up the key message:

ACFSK-0021: FSCK-NEEDED set for volume /dev/asm/oggvol-141 . Internal ACFS Location 838 .
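A small sketch for pulling this hint out of a long kernel log. It only pattern-matches the message text shown above; the function name fsck_needed_volumes is an assumption made for illustration.

```shell
#!/bin/sh
# Read kernel log text on stdin and print each volume device
# that is flagged with "FSCK-NEEDED set for volume <device>".
# fsck_needed_volumes is a hypothetical helper name.
fsck_needed_volumes() {
    sed -n 's/.*FSCK-NEEDED set for volume \([^ ]*\).*/\1/p' | sort -u
}
```

Typical use would be dmesg | fsck_needed_volumes, which prints one device path per flagged volume and nothing when no volume is flagged.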

Following the hint, I ran fsck, which succeeded:

[[email protected] ~]# /sbin/fsck -a -v -y -t acfs /dev/asm/oggvol-141

[[email protected] ~]# su - grid

[[email protected] ~]$ crsctl start resource ora.oggdg.oggvol.acfs

Once the directory was mounted, I went on to start OGG.

2. OGG fails to start because archived logs are missing:

view report gdcq shows the error:

[/ogg/12c/extract(ggs::gglib::MultiThreading::MainThread::ExecMain()+0x60) [0x752c80]]

                          : [/ogg/12c/extract(ggs::gglib::MultiThreading::Thread::RunThread(ggs::gglib::MultiThreading::Thread::ThreadArgs*)+0x14d) [0x753d5d]]

                          : [/ogg/12c/extract(ggs::gglib::MultiThreading::MainThread::Run(int, char**)+0xb1) [0x753e41]]

                          : [/ogg/12c/extract(main+0x3b) [0x6eff1b]]

                          : [/lib64/libc.so.6(__libc_start_main+0xfd) [0x3396a1ed1d]]

                          : [/ogg/12c/extract() [0x69aed1]]

 

2019-04-25 10:12:39  ERROR   OGG-00446  Opening file +ARCHDG/2_5183_986573398.dbf in DBLOGREADER mode: (308) ORA-00308: cannot open archived log '+ARCHDG/2_5183_986573398.dbf'

ORA-17503: ksfdopn:2 Failed to open file +ARCHDG/2_5183_986573398.dbf

ORA-15173: entry '2_5183_986573398.dbf' does not exist in directory '/'

Not able to establish initial position for sequence 5183, rba 1626514448.

 

2019-04-25 10:12:39  ERROR   OGG-01668  PROCESS ABENDING.

 

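The current policy backs up archived logs to the tape library every 4 hours and then deletes them, so the archived logs OGG needed were already gone by the time it was restarted.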
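The thread and sequence OGG is asking for can be read from the file name in the OGG-00446 message. The sketch below assumes the archive file name has the form thread_sequence_resetlogs.dbf, which matches the "sequence 5183" reported above; the function name missing_archivelog is hypothetical.

```shell
#!/bin/sh
# Read an OGG report on stdin and print "thread=N seq=M" for each
# ORA-00308 line. Assumes file names of the form <thread>_<seq>_<resetlogs>.dbf.
# missing_archivelog is an illustrative helper name.
missing_archivelog() {
    sed -n 's/.*ORA-00308: cannot open archived log .*[/]\([0-9][0-9]*\)_\([0-9][0-9]*\)_[0-9][0-9]*\.dbf.*/thread=\1 seq=\2/p'
}
```

Fed the report excerpt above, this gives the starting point for the restore range of the affected thread; the end of the range still has to come from the current online log sequences.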

2.1 Check the current online logs and archived logs:

[[email protected] ~]# su - grid

[[email protected] ~]$ asmcmd

ASMCMD> ls     

ARCHDG/

CRSDG/

DATADG/

OGGDG/

ASMCMD> cd archdg

ASMCMD> ls

GDDB/

ASMCMD> cd gddb

ASMCMD> ls

ARCHIVELOG/

ASMCMD> cd archivelog

ASMCMD> ls

2019_04_25/

ASMCMD> cd 2019*

ASMCMD> ls

thread_1_seq_7409.702.1006510689

ASMCMD> ls

thread_1_seq_7409.702.1006510689

 

SQL> set line 132 wrap off

SQL> select * from v$Log;

truncating (as requested) before column NEXT_CHANGE#

 

 

    GROUP#    THREAD#  SEQUENCE#      BYTES  BLOCKSIZE    MEMBERS ARC STATUS           FIRST_CHANGE# FIRST_TIME          NEXT_TIME

---------- ---------- ---------- ---------- ---------- ---------- --- ---------------- ------------- ------------------- -----------

         1          1       7409 2147483648        512          1 YES ACTIVE              1.5742E+13 2019-04-25 09:56:19 2019-04-25

         2          1       7407 2147483648        512          1 YES INACTIVE            1.5742E+13 2019-04-25 09:13:43 2019-04-25

         3          1       7410 2147483648        512          1 NO  CURRENT             1.5742E+13 2019-04-25 10:18:08

         4          1       7408 2147483648        512          1 YES INACTIVE            1.5742E+13 2019-04-25 09:19:05 2019-04-25

         5          1       7406 2147483648        512          1 YES INACTIVE            1.5742E+13 2019-04-25 08:37:07 2019-04-25

         6          2       5193 2147483648        512          1 NO  CURRENT             1.5742E+13 2019-04-25 09:56:17

         7          2       5189 2147483648        512          1 YES INACTIVE            1.5742E+13 2019-04-25 08:37:04 2019-04-25

         8          2       5190 2147483648        512          1 YES INACTIVE            1.5742E+13 2019-04-25 08:52:05 2019-04-25

         9          2       5191 2147483648        512          1 YES INACTIVE            1.5742E+13 2019-04-25 09:17:26 2019-04-25

        10          2       5192 2147483648        512          1 YES INACTIVE            1.5742E+13 2019-04-25 09:47:27 2019-04-25

 

10 rows selected.

2.2 Log in to RAC node 2, check the NBU backups, and plan the restore of the missing archived logs

 

/usr/openv/netbackup/bin/bplist -S 'nbujxq' -C 'racj2' -t 4 -R -l /

 

 

-rw-rw---- oracle    asmadmin     5052160K 4月 25 09:56 /al_2879_1_1006509383

-rw-rw---- oracle    asmadmin     8343552K 4月 25 09:56 /al_2878_1_1006509383

-rw-rw---- oracle    asmadmin     7269376K 4月 25 09:56 /al_2877_1_1006509383

-rw-rw---- oracle    asmadmin     8310016K 4月 25 09:56 /al_2876_1_1006509383

 

RUN {

allocate channel D1 type 'sbt_tape' parms 'SBT_LIBRARY=/usr/openv/netbackup/bin/libobk.so64';

allocate channel D2 type 'sbt_tape' parms 'SBT_LIBRARY=/usr/openv/netbackup/bin/libobk.so64';

send 'NB_ORA_SERV=nbujxq,NB_ORA_CLIENT=racj2';

restore archivelog from logseq 5183 until logseq 5193 thread 2;

restore archivelog from logseq 7390 until logseq 7408 thread 1;

RELEASE CHANNEL D1;

RELEASE CHANNEL D2;

}

 

2.3 Check the restore:

2.4 Start the OGG extract

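3. Follow-up recommendations:

3.1 Before stopping the NFS service, use showmount -a to check which machines have mounted its directories, and unmount them on those machines first.

3.2 When a problem occurs, stay calm and read the log output carefully, including the operating system logs.

3.3 On servers running OGG, archived logs should not be backed up and deleted too quickly. As a rule, keep them for 1 day before backing up and deleting them, so that a missing log does not cost a lengthy restore.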
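Recommendation 3.1 can be sketched as a pre-check on the NFS server. The helper below parses the usual showmount -a output (one header line, then client:directory pairs); the function name nfs_clients is an assumption. The list showmount reports comes from the server's own records and can be stale or incomplete, so treat it as a hint and confirm on each client.

```shell
#!/bin/sh
# Read `showmount -a` output on stdin and print the unique client names.
# Expects a header line followed by "client:/directory" lines.
# nfs_clients is a hypothetical helper name; IPv6 addresses are not handled.
nfs_clients() {
    awk -F: 'NR > 1 && NF >= 2 { print $1 }' | sort -u
}
```

On the server this would be run as showmount -a | nfs_clients; an empty result suggests no recorded clients, while any host listed should unmount the export before the NFS service is stopped.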