RAC Testing Scenario
Posted by Mir Sayeed Hassan on September 27th, 2017
RAC testing scenarios in our test environment:
Node 1 – 10.20.0.90
Node 2 – 10.20.0.91
FAILURE SCENARIOS:
1 – Node Failure
The failure can be planned or unplanned, and it can affect a single node or all nodes.
Start the workload & shut down one node.
The expected results are (a short verification sketch follows this list):
- Resources on the failed node will go offline
- Instance recovery will be performed by a surviving instance
- The node VIP & SCAN VIPs will fail over to the surviving node
- The SCAN listeners will fail over to the surviving node
- Client connections are moved to the surviving instance
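To verify these points after the node goes down, a quick check from the surviving node can look like the sketch below. This is only an outline, assuming the database name rac and the node names used in this environment; the srvctl/crsctl calls are standard 11.2 commands.
# run from the surviving node (test-rac2) after node 1 goes down
crsctl stat res -t              # newer equivalent of crs_stat -t
srvctl status database -d rac   # shows which instances are still running
srvctl status nodeapps          # the failed node's VIP should be reported on test-rac2
srvctl status scan_listener     # all SCAN listeners should now run on the surviving node
srvctl status vip -n test-rac1  # VIP of the failed node, now hosted by test-rac2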
Now shut down node 1 intentionally (or assume it has gone down for some other reason):
# shutdown -h now
Check the status from node 2:
[oracle@test-rac2 ~]$ crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora.DATA.dg    ora....up.type ONLINE    ONLINE    test-rac2
ora....ER.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora....N1.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora....N2.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora....N3.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora.asm        ora.asm.type   ONLINE    ONLINE    test-rac2
ora.cvu        ora.cvu.type   ONLINE    ONLINE    test-rac2
ora.gsd        ora.gsd.type   OFFLINE   OFFLINE
ora....network ora....rk.type ONLINE    ONLINE    test-rac2
ora.oc4j       ora.oc4j.type  ONLINE    ONLINE    test-rac2
ora.ons        ora.ons.type   ONLINE    ONLINE    test-rac2
ora.rac.db     ora....se.type ONLINE    ONLINE    test-rac2
ora.scan1.vip  ora....ip.type ONLINE    ONLINE    test-rac2
ora.scan2.vip  ora....ip.type ONLINE    ONLINE    test-rac2
ora.scan3.vip  ora....ip.type ONLINE    ONLINE    test-rac2
ora....ac1.vip ora....t1.type ONLINE    ONLINE    test-rac2
ora....SM2.asm application    ONLINE    ONLINE    test-rac2
ora....C2.lsnr application    ONLINE    ONLINE    test-rac2
ora....ac2.gsd application    OFFLINE   OFFLINE
ora....ac2.ons application    ONLINE    ONLINE    test-rac2
ora....ac2.vip ora....t1.type ONLINE    ONLINE    test-rac2
As the output above shows, only the node 2 resources are running and node 1 has failed; note that the node 1 VIP (ora....ac1.vip) is now online on test-rac2.
To recover from the failure, simply restart node 1: once it is back up, the clusterware automatically performs all the startup operations and the ASM & database instances are started.
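A few sanity checks after the reboot, sketched here with standard clusterware commands (run on node 1; the database name rac is the one used in this post):
crsctl check crs                      # CRS, CSS and EVM should all report "online" (run as root or the grid owner)
srvctl status asm -n test-rac1        # the ASM instance +ASM1 should be running again
srvctl status instance -d rac -i rac1 # the database instance rac1 should be running again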
After node 1 has restarted, verify the status again with crs_stat -t:
[oracle@test-rac1 ~]$ crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora.DATA.dg    ora....up.type ONLINE    ONLINE    test-rac1
ora....ER.lsnr ora....er.type ONLINE    ONLINE    test-rac1
ora....N1.lsnr ora....er.type ONLINE    ONLINE    test-rac1
ora....N2.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora....N3.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora.asm        ora.asm.type   ONLINE    ONLINE    test-rac1
ora.cvu        ora.cvu.type   ONLINE    ONLINE    test-rac2
ora.gsd        ora.gsd.type   OFFLINE   OFFLINE
ora....network ora....rk.type ONLINE    ONLINE    test-rac1
ora.oc4j       ora.oc4j.type  ONLINE    ONLINE    test-rac2
ora.ons        ora.ons.type   ONLINE    ONLINE    test-rac1
ora.rac.db     ora....se.type ONLINE    ONLINE    test-rac1
ora.scan1.vip  ora....ip.type ONLINE    ONLINE    test-rac1
ora.scan2.vip  ora....ip.type ONLINE    ONLINE    test-rac2
ora.scan3.vip  ora....ip.type ONLINE    ONLINE    test-rac2
ora....SM1.asm application    ONLINE    ONLINE    test-rac1
ora....C1.lsnr application    ONLINE    ONLINE    test-rac1
ora....ac1.gsd application    OFFLINE   OFFLINE
ora....ac1.ons application    ONLINE    ONLINE    test-rac1
ora....ac1.vip ora....t1.type ONLINE    ONLINE    test-rac1
ora....SM2.asm application    ONLINE    ONLINE    test-rac2
ora....C2.lsnr application    ONLINE    ONLINE    test-rac2
ora....ac2.gsd application    OFFLINE   OFFLINE
ora....ac2.ons application    ONLINE    ONLINE    test-rac2
ora....ac2.vip ora....t1.type ONLINE    ONLINE    test-rac2
For reference, check the log files of node 1 & node 2 for a better understanding.
Node 2 – Alert log after node 1 was shut down:
SMON: enabling cache recovery Sat Apr 08 12:30:02 2017 minact-scn: Inst 2 is now the master inc#:3 mmon proc-id:6037 status:0x7 minact-scn status: grec-scn:0x0000.00000000 gmin-scn:0x0000.00000000 gcalc-scn:0x0000.00000000 minact-scn: Master returning as live inst:1 has inc# mismatch instinc:0 cur:3 errcnt:0 [6057] Successfully onlined Undo Tablespace 5. Undo initialization finished serial:0 start:4294856940 end:4294858260 diff:1320 (13 seconds) Verifying file header compatibility for 11g tablespace encryption.. 
Verifying 11g file header compatibility for tablespace encryption completed SMON: enabling tx recovery Database Characterset is AL32UTF8 No Resource Manager plan active Starting background process GTX0 Sat Apr 08 12:30:05 2017 GTX0 started with pid=35, OS id=6139 Starting background process RCBG Sat Apr 08 12:30:06 2017 RCBG started with pid=36, OS id=6141 replication_dependency_tracking turned off (no async multimaster replication found) Starting background process QMNC Sat Apr 08 12:30:07 2017 QMNC started with pid=38, OS id=6145 Sat Apr 08 12:30:10 2017 Completed: ALTER DATABASE OPEN /* db agent *//* {1:25782:2} */ Sat Apr 08 12:30:12 2017 Starting background process CJQ0 Sat Apr 08 12:30:12 2017 CJQ0 started with pid=43, OS id=6169 Sat Apr 08 12:35:08 2017 Starting background process SMCO Sat Apr 08 12:35:08 2017 SMCO started with pid=29, OS id=6669 Sat Apr 08 13:00:29 2017 Thread 2 advanced to log sequence 110 (LGWR switch) Current log# 4 seq# 110 mem# 0: +DATA/rac/onlinelog/group_4.267.937069189 Sat Apr 08 15:48:24 2017 Reconfiguration started (old inc 3, new inc 5) List of instances: 2 (myinst: 2) Global Resource Directory frozen * dead instance detected - domain 0 invalid = TRUE Communication channels reestablished Master broadcasted resource hash value bitmaps Non-local Process blocks cleaned out Sat Apr 08 15:48:24 2017 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Set master node info Submitted all remote-enqueue requests Dwn-cvts replayed, VALBLKs dubious All grantable enqueues granted Sat Apr 08 15:48:24 2017 minact-scn: master found reconf/inst-rec before recscn scan old-inc#:3 new-inc#:3 Post SMON to start 1st pass IR Sat Apr 08 15:48:24 2017 Instance recovery: looking for dead threads Beginning instance recovery of 1 threads Submitted all GCS remote-cache requests Post SMON to start 1st pass IR Fix write in gcs resources Reconfiguration complete Started redo scan Completed redo scan read 56 KB redo, 32 data blocks need recovery Started redo application at Thread 1: logseq 196, block 6940 Recovery of Online Redo Log: Thread 1 Group 2 Seq 196 Reading mem 0 Mem# 0: +DATA/rac/onlinelog/group_2.262.937068817 Completed redo application of 0.02MB Completed instance recovery at Thread 1: logseq 196, block 7053, scn 8509484 31 data blocks read, 32 data blocks written, 56 redo k-bytes read Thread 1 advanced to log sequence 197 (thread recovery) minact-scn: master continuing after IR minact-scn: Master considers inst:1 dead Sat Apr 08 15:49:24 2017 Decreasing number of real time LMS from 1 to 0
Node 1 – After the server is restarted
[oracle@test-rac1 ~]$ cd /u01/app/oracle/diag/rdbms/rac/rac1/trace/
[oracle@test-rac1 trace]$ tail -100 alert_rac1.log Sat Apr 08 16:01:44 2017 DBW0 started with pid=17, OS id=5925 Sat Apr 08 16:01:45 2017 LGWR started with pid=18, OS id=5928 Sat Apr 08 16:01:45 2017 CKPT started with pid=19, OS id=5930 Sat Apr 08 16:01:45 2017 SMON started with pid=20, OS id=5932 Sat Apr 08 16:01:45 2017 RECO started with pid=21, OS id=5934 Sat Apr 08 16:01:45 2017 RBAL started with pid=22, OS id=5937 Sat Apr 08 16:01:45 2017 ASMB started with pid=23, OS id=5939 Sat Apr 08 16:01:45 2017 MMON started with pid=24, OS id=5941 starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'... Sat Apr 08 16:01:45 2017 MMNL started with pid=25, OS id=5944 NOTE: initiating MARK startup starting up 1 shared server(s) ... Starting background process MARK Sat Apr 08 16:01:45 2017 MARK started with pid=27, OS id=5949 NOTE: MARK has subscribed lmon registered with NM - instance number 1 (internal mem no 0) Reconfiguration started (old inc 0, new inc 11) List of instances: 1 2 (myinst: 1) Global Resource Directory frozen * allocate domain 0, invalid = TRUE Communication channels reestablished * domain 0 valid according to instance 2 * domain 0 valid = 1 according to instance 2 Master broadcasted resource hash value bitmaps Non-local Process blocks cleaned out LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Set master node info Submitted all remote-enqueue requests Dwn-cvts replayed, VALBLKs dubious All grantable enqueues granted Submitted all GCS remote-cache requests Fix write in gcs resources Reconfiguration complete Sat Apr 08 16:01:46 2017 LCK0 started with pid=30, OS id=5957 Starting background process RSMN Sat Apr 08 16:01:46 2017 RSMN started with pid=31, OS id=5959 ORACLE_BASE not set in environment. It is recommended that ORACLE_BASE be set in the environment Sat Apr 08 16:01:47 2017 ALTER SYSTEM SET local_listener=' (ADDRESS=(PROTOCOL=TCP)(HOST=10.20.0.92)(PORT=1521))' SCOPE=MEMORY SID='rac1'; ALTER DATABASE MOUNT /* db agent *//* {1:64177:5} */ NOTE: Loaded library: /opt/oracle/extapi/64/asm/orcl/1/libasm.so NOTE: Loaded library: System SUCCESS: diskgroup DATA was mounted NOTE: dependency between database rac and diskgroup resource ora.DATA.dg is established Successful mount of redo thread 1, with mount id 2528695933 Database mounted in Shared Mode (CLUSTER_DATABASE=TRUE) Lost write protection disabled Completed: ALTER DATABASE MOUNT /* db agent *//* {1:64177:5} */ ALTER DATABASE OPEN /* db agent *//* {1:64177:5} */ Picked broadcast on commit scheme to generate SCNs Thread 1 opened at log sequence 198 Current log# 2 seq# 198 mem# 0: +DATA/rac/onlinelog/group_2.262.937068817 Successful open of redo thread 1 MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set SMON: enabling cache recovery [5961] Successfully onlined Undo Tablespace 2. Undo initialization finished serial:0 start:4294800940 end:4294801440 diff:500 (5 seconds) Verifying file header compatibility for 11g tablespace encryption.. 
Verifying 11g file header compatibility for tablespace encryption completed Sat Apr 08 16:01:55 2017 SMON: enabling tx recovery Database Characterset is AL32UTF8 No Resource Manager plan active Starting background process GTX0 Sat Apr 08 16:01:55 2017 GTX0 started with pid=35, OS id=6006 Starting background process RCBG Sat Apr 08 16:01:56 2017 RCBG started with pid=36, OS id=6011 replication_dependency_tracking turned off (no async multimaster replication found) Starting background process QMNC Sat Apr 08 16:01:56 2017 QMNC started with pid=37, OS id=6033 Sat Apr 08 16:01:58 2017 Completed: ALTER DATABASE OPEN /* db agent *//* {1:64177:5} */ Sat Apr 08 16:01:59 2017 minact-scn: Inst 1 is a slave inc#:11 mmon proc-id:5941 status:0x2 minact-scn status: grec-scn:0x0000.00000000 gmin-scn:0x0000.00000000 gcalc-scn:0x0000.00000000 Sat Apr 08 16:01:59 2017 Starting background process CJQ0 Sat Apr 08 16:02:00 2017 CJQ0 started with pid=44, OS id=6066 Sat Apr 08 16:06:58 2017 Starting background process SMCO Sat Apr 08 16:06:58 2017 SMCO started with pid=29, OS id=6514
2 – Instance Failure
Start the workload
Shut down the instance (shutdown abort, or kill the PMON process)
Expected results (a small sketch for checking committed transactions follows this list):
- Instance recovery will be performed
- The surviving instance will read the online redo log files of the failed instance and ensure that committed transactions are recorded in the database
- If several instances fail, one surviving instance will perform recovery for all of them
- Services will be moved to an available instance
- Client connections are moved to the surviving instances
- The failed instance will be restarted automatically by the clusterware
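One simple way to check the second point (committed work survives the crash) is to commit a marker row on instance 1 just before killing PMON and read it back from instance 2 after recovery. This is only a sketch; the table failover_probe is a made-up test object, not something from the original test, and in a real test you would use a dedicated test schema rather than SYS.
# on node 1, before killing PMON: commit a marker row
sqlplus -s / as sysdba <<'EOF'
create table failover_probe (id number, note varchar2(40));
insert into failover_probe values (1, 'committed before crash');
commit;
EOF

# on node 2, after instance recovery: the committed row must be visible,
# while any uncommitted work from instance 1 has been rolled back
sqlplus -s / as sysdba <<'EOF'
select * from failover_probe;
EOF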
Node 1
[oracle@test-rac1 ~]$ ps -ef | grep pmon
oracle    5505     1  0 16:01 ?        00:00:00 asm_pmon_+ASM1
oracle    5867     1  0 16:01 ?        00:00:00 ora_pmon_rac1
oracle    7859  6007  0 16:22 pts/0    00:00:00 grep pmon
[oracle@test-rac1 ~]$ kill -9 5867
[oracle@test-rac1 ~]$ ps -ef | grep pmon
oracle    5505     1  0 16:01 ?        00:00:00 asm_pmon_+ASM1
oracle    7928  6007  0 16:23 pts/0    00:00:00 grep pmon
$ sqlplus / as sysdba
SQL> select instance_name, status from v$instance;
Oracle not available
SQL> exit
=======
Node 1 – Database alert log after killing the PMON process; here the instance is shut down completely
=======
[oracle@test-rac1 trace]$ tail -100 alert_rac1.log Sat Apr 08 16:25:49 2017 DBW0 started with pid=17, OS id=8407 Sat Apr 08 16:25:49 2017 LGWR started with pid=18, OS id=8409 Sat Apr 08 16:25:49 2017 CKPT started with pid=19, OS id=8411 Sat Apr 08 16:25:49 2017 SMON started with pid=20, OS id=8413 Sat Apr 08 16:25:49 2017 RECO started with pid=21, OS id=8415 Sat Apr 08 16:25:49 2017 RBAL started with pid=22, OS id=8417 Sat Apr 08 16:25:49 2017 ASMB started with pid=23, OS id=8419 Sat Apr 08 16:25:50 2017 MMON started with pid=24, OS id=8421 NOTE: initiating MARK startup starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'... Starting background process MARK Sat Apr 08 16:25:50 2017 MMNL started with pid=25, OS id=8425 Sat Apr 08 16:25:50 2017 MARK started with pid=26, OS id=8427 NOTE: MARK has subscribed starting up 1 shared server(s) ... lmon registered with NM - instance number 1 (internal mem no 0) Reconfiguration started (old inc 0, new inc 19) List of instances: 1 2 (myinst: 1) Global Resource Directory frozen * allocate domain 0, invalid = TRUE Communication channels reestablished * domain 0 valid according to instance 2 * domain 0 valid = 1 according to instance 2 Master broadcasted resource hash value bitmaps Non-local Process blocks cleaned out LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Set master node info Submitted all remote-enqueue requests Dwn-cvts replayed, VALBLKs dubious All grantable enqueues granted Submitted all GCS remote-cache requests Fix write in gcs resources Reconfiguration complete Sat Apr 08 16:25:51 2017 LCK0 started with pid=30, OS id=8439 Starting background process RSMN Sat Apr 08 16:25:51 2017 RSMN started with pid=31, OS id=8441 ORACLE_BASE not set in environment. It is recommended that ORACLE_BASE be set in the environment Sat Apr 08 16:25:52 2017 ALTER SYSTEM SET local_listener=' (ADDRESS=(PROTOCOL=TCP)(HOST=10.20.0.92)(PORT=1521))' SCOPE=MEMORY SID='rac1'; ALTER DATABASE MOUNT /* db agent *//* {0:1:7} */ NOTE: Loaded library: /opt/oracle/extapi/64/asm/orcl/1/libasm.so NOTE: Loaded library: System SUCCESS: diskgroup DATA was mounted NOTE: dependency between database rac and diskgroup resource ora.DATA.dg is established Successful mount of redo thread 1, with mount id 2528695933 Database mounted in Shared Mode (CLUSTER_DATABASE=TRUE) Lost write protection disabled Completed: ALTER DATABASE MOUNT /* db agent *//* {0:1:7} */ ALTER DATABASE OPEN /* db agent *//* {0:1:7} */ Picked broadcast on commit scheme to generate SCNs Thread 1 opened at log sequence 200 Current log# 2 seq# 200 mem# 0: +DATA/rac/onlinelog/group_2.262.937068817 Successful open of redo thread 1 MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set SMON: enabling cache recovery minact-scn: Inst 1 is a slave inc#:19 mmon proc-id:8421 status:0x2 minact-scn status: grec-scn:0x0000.00000000 gmin-scn:0x0000.00000000 gcalc-scn:0x0000.00000000 [8443] Successfully onlined Undo Tablespace 2. Undo initialization finished serial:0 start:1278034 end:1278424 diff:390 (3 seconds) Verifying file header compatibility for 11g tablespace encryption.. 
Verifying 11g file header compatibility for tablespace encryption completed SMON: enabling tx recovery Database Characterset is AL32UTF8 No Resource Manager plan active Starting background process GTX0 Sat Apr 08 16:25:59 2017 GTX0 started with pid=35, OS id=8465 Starting background process RCBG Sat Apr 08 16:26:00 2017 RCBG started with pid=36, OS id=8467 replication_dependency_tracking turned off (no async multimaster replication found) Starting background process QMNC Sat Apr 08 16:26:00 2017 QMNC started with pid=37, OS id=8469 Completed: ALTER DATABASE OPEN /* db agent *//* {0:1:7} */ Sat Apr 08 16:26:02 2017 Starting background process CJQ0 Sat Apr 08 16:26:02 2017 CJQ0 started with pid=42, OS id=8496 Sat Apr 08 16:27:27 2017 Shutting down instance (abort) License high water mark = 4 USER (ospid: 8600): terminating the instance Instance terminated by USER, pid = 8600 Sat Apr 08 16:27:28 2017 Instance shutdown complete
===========
Node 2
===========
Check the alert log on node 2, where the failed node 1 instance is recovered automatically:
[oracle@test-rac2 trace]$ tail -100 alert_rac2.log Instance recovery: looking for dead threads Beginning instance recovery of 1 threads Submitted all GCS remote-cache requests Post SMON to start 1st pass IR Started redo scan Fix write in gcs resources Reconfiguration complete Completed redo scan read 180 KB redo, 54 data blocks need recovery Started redo application at Thread 1: logseq 203, block 107 Recovery of Online Redo Log: Thread 1 Group 1 Seq 203 Reading mem 0 Mem# 0: +DATA/rac/onlinelog/group_1.261.937068817 Completed redo application of 0.04MB Completed instance recovery at Thread 1: logseq 203, block 468, scn 8677383 53 data blocks read, 54 data blocks written, 180 redo k-bytes read Thread 1 advanced to log sequence 204 (thread recovery) Sat Apr 08 16:44:08 2017 minact-scn: Master considers inst:1 dead Sat Apr 08 16:45:06 2017 Decreasing number of real time LMS from 1 to 0 Sat Apr 08 16:47:32 2017 Reconfiguration started (old inc 33, new inc 35) List of instances: 1 2 (myinst: 2) Global Resource Directory frozen Communication channels reestablished Sat Apr 08 16:47:32 2017 * domain 0 valid = 1 according to instance 1 Master broadcasted resource hash value bitmaps Non-local Process blocks cleaned out Sat Apr 08 16:47:32 2017 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Set master node info Submitted all remote-enqueue requests Dwn-cvts replayed, VALBLKs dubious All grantable enqueues granted Submitted all GCS remote-cache requests Fix write in gcs resources Reconfiguration complete Sat Apr 08 16:47:35 2017 minact-scn: Master returning as live inst:1 has inc# mismatch instinc:0 cur:35 errcnt:0 Sat Apr 08 16:48:30 2017 Dumping diagnostic data in directory=[cdmp_20170408164830], requested by (instance=1, osid=10869 (LMD0)), summary=[abnormal instance termination]. 
Sat Apr 08 16:48:31 2017 Reconfiguration started (old inc 35, new inc 37) List of instances: 2 (myinst: 2) Global Resource Directory frozen * dead instance detected - domain 0 invalid = TRUE Communication channels reestablished Master broadcasted resource hash value bitmaps Non-local Process blocks cleaned out Sat Apr 08 16:48:31 2017 LMS 0: 1 GCS shadows cancelled, 1 closed, 0 Xw survived Set master node info Submitted all remote-enqueue requests Dwn-cvts replayed, VALBLKs dubious All grantable enqueues granted Post SMON to start 1st pass IR Sat Apr 08 16:48:31 2017 Instance recovery: looking for dead threads Beginning instance recovery of 1 threads Started redo scan Submitted all GCS remote-cache requests Post SMON to start 1st pass IR Fix write in gcs resources Reconfiguration complete Completed redo scan read 56 KB redo, 47 data blocks need recovery Started redo application at Thread 1: logseq 204, block 2, scn 8677834 Recovery of Online Redo Log: Thread 1 Group 2 Seq 204 Reading mem 0 Mem# 0: +DATA/rac/onlinelog/group_2.262.937068817 Completed redo application of 0.03MB Completed instance recovery at Thread 1: logseq 204, block 115, scn 8699281 35 data blocks read, 47 data blocks written, 56 redo k-bytes read Thread 1 advanced to log sequence 205 (thread recovery) Sat Apr 08 16:48:32 2017 minact-scn: Master considers inst:1 dead Reconfiguration started (old inc 37, new inc 39) List of instances: 1 2 (myinst: 2) Global Resource Directory frozen Communication channels reestablished Sat Apr 08 16:48:41 2017 * domain 0 valid = 1 according to instance 1 Master broadcasted resource hash value bitmaps Non-local Process blocks cleaned out LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Set master node info Submitted all remote-enqueue requests Dwn-cvts replayed, VALBLKs dubious All grantable enqueues granted Submitted all GCS remote-cache requests Fix write in gcs resources Reconfiguration complete minact-scn: Master returning as live inst:1 has inc# mismatch instinc:0 cur:39 errcnt:0
3 – ASM Instance Failure
Start the workload
- Kill the PMON Process of the ASM Instance
Expected result:
- The ASM resource will go offline and will be restarted automatically by the clusterware, & the database instance on that node will be shut down abnormally
- ASM instance recovery will be performed by reading the disk group log
- Client connections are moved to the surviving instances
- Services will be moved to an available instance
Example:
[oracle@test-rac1 ~]$ ps -ef | grep pmon
oracle    5505      1  0 Apr08 ?        00:00:11 asm_pmon_+ASM1
oracle   11104      1  0 Apr08 ?        00:00:13 ora_pmon_rac1
oracle   19210  19169  0 11:49 pts/0    00:00:00 grep pmon
Kill the ASM PMON process, as sketched below, and meanwhile watch the ASM alert log at the location shown afterwards:
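A sketch of the kill and the follow-up checks (the PID 5505 is the one from the ps output above and will differ on your system):
kill -9 5505                           # ASM PMON on node 1; the ASM instance terminates immediately
srvctl status asm -n test-rac1         # should come back as running once the agent restarts +ASM1
srvctl status instance -d rac -i rac1  # the dependent database instance rac1 is also restarted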
[oracle@test-rac1 trace]$ cd /u01/app/oracle/diag/asm/+asm/+ASM1/trace
[oracle@test-rac1 trace]$ pwd
/u01/app/oracle/diag/asm/+asm/+ASM1/trace
Check the alert log file on node 1:
[oracle@test-rac1 trace]$ tail -f alert_+ASM1.log NOTE: check client alert log. NOTE: Trace records dumped in trace file /u01/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_9969.trc Sat Apr 08 16:47:32 2017 NOTE: client rac1:rac registered, osid 10900, mbr 0x1 Sat Apr 08 16:48:31 2017 NOTE: ASM client rac1:rac disconnected unexpectedly. NOTE: check client alert log. NOTE: Trace records dumped in trace file /u01/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_10900.trc Sat Apr 08 16:48:40 2017 NOTE: client rac1:rac registered, osid 11154, mbr 0x1 Sun Apr 09 11:59:05 2017 LMON (ospid: 5521): terminating the instance due to error 472 Sun Apr 09 11:59:05 2017 System state dump requested by (instance=1, osid=5521 (LMON)), summary=[abnormal instance termination]. System State dumped to trace file /u01/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_diag_5515_20170409115905.trc Dumping diagnostic data in directory=[cdmp_20170409115905], requested by (instance=1, osid=5521 (LMON)), summary=[abnormal instance termination]. Instance terminated by LMON, pid = 5521 Sun Apr 09 11:59:08 2017 MEMORY_TARGET defaulting to 1128267776. * instance_number obtained from CSS = 1, checking for the existence of node 0... * node 0 does not exist. instance_number = 1 Starting ORACLE instance (normal) LICENSE_MAX_SESSION = 0 LICENSE_SESSIONS_WARNING = 0 Initial number of CPU is 1 Private Interface 'eth2:1' configured from GPnP for use as a private interconnect. [name='eth2:1', type=1, ip=169.254.35.129, mac=00-50-56-b0-38-83, net=169.254.0.0/16, mask=255.255.0.0, use=haip:cluster_interconnect/62] Public Interface 'eth1' configured from GPnP for use as a public interface. [name='eth1', type=1, ip=10.20.0.90, mac=00-50-56-b0-77-f1, net=10.20.0.0/24, mask=255.255.255.0, use=public/1] Public Interface 'eth1:1' configured from GPnP for use as a public interface. [name='eth1:1', type=1, ip=10.20.0.92, mac=00-50-56-b0-77-f1, net=10.20.0.0/24, mask=255.255.255.0, use=public/1] Public Interface 'eth1:2' configured from GPnP for use as a public interface. [name='eth1:2', type=1, ip=10.20.0.94, mac=00-50-56-b0-77-f1, net=10.20.0.0/24, mask=255.255.255.0, use=public/1] CELL communication is configured to use 0 interface(s): CELL IP affinity details: NUMA status: non-NUMA system cellaffinity.ora status: N/A CELL communication will use 1 IP group(s): Grp 0: Picked latch-free SCN scheme 3 Using LOG_ARCHIVE_DEST_1 parameter default value as /u01/app/11.2.0/grid/dbs/arch Autotune of undo retention is turned on. 
LICENSE_MAX_USERS = 0 SYS auditing is disabled Starting up: Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production With the Real Application Clusters and Automatic Storage Management options ORACLE_HOME = /u01/app/11.2.0/grid System name: Linux Node name: test-rac1.local Release: 3.8.13-68.3.4.el6uek.x86_64 Version: #2 SMP Tue Jul 14 15:03:36 PDT 2015 Machine: x86_64 VM name: VMWare Version: 6 Using parameter settings in server-side spfile +DATA/ractest-scan/asmparameterfile/registry.253.937065775 System parameters with non-default values: large_pool_size = 12M instance_type = "asm" remote_login_passwordfile= "EXCLUSIVE" asm_power_limit = 1 diagnostic_dest = "/u01/app/oracle" Cluster communication is configured to use the following interface(s) for this instance 169.254.35.129 cluster interconnect IPC version:Oracle UDP/IP (generic) IPC Vendor 1 proto 2 Sun Apr 09 11:59:09 2017 PMON started with pid=2, OS id=20252 Sun Apr 09 11:59:09 2017 PSP0 started with pid=3, OS id=20254 Sun Apr 09 11:59:10 2017 VKTM started with pid=4, OS id=20256 at elevated priority VKTM running at (1)millisec precision with DBRM quantum (100)ms Sun Apr 09 11:59:10 2017 GEN0 started with pid=5, OS id=20260 Sun Apr 09 11:59:10 2017 DIAG started with pid=6, OS id=20262 Sun Apr 09 11:59:10 2017 PING started with pid=7, OS id=20264 Sun Apr 09 11:59:10 2017 DIA0 started with pid=8, OS id=20266 Sun Apr 09 11:59:10 2017 LMON started with pid=9, OS id=20268 Sun Apr 09 11:59:10 2017 LMD0 started with pid=10, OS id=20270 * Load Monitor used for high load check * New Low - High Load Threshold Range = [960 - 1280] Sun Apr 09 11:59:10 2017 LMS0 started with pid=11, OS id=20272 at elevated priority Sun Apr 09 11:59:10 2017 LMHB started with pid=12, OS id=20276 Sun Apr 09 11:59:10 2017 MMAN started with pid=13, OS id=20278 Sun Apr 09 11:59:10 2017 DBW0 started with pid=14, OS id=20280 Sun Apr 09 11:59:10 2017 LGWR started with pid=15, OS id=20282 Sun Apr 09 11:59:10 2017 CKPT started with pid=16, OS id=20284 Sun Apr 09 11:59:10 2017 SMON started with pid=17, OS id=20286 Sun Apr 09 11:59:10 2017 BAL started with pid=18, OS id=20288 un Apr 09 11:59:10 2017 GMON started with pid=19, OS id=20290 Sun Apr 09 11:59:10 2017 MMON started with pid=20, OS id=20292 Sun Apr 09 11:59:10 2017 MMNL started with pid=21, OS id=20294 lmon registered with NM - instance number 1 (internal mem no 0) Reconfiguration started (old inc 0, new inc 16) ASM instance List of instances: 1 2 (myinst: 1) Global Resource Directory frozen * allocate domain 0, invalid = TRUE Communication channels reestablished * allocate domain 1, invalid = TRUE * domain 0 valid = 1 according to instance 2 * domain 1 valid = 1 according to instance 2 Master broadcasted resource hash value bitmaps Non-local Process blocks cleaned out LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Set master node info Submitted all remote-enqueue requests Dwn-cvts replayed, VALBLKs dubious All grantable enqueues granted Submitted all GCS remote-cache requests Fix write in gcs resources Reconfiguration complete Sun Apr 09 11:59:11 2017 LCK0 started with pid=22, OS id=20297 ORACLE_BASE not set in environment. 
It is recommended that ORACLE_BASE be set in the environment Sun Apr 09 11:59:12 2017 SQL> ALTER DISKGROUP ALL MOUNT /* asm agent call crs *//* {0:9:5} */ NOTE: Diskgroup used for Voting files is: DATA Diskgroup with spfile:DATA Diskgroup used for OCR is:DATA NOTE: cache registered group DATA number=1 incarn=0x1156f512 NOTE: cache began mount (not first) of group DATA number=1 incarn=0x1156f512 NOTE: Loaded library: /opt/oracle/extapi/64/asm/orcl/1/libasm.so NOTE: Assigning number (1,0) to disk (ORCL:DISK1) NOTE: Assigning number (1,1) to disk (ORCL:DISK2) NOTE: Assigning number (1,2) to disk (ORCL:DISK3) GMON querying group 1 at 2 for pid 23, osid 20303 NOTE: cache opening disk 0 of grp 1: DISK1 label:DISK1 NOTE: F1X0 found on disk 0 au 2 fcn 0.0 NOTE: cache opening disk 1 of grp 1: DISK2 label:DISK2 NOTE: cache opening disk 2 of grp 1: DISK3 label:DISK3 NOTE: cache mounting (not first) external redundancy group 1/0x1156F512 (DATA) kjbdomatt send to inst 2 NOTE: attached to recovery domain 1 NOTE: redo buffer size is 256 blocks (1053184 bytes) NOTE: LGWR attempting to mount thread 1 for diskgroup 1 (DATA) Process LGWR (pid 20282) is running at high priority QoS for Exadata I/O NOTE: LGWR found thread 1 closed at ABA 9.741 NOTE: LGWR mounted thread 1 for diskgroup 1 (DATA) NOTE: LGWR opening thread 1 at fcn 0.5550 ABA 10.742 NOTE: cache mounting group 1/0x1156F512 (DATA) succeeded NOTE: cache ending mount (success) of group DATA number=1 incarn=0x1156f512 NOTE: Instance updated compatible.asm to 11.2.0.0.0 for grp 1 SUCCESS: diskgroup DATA was mounted SUCCESS: ALTER DISKGROUP ALL MOUNT /* asm agent call crs *//* {0:9:5} */ SQL> ALTER DISKGROUP ALL ENABLE VOLUME ALL /* asm agent *//* {0:9:5} */ SUCCESS: ALTER DISKGROUP ALL ENABLE VOLUME ALL /* asm agent *//* {0:9:5} */ NOTE: diskgroup resource ora.DATA.dg is online Sun Apr 09 11:59:13 2017 ALTER SYSTEM SET local_listener=' (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=10.20.0.92)(PORT=1521))))' SCOPE=MEMORY SID='+ASM1'; NOTE: Attempting voting file refresh on diskgroup DATA NOTE: Refresh completed on diskgroup DATA . Found 1 voting file(s). NOTE: Voting file relocation is required in diskgroup DATA NOTE: Attempting voting file relocation on diskgroup DATA NOTE: Successful voting file relocation on diskgroup DATA Sun Apr 09 11:59:17 2017 Starting background process ASMB Sun Apr 09 11:59:17 2017 ASMB started with pid=26, OS id=20345 Sun Apr 09 11:59:17 2017 NOTE: client +ASM1:+ASM registered, osid 20347, mbr 0x0 Sun Apr 09 11:59:19 2017 NOTE: client rac1:rac registered, osid 20411, mbr 0x1
The recovery of the ASM instance starts from node 2, as shown below in the alert log of node 2:
[oracle@test-rac2 trace]$ tail -f alert_+ASM2.log LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Set master node info Submitted all remote-enqueue requests Dwn-cvts replayed, VALBLKs dubious All grantable enqueues granted Submitted all GCS remote-cache requests Fix write in gcs resources Reconfiguration complete Sun Apr 09 11:59:05 2017 Dumping diagnostic data in directory=[cdmp_20170409115905], requested by (instance=1, osid=5521 (LMON)), summary=[abnormal instance termination]. un Apr 09 11:59:07 2017 econfiguration started (old inc 12, new inc 14) List of instances: 2 (myinst: 2) Global Resource Directory frozen * dead instance detected - domain 1 invalid = TRUE Communication channels reestablished Master broadcasted resource hash value bitmaps Non-local Process blocks cleaned out Sun Apr 09 11:59:07 2017 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Set master node info Submitted all remote-enqueue requests Dwn-cvts replayed, VALBLKs dubious All grantable enqueues granted Post SMON to start 1st pass IR Submitted all GCS remote-cache requests Post SMON to start 1st pass IR Fix write in gcs resources Reconfiguration complete Sun Apr 09 11:59:07 2017 NOTE: SMON starting instance recovery for group DATA domain 1 (mounted) NOTE: F1X0 found on disk 0 au 2 fcn 0.0 NOTE: starting recovery of thread=1 ckpt=9.742 group=1 (DATA) NOTE: SMON waiting for thread 1 recovery enqueue NOTE: SMON about to begin recovery lock claims for diskgroup 1 (DATA) NOTE: SMON successfully validated lock domain 1 NOTE: advancing ckpt for group 1 (DATA) thread=1 ckpt=9.742 NOTE: SMON did instance recovery for group DATA domain 1 Reconfiguration started (old inc 14, new inc 16) List of instances: 1 2 (myinst: 2) Global Resource Directory frozen Communication channels reestablished Sun Apr 09 11:59:11 2017 * domain 0 valid = 1 according to instance 1 * domain 1 valid = 1 according to instance 1 Master broadcasted resource hash value bitmaps Non-local Process blocks cleaned out LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Set master node info Submitted all remote-enqueue requests Dwn-cvts replayed, VALBLKs dubious All grantable enqueues granted Submitted all GCS remote-cache requests Fix write in gcs resources Reconfiguration complete
Hence, on an ASM instance failure, recovery is performed automatically by node 2, and all the ASM disks are mounted & the instance started again automatically.
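To confirm the recovery, a couple of checks can be run once the ASM instance is back. This is just a sketch; asmcmd needs the grid environment set (ORACLE_SID=+ASM1 and the grid home).
srvctl status diskgroup -g DATA   # the DATA diskgroup resource should be running on both nodes again
asmcmd lsdg                       # DATA should be listed as MOUNTED with the expected free space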
4 – Local & SCAN Listener Failure
Kill local listener process
Kill the SCAN listener
Expected result (a test sketch follows this list):
- New connections will be redirected to the second listener
- The listener failure will be detected by CRSD & the listener will be restarted automatically
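A rough test sketch for this scenario (the PIDs are placeholders you have to read from the ps output; the listener names LISTENER and LISTENER_SCAN1 are the 11.2 defaults):
ps -ef | grep tnslsnr | grep -v grep   # find the local and SCAN listener processes
kill -9 <pid_of_LISTENER>              # kill the local listener
kill -9 <pid_of_LISTENER_SCAN1>        # kill a SCAN listener
# the clusterware agent should bring both back within a few seconds:
srvctl status listener -n test-rac1
srvctl status scan_listener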
5 – Public Network Failure
Unplug the public network cable, or bring the public network interface down at the OS level
Expected Result:
- The VIP & SCAN VIPs will fail over to the surviving node
- The DB instance will stay up & the DB services will fail over to the surviving node
- If TAF is configured, clients should fail over to an available instance
Now we are going to bring the public network interface down & observe the changes:
[root@test-rac1 ~]# /sbin/ip link set eth1 down
In this case, once the public network is down, all client connections to node 1 are disconnected & the server cannot be reached over the public network until the interface is brought back up.
To bring the interface back up:
[root@test-rac1 ~]# /sbin/ip link set eth1 up
After the failure, the SCAN & VIP network services are redirected to node 2 automatically.
The database on node 1 remains open, but the listener is unreachable on node 1 only; on node 2 both the database & listener are up & running.
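The relocation can be confirmed from the surviving node with a few status commands (a sketch, using the same resource names as above):
# run from test-rac2 while eth1 on node 1 is down
srvctl status vip -n test-rac1   # the node 1 VIP should be reported as running on test-rac2
srvctl status scan_listener      # all SCAN listeners on the surviving node
srvctl status nodeapps           # overall picture of VIPs, network and ONS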
Status before the network failure:
[oracle@test-rac1 trace]$ crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora.DATA.dg    ora....up.type ONLINE    ONLINE    test-rac1
ora....ER.lsnr ora....er.type ONLINE    ONLINE    test-rac1
ora....N1.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora....N2.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora....N3.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora.asm        ora.asm.type   ONLINE    ONLINE    test-rac1
ora.cvu        ora.cvu.type   ONLINE    ONLINE    test-rac2
ora.gsd        ora.gsd.type   OFFLINE   OFFLINE
ora....network ora....rk.type ONLINE    ONLINE    test-rac1
ora.oc4j       ora.oc4j.type  ONLINE    ONLINE    test-rac2
ora.ons        ora.ons.type   ONLINE    ONLINE    test-rac1
ora.rac.db     ora....se.type ONLINE    ONLINE    test-rac1
ora.scan1.vip  ora....ip.type ONLINE    ONLINE    test-rac2
ora.scan2.vip  ora....ip.type ONLINE    ONLINE    test-rac2
ora.scan3.vip  ora....ip.type ONLINE    ONLINE    test-rac2
ora....SM1.asm application    ONLINE    ONLINE    test-rac1
ora....C1.lsnr application    ONLINE    ONLINE    test-rac1
ora....ac1.gsd application    OFFLINE   OFFLINE
ora....ac1.ons application    ONLINE    ONLINE    test-rac1
ora....ac1.vip ora....t1.type ONLINE    ONLINE    test-rac1
ora....SM2.asm application    ONLINE    ONLINE    test-rac2
ora....C2.lsnr application    ONLINE    ONLINE    test-rac2
ora....ac2.gsd application    OFFLINE   OFFLINE
ora....ac2.ons application    ONLINE    ONLINE    test-rac2
ora....ac2.vip ora....t1.type ONLINE    ONLINE    test-rac2
Status after the network failure:
[oracle@test-rac2 ~]$ crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora.DATA.dg    ora....up.type ONLINE    ONLINE    test-rac1
ora....ER.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora....N1.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora....N2.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora....N3.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora.asm        ora.asm.type   ONLINE    ONLINE    test-rac1
ora.cvu        ora.cvu.type   ONLINE    ONLINE    test-rac2
ora.gsd        ora.gsd.type   OFFLINE   OFFLINE
ora....network ora....rk.type ONLINE    ONLINE    test-rac2
ora.oc4j       ora.oc4j.type  ONLINE    ONLINE    test-rac2
ora.ons        ora.ons.type   ONLINE    ONLINE    test-rac2
ora.rac.db     ora....se.type ONLINE    ONLINE    test-rac1
ora.scan1.vip  ora....ip.type ONLINE    ONLINE    test-rac2
ora.scan2.vip  ora....ip.type ONLINE    ONLINE    test-rac2
ora.scan3.vip  ora....ip.type ONLINE    ONLINE    test-rac2
ora....SM1.asm application    ONLINE    ONLINE    test-rac1
ora....C1.lsnr application    ONLINE    OFFLINE
ora....ac1.gsd application    OFFLINE   OFFLINE
ora....ac1.ons application    ONLINE    OFFLINE
ora....ac1.vip ora....t1.type ONLINE    ONLINE    test-rac2
ora....SM2.asm application    ONLINE    ONLINE    test-rac2
ora....C2.lsnr application    ONLINE    ONLINE    test-rac2
ora....ac2.gsd application    OFFLINE   OFFLINE
ora....ac2.ons application    ONLINE    ONLINE    test-rac2
ora....ac2.vip ora....t1.type ONLINE    ONLINE    test-rac2
6 – Private Network Failure
Unplug the private network cable, or bring the private interconnect interface down at the OS level
Expected Result:
- The private interconnect is critical, so its failure is the most severe case
- CSSD will detect a split-brain situation; the node with the lowest node number will survive & the second node will be evicted
- The CRS, ASM & DB instances on the evicted node will shut down
- All of its processes will be terminated; if that fails, the node will be rebooted
- After the interconnect is reconnected, the CRS stack & resources will be started
In this case the private network interface of node 1 is brought down & node 2 is rebooted automatically. Checking the status of the resources from node 1 afterwards shows that all the resources have failed over to node 1.
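While running this test it is handy to watch cluster membership from the surviving node. A sketch (the ocssd.log path assumes the grid home /u01/app/11.2.0/grid shown in the logs above and the short hostname as the log directory; adjust for your system):
olsnodes -n -s                                             # test-rac2 should flip from Active to Inactive when it is evicted
crsctl check cluster -all                                  # per-node state of CRS, CSS and EVM (run as root)
tail -f /u01/app/11.2.0/grid/log/test-rac1/cssd/ocssd.log  # the eviction itself is recorded here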
Before the failure:
[oracle@test-rac1 trace]$ crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora.DATA.dg    ora....up.type ONLINE    ONLINE    test-rac1
ora....ER.lsnr ora....er.type ONLINE    ONLINE    test-rac1
ora....N1.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora....N2.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora....N3.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora.asm        ora.asm.type   ONLINE    ONLINE    test-rac1
ora.cvu        ora.cvu.type   ONLINE    ONLINE    test-rac2
ora.gsd        ora.gsd.type   OFFLINE   OFFLINE
ora....network ora....rk.type ONLINE    ONLINE    test-rac1
ora.oc4j       ora.oc4j.type  ONLINE    ONLINE    test-rac2
ora.ons        ora.ons.type   ONLINE    ONLINE    test-rac1
ora.rac.db     ora....se.type ONLINE    ONLINE    test-rac1
ora.scan1.vip  ora....ip.type ONLINE    ONLINE    test-rac2
ora.scan2.vip  ora....ip.type ONLINE    ONLINE    test-rac2
ora.scan3.vip  ora....ip.type ONLINE    ONLINE    test-rac2
ora....SM1.asm application    ONLINE    ONLINE    test-rac1
ora....C1.lsnr application    ONLINE    ONLINE    test-rac1
ora....ac1.gsd application    OFFLINE   OFFLINE
ora....ac1.ons application    ONLINE    ONLINE    test-rac1
ora....ac1.vip ora....t1.type ONLINE    ONLINE    test-rac1
ora....SM2.asm application    ONLINE    ONLINE    test-rac2
ora....C2.lsnr application    ONLINE    ONLINE    test-rac2
ora....ac2.gsd application    OFFLINE   OFFLINE
ora....ac2.ons application    ONLINE    ONLINE    test-rac2
ora....ac2.vip ora....t1.type ONLINE    ONLINE    test-rac2
Now bring the private interconnect interface down on node 1:
[root@test-rac1 ~]# /sbin/ip link set eth2 down
[oracle@test-rac2 ~]$ crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora.DATA.dg    ora....up.type ONLINE    ONLINE    test-rac1
ora....ER.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora....N1.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora....N2.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora....N3.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora.asm        ora.asm.type   ONLINE    ONLINE    test-rac1
ora.cvu        ora.cvu.type   ONLINE    ONLINE    test-rac2
ora.gsd        ora.gsd.type   OFFLINE   OFFLINE
ora....network ora....rk.type ONLINE    ONLINE    test-rac2
ora.oc4j       ora.oc4j.type  ONLINE    ONLINE    test-rac2
ora.ons        ora.ons.type   ONLINE    ONLINE    test-rac2
ora.rac.db     ora....se.type ONLINE    ONLINE    test-rac1
ora.scan1.vip  ora....ip.type ONLINE    ONLINE    test-rac2
ora.scan2.vip  ora....ip.type ONLINE    ONLINE    test-rac2
ora.scan3.vip  ora....ip.type ONLINE    ONLINE    test-rac2
ora....SM1.asm application    ONLINE    ONLINE    test-rac1
ora....C1.lsnr application    ONLINE    OFFLINE
ora....ac1.gsd application    OFFLINE   OFFLINE
ora....ac1.ons application    ONLINE    OFFLINE
ora....ac1.vip ora....t1.type ONLINE    ONLINE    test-rac2
ora....SM2.asm application    ONLINE    ONLINE    test-rac2
ora....C2.lsnr application    ONLINE    ONLINE    test-rac2
ora....ac2.gsd application    OFFLINE   OFFLINE
ora....ac2.ons application    ONLINE    ONLINE    test-rac2
ora....ac2.vip ora....t1.type ONLINE    ONLINE    test-rac2
If the node 1 private network is down, the node 2 database instance & network resources fail. The node 1 database is up but its listener is down; after the evicted server gets rebooted, everything will be started automatically.
[oracle@test-rac2 ~]$ crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon.
SQL> select status from v$instance;
select status from v$instance
*
ERROR at line 1:
ORA-03135: connection lost contact
Process ID: 17034
Session ID: 50 Serial number: 2969
Whether the private interconnect of the 1st node or of the 2nd node fails does not matter: Oracle will always reboot the 2nd node (the node with the lowest node number survives).
So if the private network of node 2 fails or goes down, node 2 itself is restarted. Once the server is back up, bring the private interface up again:
#/sbin/ip link set eth2 up
[oracle@test-rac1 ~]$ crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora.DATA.dg    ora....up.type ONLINE    ONLINE    test-rac1
ora....ER.lsnr ora....er.type ONLINE    ONLINE    test-rac1
ora....N1.lsnr ora....er.type ONLINE    ONLINE    test-rac1
ora....N2.lsnr ora....er.type ONLINE    ONLINE    test-rac1
ora....N3.lsnr ora....er.type ONLINE    ONLINE    test-rac1
ora.asm        ora.asm.type   ONLINE    ONLINE    test-rac1
ora.cvu        ora.cvu.type   ONLINE    ONLINE    test-rac1
ora.gsd        ora.gsd.type   OFFLINE   OFFLINE
ora....network ora....rk.type ONLINE    ONLINE    test-rac1
ora.oc4j       ora.oc4j.type  ONLINE    OFFLINE
ora.ons        ora.ons.type   ONLINE    ONLINE    test-rac1
ora.rac.db     ora....se.type ONLINE    ONLINE    test-rac1
ora.scan1.vip  ora....ip.type ONLINE    ONLINE    test-rac1
ora.scan2.vip  ora....ip.type ONLINE    ONLINE    test-rac1
ora.scan3.vip  ora....ip.type ONLINE    ONLINE    test-rac1
ora....SM1.asm application    ONLINE    ONLINE    test-rac1
ora....C1.lsnr application    ONLINE    ONLINE    test-rac1
ora....ac1.gsd application    OFFLINE   OFFLINE
ora....ac1.ons application    ONLINE    ONLINE    test-rac1
ora....ac1.vip ora....t1.type ONLINE    ONLINE    test-rac1
ora....ac2.vip ora....t1.type ONLINE    ONLINE    test-rac1
OCR & Voting Disk Failure
Make sure you set up two different diskgroups (normal redundancy) to store the OCR:
One for the main OCR storage, the second for the OCR mirror.
Imagine we have two disk groups: if we lose the whole disk group holding the main OCR copy, we lose only one of the OCR locations & our clusterware will still be up and running.
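Adding the mirror is a one-line root operation; a sketch, where +OCRMIRROR is only an illustrative diskgroup name:
# as root, on one node
ocrconfig -add +OCRMIRROR   # registers a second OCR location in the extra diskgroup
ocrcheck                    # should now list both OCR locations with a successful integrity check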
Use at least a normal-redundancy diskgroup with 3 failure groups to hold the voting disk files.
In order to be able to tolerate the failure of n voting files, the cluster must have at least 2n+1 voting files configured, for example 3 voting files tolerate the loss of 1 and 5 tolerate the loss of 2.
Before creating the voting disks, check the existing disks on node 1 & follow the steps below:
[oracle@test-rac1 ~]$ cd /dev/oracleasm/disks/
[oracle@test-rac1 disks]$ ls
DISK1  DISK2  DISK3
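To see where the OCR and the voting files currently live, and to move the voting files into a dedicated diskgroup, something like the following can be used (a sketch only; +VOTE is an illustrative diskgroup name, and the replace command must be run as root):
ocrcheck                       # shows the OCR location(s) and their integrity status
crsctl query css votedisk      # lists the voting files and the diskgroup that holds them
crsctl replace votedisk +VOTE  # as root: moves the voting files into the +VOTE diskgroup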