Tuesday, November 24, 2015

Database hang troubleshooting - Hanganalyze

What can we do when the whole database appears to be hung? This post describes hanganalyze and how it can be used to get more information on why the database is hung.

Hang Analyze

Hanganalyze provides information on the processes involved in a hang chain. Where possible, it is recommended to take two hanganalyze dumps a short time apart in order to determine whether processes are still making progress or are genuinely hung.

The hanganalyze command accepts several levels. Level 3 collects a short stack trace for the processes in the hang chain. Level 4 additionally dumps the leaf nodes (database session processes) in the chain. Levels higher than 4 should generally only be used when requested by Oracle Support.

In a multitenant database, if only one PDB is hung, run hanganalyze from the PDB level. If all PDBs are stuck, run hanganalyze from the root container.

$ sqlplus / as sysdba

SQL> oradebug setmypid
Statement processed.
SQL> oradebug unlimit
Statement processed.
SQL> oradebug hanganalyze 3
Hang Analysis in /home/oracle/app/oracle/diag/rdbms/cdb1/cdb1/trace/cdb1_ora_6612.trc
SQL> oradebug hanganalyze 3
Hang Analysis in /home/oracle/app/oracle/diag/rdbms/cdb1/cdb1/trace/cdb1_ora_6612.trc
SQL> exit
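
For the single-PDB case, one approach is to switch the sysdba session into the affected PDB before running hanganalyze. A minimal sketch, assuming the hung PDB is the ORCL PDB seen in the traces below:

$ sqlplus / as sysdba

SQL> alter session set container = ORCL;
Session altered.
SQL> oradebug setmypid
Statement processed.
SQL> oradebug unlimit
Statement processed.
SQL> oradebug hanganalyze 3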

For a RAC database the command would be:

SQL> oradebug -g all hanganalyze 3
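
Note that with '-g all' the dump is executed by the DIAG background process, so the output is typically written to the DIAG trace file on each instance (e.g. a file matching *_diag_*.trc) rather than to your own session's trace file.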

Depending on the issue, the trace file is fairly straightforward to understand. First there is a header section showing where hanganalyze was run, the instance, the level, and so on. Then it lists information about the most likely cause of any hangs.

*** 2015-11-21 23:18:01.466
===============================================================================
HANG ANALYSIS:
  instances (db_name.oracle_sid): cdb1.cdb1
  oradebug_node_dump_level: 3
  analysis initiated by oradebug
  os thread scheduling delay history: (sampling every 1.000000 secs)
    0.000000 secs at [ 23:18:01 ]
      NOTE: scheduling delay has not been sampled for 0.283622 secs
    0.000000 secs from [ 23:17:57 - 23:18:02 ], 5 sec avg
    0.000000 secs from [ 23:17:02 - 23:18:02 ], 1 min avg
    0.000264 secs from [ 23:13:01 - 23:18:02 ], 5 min avg
  vktm time drift history
===============================================================================
Chains most likely to have caused the hang:
 [a] Chain 1 Signature: 'SQL*Net message from client'<='enq: TX - row lock contention'
     Chain 1 Signature Hash: 0x38c48850

Under the "Non-intersecting chains" section, the trace shows the details of the sessions in the most likely hang chain. The hung session(s) are detailed, including the session identifiers, the wait event and its parameters, along with the current SQL that each session is running.

-------------------------------------------------------------------------------
Chain 1:
-------------------------------------------------------------------------------
    Oracle session identified by:
    {
                instance: 1 (cdb1.cdb1)
                   os id: 6522
              process id: 80, oracle@devdb.example.com
              session id: 26
        session serial #: 37281
                  pdb id: 3 (ORCL)
    }
    is waiting for 'enq: TX - row lock contention' with wait info:
    {
                      p1: 'name|mode'=0x54580006
                      p2: 'usn<<16 | slot'=0x9000a
                      p3: 'sequence'=0xb1e
            time in wait: 25 min 6 sec
           timeout after: never
                 wait id: 15
                blocking: 0 sessions
             current sql: update wait_chain_test set col1=1
    }

The trace file then shows which session the hung sessions are blocked by, or "hung on". So from this trace file the database is not actually hung; there is just one blocking session. Perhaps a database user's session got blocked and they decided the whole database was hung.

and is blocked by
 => Oracle session identified by:
    {
                instance: 1 (cdb1.cdb1)
                   os id: 5827
              process id: 77, oracle@devdb.example.com
              session id: 133
        session serial #: 65076
                  pdb id: 3 (ORCL)
    }
    which is waiting for 'SQL*Net message from client' with wait info:
    {
                      p1: 'driver id'=0x54435000
                      p2: '#bytes'=0x1
            time in wait: 22 min 47 sec
           timeout after: never
                 wait id: 74
                blocking: 1 session
             current sql:
    }
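
As a cross-check outside the trace file, the blocker can be confirmed from v$session and, if appropriate, killed. A sketch using the identifiers from the trace above (only kill the session once you are sure it is safe to do so):

SQL> select sid, serial#, blocking_session, event from v$session where blocking_session is not null;
SQL> alter system kill session '133,65076';

In this example the query would show session 26 blocked by session 133 on 'enq: TX - row lock contention', matching the hang chain.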

Another part of the trace file shows what extra traces would be dumped at higher levels of hanganalyze, so we can predict the impact before re-running it at a higher level.

===============================================================================
Extra information that will be dumped at higher levels:
[level  4] :   1 node dumps -- [LEAF] [LEAF_NW]
[level  5] :   1 node dumps -- [NO_WAIT] [INVOL_WT] [SINGLE_NODE] [NLEAF] [SINGLE_NODE_NW]


Taking a look at another hanganalyze trace file, this time the database is hung because the fast recovery area (FRA) is full. This time a level 4 dump was taken. The "Chains most likely to have caused the hang" section shows sessions hung on 'rdbms ipc message', 'log file switch (archiving needed)' and 'control file parallel write'.

===============================================================================
HANG ANALYSIS:
  instances (db_name.oracle_sid): cdb1.cdb1
  oradebug_node_dump_level: 4
  analysis initiated by oradebug
  os thread scheduling delay history: (sampling every 1.000000 secs)
    0.000000 secs at [ 05:18:55 ]
      NOTE: scheduling delay has not been sampled for 0.433364 secs
    0.000000 secs from [ 05:18:51 - 05:18:56 ], 5 sec avg
    0.000000 secs from [ 05:17:56 - 05:18:56 ], 1 min avg
    0.000000 secs from [ 05:13:56 - 05:18:56 ], 5 min avg
  vktm time drift history
===============================================================================
Chains most likely to have caused the hang:
 [a] Chain 1 Signature: 'rdbms ipc message'<='log file switch (archiving needed)'<='buffer busy waits'
     Chain 1 Signature Hash: 0xe6b594df
 [b] Chain 2 Signature: 'control file parallel write'<='enq: CF - contention'
     Chain 2 Signature Hash: 0xe82e6a14
 [c] Chain 3 Signature: 'rdbms ipc message'<='log file switch (archiving needed)'
     Chain 3 Signature Hash: 0xfc07a352

Looking at the details in the non-intersecting/intersecting chains sections of the trace file, it shows that most sessions are blocked by the following session (LGWR):

and is blocked by
 => Oracle session identified by:
    {
                instance: 1 (cdb1.cdb1)
                   os id: 3872
              process id: 13, oracle@devdb.example.com (LGWR)
              session id: 121
        session serial #: 28831
                  pdb id: 1 (CDB$ROOT)
    }
    which is waiting for 'rdbms ipc message' with wait info:
. . .
Chain 3...
    and is blocked by 'instance: 1, os id: 3872, session id: 121',
    which is a member of 'Chain 1'.
Chain 4...
    and is blocked by 'instance: 1, os id: 3872, session id: 121',
    which is a member of 'Chain 1'.
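
Since the chains show archiving is stuck, checking fast recovery area usage confirms the diagnosis, and freeing space (or raising the limit) releases the hang. A sketch using the standard views and parameter; the 20G value is only an illustration:

SQL> select name, space_limit, space_used from v$recovery_file_dest;
SQL> select file_type, percent_space_used from v$recovery_area_usage;
SQL> alter system set db_recovery_file_dest_size = 20G;

Alternatively, back up and delete archived logs with RMAN to free space in the FRA.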

The trace file also has a section for the level 4 information, showing the additional processes that were dumped to trace. The two processes additionally dumped here are LGWR and an ARCn process.

===============================================================================
HANG ANALYSIS DUMPS:
  oradebug_node_dump_level: 4
===============================================================================
. . .
Dumping call stack for process with ospid 3872
Dumping detailed process information (fixed sga, context, etc.) for ospid 3872
  dump location: /home/oracle/app/oracle/diag/rdbms/cdb1/cdb1/trace/cdb1_lgwr_3872.trc
*** 2015-11-22 05:18:56.814
Dumping process information for ospid 3966
  dump location: /home/oracle/app/oracle/diag/rdbms/cdb1/cdb1/trace/cdb1_arc3_3966.trc

Hanganalyze trace files also show information about the 'state' of nodes. In the trace file, a node is a sequential number used by hanganalyze to identify each session.

===============================================================================
State of LOCAL nodes
([nodenum]/cnode/sid/sess_srno/session/ospid/state/[adjlist]):
[25]/1/26/37281/0x90fddf68/6522/NLEAF/[132]
[132]/1/133/65076/0x90943d58/5827/LEAF/
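
Reading the example above: node [25] is session id 26 (serial# 37281) in state NLEAF, and its adjlist [132] points at node [132], which is session id 133 (serial# 65076) in state LEAF. This is the same waiter/blocker pair seen in Chain 1 earlier.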

The states are usually one of the following (the descriptions below are from MOS Doc ID 215858.1):

NLEAF : These sessions are usually considered as “stuck” sessions. It means that there is another session holding a resource needed by the session in this state. By using the adjlist, you can determine which node is the blocker of this process.

LEAF and LEAF_NW: Leaf nodes are considered on top of the wait chain (usually blockers). They are considered “blockers” when there is another session waiting. This can be easily identified using the “predecessor” field. If there is a node referenced in the “predecessor” field, the node is considered a “blocker”; otherwise it is considered a “slow” session waiting for some resource. The difference between LEAF and LEAF_NW is that LEAF nodes are waiting for something, while LEAF_NW nodes are not waiting or may be using the CPU.

IN_HANG: This might be considered as the most critical STATE. Basically a node in this state is involved in a deadlock, or is hung. Usually there will be another “adjacent node” in the same status.

Prelim Connections

When a database is completely hung and even sysdba connections fail, we can connect using the '-prelim' option:

$ sqlplus -prelim sys@cdb1 as sysdba
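
A local bequeath connection avoids the listener entirely and also accepts the option:

$ sqlplus -prelim / as sysdba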

Note that from 11.2.0.2 onwards you can't run hanganalyze from a prelim connection directly. However, once connected, we can attach to a particular process with oradebug 'setorapname', in this case the DIAG background process, or to any database process using 'setospid' (thanks Tanel for letting me know about this).

SQL> oradebug setorapname diag
Oracle pid: 8, Unix process pid: 4191, image: oracle@devdb.example.com (DIAG)
SQL> oradebug unlimit
Statement processed.
SQL> oradebug dump hanganalyze 3
Statement processed.
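
The 'setospid' variant works the same way. A sketch attaching to the LGWR process from the earlier example (OS pid 3872):

SQL> oradebug setospid 3872
SQL> oradebug unlimit
SQL> oradebug dump hanganalyze 3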


******************************************
keywords: hanganalyze
******************************************
rdbms version: 12c
******************************************
