In a high-traffic database, it is inevitable that concurrent sessions will block each other. One blocked session can in turn block other sessions, so a single blocker can give rise to a whole chain of blocking sessions.

Depending on how long ago the blocking occurred, information on blocking issues can be retrieved from three different views:

  • GV$SESSION: Provides real-time information about all current sessions, including inactive ones.
  • GV$ACTIVE_SESSION_HISTORY: Captures recent session activity samples held in memory.
  • DBA_HIST_ACTIVE_SESS_HISTORY: Stores historical session activity samples persisted on disk.
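
Which source to query depends on how far back you need to look. A quick way to check the reach of each source (a sketch; run per instance, or against the GV$ views on RAC):

-- Oldest sample still in the in-memory ASH buffer
SELECT MIN(sample_time) FROM v$active_session_history;

-- Oldest sample persisted to the AWR on disk
SELECT MIN(sample_time) FROM dba_hist_active_sess_history;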

These views share a set of columns for identifying blocking issues.

When the value of the BLOCKING_SESSION_STATUS column is VALID, the BLOCKING_INSTANCE (named BLOCKING_INST_ID in the two ASH views), BLOCKING_SESSION, and BLOCKING_SESSION_SERIAL# columns contain the instance ID, session ID, and serial number of the blocking session, respectively.
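
For an issue happening right now, a minimal real-time check against GV$SESSION might look like this (a sketch; suitable privileges assumed):

SELECT inst_id,
       sid,
       serial#,
       event,
       blocking_instance,
       blocking_session
FROM   gv$session
WHERE  blocking_session_status = 'VALID';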

By utilizing these views and columns, blocking issues can be effectively identified and troubleshot. Here is an example of using ASH to identify a chain of blocking sessions.

One of my clients complained that the database became very slow between midnight and 2 AM, even though the workload during this time is usually low. I started by querying the distribution of wait events in this period from ASH:

-- Distribution of wait events between midnight and 2 AM today.
-- Each DBA_HIST_ACTIVE_SESS_HISTORY row is one ~10-second sample,
-- so COUNT(*) approximates relative wait time.
SELECT
    NVL(event, 'ON CPU') AS event,
    NVL(wait_class, 'CPU') AS wait_class,
    COUNT(*) AS total_wait_time,
    TO_CHAR(100 * TRUNC(RATIO_TO_REPORT(COUNT(*)) OVER (), 6), 'FM990.9999') || '%' AS "%"
FROM
    dba_hist_active_sess_history
WHERE
    sample_time BETWEEN TRUNC(SYSDATE) + INTERVAL '0 00:00:00' DAY TO SECOND
                    AND TRUNC(SYSDATE) + INTERVAL '0 02:00:00' DAY TO SECOND
GROUP BY
    event,
    wait_class
ORDER BY
    total_wait_time DESC
FETCH FIRST 10 ROWS ONLY;

The output of the query revealed the following:

EVENT                        WAIT_CLASS        TOTAL_WAIT_TIME %
log file switch completion   Configuration               99213 56.9754%
log file sync                Commit                      25953 14.9041%
gc buffer busy release       Cluster                     21965 12.6139%
gc buffer busy acquire       Cluster                     10520 6.0413%
buffer busy waits            Concurrency                  5309 3.0488%
log buffer space             Configuration                3371 1.9358%
ON CPU                       CPU                          1544 0.8866%
db file sequential read      User I/O                     1215 0.6977%
gcs log flush sync           Other                         916 0.526%
gc cr request                Cluster                       625 0.3589%

10 rows selected.

From the output, we can see that log file switch completion and log file sync were the dominant wait events, accounting for 56.9754% and 14.9041% of samples respectively. The waits that normally top this list, CPU and user I/O (ON CPU and db file sequential read), barely registered in this period. I suspected a blockage around log file access, so I used the following query to find out who was blocking the waiting sessions:

-- For the two dominant log events, which sessions were doing the blocking?
SELECT
    blocking_session || ':' || blocking_session_serial# || '@' || blocking_inst_id AS blocking_sess,
    blocking_session_status,
    COUNT(*)
FROM dba_hist_active_sess_history
WHERE
    blocking_session_status = 'VALID' AND
    blocking_session IS NOT NULL AND
    sample_time BETWEEN TRUNC(SYSDATE) + INTERVAL '0 00:00:00' DAY TO SECOND
                    AND TRUNC(SYSDATE) + INTERVAL '0 02:00:00' DAY TO SECOND AND
    event IN ('log file switch completion', 'log file sync')
GROUP BY
    blocking_session || ':' || blocking_session_serial# || '@' || blocking_inst_id,
    blocking_session_status
ORDER BY
    COUNT(*);

The output of this query is as follows:

BLOCKING_SESS   BLOCKING_SESSION_STATUS     COUNT(*)
3367:1@1        VALID                          19511
3367:1@2        VALID                          95938

It shows that on both the first and second nodes, the session with SID 3367 and serial number 1 caused a significant amount of blocking on log file access. To gain more insight, I queried the details of these blocking sessions:

-- What were the blocking sessions (SID 3367, serial# 1) themselves doing?
SELECT
    instance_number,
    program,
    event,
    blocking_session || ':' || blocking_session_serial# || '@' || blocking_inst_id AS blocking_sess,
    blocking_session_status,
    COUNT(*)
FROM
    dba_hist_active_sess_history
WHERE
    SESSION_ID = 3367 AND
    SESSION_SERIAL# = 1 AND
    sample_time BETWEEN TRUNC(SYSDATE) + INTERVAL '0 00:00:00' DAY TO SECOND
                  AND TRUNC(SYSDATE) + INTERVAL '0 02:00:00' DAY TO SECOND
GROUP BY
    instance_number,
    program,
    event,
    blocking_session || ':' || blocking_session_serial# || '@' || blocking_inst_id,
    blocking_session_status
ORDER BY
    COUNT(*);

The details of these blocking sessions are as follows:

  INSTANCE_NUMBER PROGRAM                  EVENT                     BLOCKING_SESS   BLOCKING_SESSION_STATUS     COUNT(*)
                2 oracle@yxgldb02 (LGWR)   enq: CF - contention      6181:29725@2    VALID                             13
                1 oracle@yxgldb01 (LGWR)   enq: CF - contention      6181:29725@2    VALID                             16
                2 oracle@yxgldb02 (LGWR)   log file parallel write   :@              NO HOLDER                         28
                2 oracle@yxgldb02 (LGWR)   enq: CF - contention      17754:31107@1   VALID                             61

We see that these two sessions were the LGWR background processes of the two instances. Most of their time was spent waiting on enq: CF - contention, indicating contention for the control file enqueue. I continued investigating their blockers with the following query:

-- One level up the chain: what blocked the sessions that blocked LGWR?
SELECT
    session_id || ':' || session_serial# || '@' || instance_number sess,
    program,
    event,
    blocking_session || ':' || blocking_session_serial# || '@' || blocking_inst_id AS blocking_sess,
    blocking_session_status,
    COUNT(*)
FROM
    dba_hist_active_sess_history
WHERE
    (
        (INSTANCE_NUMBER = 2 AND SESSION_ID = 6181 AND SESSION_SERIAL# = 29725)
        OR (INSTANCE_NUMBER = 1 AND SESSION_ID = 17754 AND SESSION_SERIAL# = 31107)
    )
    AND sample_time BETWEEN TRUNC(SYSDATE) + INTERVAL '0 00:00:00' DAY TO SECOND
                        AND TRUNC(SYSDATE) + INTERVAL '0 02:00:00' DAY TO SECOND
GROUP BY
    session_id || ':' || session_serial# || '@' || instance_number,
    program,
    event,
    blocking_session || ':' || blocking_session_serial# || '@' || blocking_inst_id,
    blocking_session_status
ORDER BY
    COUNT(*);

The output of this further investigation is as follows:

SESS           PROGRAM                     EVENT                             BLOCKING_SESS   BLOCKING_SESSION_STATUS     COUNT(*)
6181:29725@2   rman@yxgldb02 (TNS V1-V3)   enq: CF - contention              5509:32237@1    VALID                              1
6181:29725@2   rman@yxgldb02 (TNS V1-V3)   ges inquiry response              613:1@2         VALID                              1
6181:29725@2   rman@yxgldb02 (TNS V1-V3)   Backup: MML delete backup piece   :@              NO HOLDER                          1
6181:29725@2   rman@yxgldb02 (TNS V1-V3)   Backup: MML query backup piece    :@              NO HOLDER                          1
6181:29725@2   rman@yxgldb02 (TNS V1-V3)   enq: CF - contention              4608:13189@1    VALID                              2
6181:29725@2   rman@yxgldb02 (TNS V1-V3)   Backup: MML command to channel    :@              NO HOLDER                          3
6181:29725@2   rman@yxgldb02 (TNS V1-V3)   enq: CF - contention              :@              GLOBAL                             5
6181:29725@2   rman@yxgldb02 (TNS V1-V3)   control file parallel write       :@              UNKNOWN                            5
6181:29725@2   rman@yxgldb02 (TNS V1-V3)   control file sequential read      :@              UNKNOWN                           33

9 rows selected.

We do not see the blocker session on the first node: DBA_HIST_ACTIVE_SESS_HISTORY keeps only one sample every ten seconds, so a session that holds an enqueue can easily escape capture. The blocking session on the second node was an RMAN backup job. Typically, the majority of a backup job's wait events relate to I/O operations on the datafiles. This particular backup job, however, spent a significant amount of time accessing the control file.
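
Had the problem still been in memory, GV$ACTIVE_SESSION_HISTORY, which holds one sample per second instead of one per ten seconds, would have given a better chance of catching the first-node holder. A sketch of the same lookup against the in-memory view:

SELECT sample_time,
       event,
       blocking_session || ':' || blocking_session_serial# || '@' || blocking_inst_id AS blocking_sess
FROM   gv$active_session_history
WHERE  inst_id = 1
  AND  session_id = 17754
  AND  session_serial# = 31107
ORDER  BY sample_time;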

Further investigation revealed that the control file, which is typically a few tens of megabytes, had grown to a daunting 300 GB. This was due to the Oracle parameter control_file_record_keep_time having accidentally been set to 365 days.

SQL> show parameter control_file_reco
NAME                          TYPE    VALUE
----------------------------- ------- -----
control_file_record_keep_time integer 365


This excessive setting caused the control file to grow too large, slowing down RMAN’s access to it and causing a chain of blockages.
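
Two dictionary views are handy for confirming this kind of bloat. A sketch (suitable privileges assumed; sizes will vary):

-- Physical size of each control file copy
SELECT name,
       block_size * file_size_blks / 1024 / 1024 AS size_mb
FROM   v$controlfile;

-- Which record sections consume the space?
SELECT type,
       records_total,
       records_used,
       ROUND(record_size * records_total / 1024 / 1024) AS alloc_mb
FROM   v$controlfile_record_section
ORDER  BY alloc_mb DESC;

The remediation is to set the parameter back to a sane retention (the default is 7 days); note that an already bloated control file typically has to be recreated to reclaim the space:

SQL> ALTER SYSTEM SET control_file_record_keep_time = 7 SCOPE=BOTH;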

Here is the blocking chain breakdown:

  • RMAN Process: Took too long to read the control file, blocking the LGWR processes from writing to it (enq: CF - contention wait event).
  • Blocked LGWR Processes: Blocked the numerous DML transactions that need LGWR to write redo to the log file at commit (log file switch completion and log file sync wait events).
  • Uncommitted Transactions: Blocked other transactions, leading to gc buffer busy release, gc buffer busy acquire, and buffer busy waits events.

Below is a visual representation of the blocking chain:

+----------------------------------------------------------+
|                       RMAN Process                       |
|                                                          |
| Wait Event: control file sequential read                 |
+----------------------------------------------------------+
                             |
                             |
                             v
+----------------------------------------------------------+
|                       LGWR Process                       |
|                                                          |
| Wait Event: enq: CF - contention                         |
+----------------------------------------------------------+
                             |
                             |
                             v
+----------------------------------------------------------+
|                     DML Transactions                     |
|                                                          |
| Wait Events: log file switch completion, log file sync   |
+----------------------------------------------------------+
                             |
                             |
                             v
+----------------------------------------------------------+
|         Uncommitted Transactions Blocking Others         |
|                                                          |
| Wait Events: gc buffer busy release,                     |
| gc buffer busy acquire, buffer busy waits                |
+----------------------------------------------------------+
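
For completeness: once you know a sample of interest, a hierarchical query can reconstruct such a chain directly from ASH. The sketch below assumes a single instance and a single, hypothetical SAMPLE_ID; a cross-instance chain like the one above needs a looser join, for example on overlapping sample times.

SELECT LPAD(' ', 2 * (LEVEL - 1)) ||
       session_id || ':' || session_serial# AS sess,
       event
FROM (
    SELECT session_id, session_serial#, event,
           blocking_session, blocking_session_serial#
    FROM dba_hist_active_sess_history
    WHERE instance_number = 2
      AND sample_id = 12345678  -- hypothetical sample of interest
)
START WITH blocking_session IS NULL      -- chain roots: sessions not blocked
CONNECT BY PRIOR session_id      = blocking_session
       AND PRIOR session_serial# = blocking_session_serial#;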

In summary, the troubleshooting process revealed that the root cause of the performance issue was an excessively large control file, which led to significant wait times and a chain of blocked sessions.
