記一次生產環境卡頓優化過程：大事務并發回滾

作者：波波說運維 2019-08-15 11:30:06

最近生產環境有這么個現象，平時的訂單調度只需要2s內可以出結果，但是多個人調度就會卡住，超過15分鐘都沒有結果出來，有時還會失敗然后導致數據不準確。

概述

下面記錄一下生產環境卡頓時排查的過程。

1、獲取ASH報告

SQL> @?/rdbms/admin/ashrpt.sql 
--To specify absolute begin time: 
--[MM/DD/YY]] HH24:MI[:SS] 
--08/09/19 08:40:00

2、ASH分析

（1）Top User Events

（2）相關sql

Top SQL with Top Events

sql明細

（3）存儲過程

（4）TOP sessions

從上面分析可以看到兩個明顯的等待事件：wait for stopper event to be increased 等待事件和wait for a undo record 等待事件，這個應該是批量任務調度的時候產生了大量的大事務，產生了一些回滾造成了嚴重的資源消耗

3、處理大事務并發回滾

一般情況下wait for stopper event to be increased 等待事件是跟wait for a undo record 等待事件聯系起來的。

對于這個等待事件metalink上面有一篇文檔

464246.1 
Sometimes Parallel Rollback of Large Transaction may become very slow. After killing a large running transaction  
(either by killing the shadow process or aborting the database) then database seems to hang, or smon and parallel query servers  
taking all the available cpu. 
In fast-start parallel rollback, the background process Smon acts as a coordinator and rolls back a set of transactions in parallel  
using multiple server processes. Fast start parallel rollback is mainly useful when a system has transactions that run a long time  
before comitting, especially parallel Inserts, Updates, Deletes operations. When Smon discovers that the amount of recovery work is  
above a certain threshold, it automatically begins parallel rollback by dispersing the work among several parallel processes. 
There are cases where parallel transaction recovery is not as fast as serial transaction recovery, because the pq slaves are interfering 
with each other. It looks like the changes made by this transaction cannot be recovered in parallel without causing a performance problem.  
The parallel rollback slave processes are most likely contending for the same resource, which results in even worse rollback performance  
compared to a serial rollback.

解決的辦法：

--關掉并發回滾，變成串行回滾（直接重啟解決） 
sql> alter system set fast_start_parallel_rollback = false scope=spfile;

通常，如果有很多并發進程，可以根據v$px_session視圖去查看，查看v$px_session視圖，發現所有的并發進程都是由smon進程導致(即qcsid列為smon進程的session id)

而smon進程的等待事件為wait for stopper event to be increased

即smon進程在做大事務的回滾，默認參數fast_start_parallel_rollback參數為low，即回滾時會啟動2*CPU個數個并發進程。而由于是使用并發，所以可能由于并發之間相互使用共同的資源，導致回滾速度更慢。因為是生產環境，不能隨便重啟，所以我用了下面的方法來修改這個參數：

（1）查找smon進程ID

select pid,spid,pname,username,tracefile from v$process where pname='SMON'

（2）禁用smon進程的事務清理(Disable SMON transaction cleanup)

oradebug setorapid 'SMON's Oracle PID'; 
 oradebug event 10513 trace name context forever, level 2

（3）查詢V$FAST_START_SERVERS視圖，將所有smon啟用的并發進程殺掉

（4）修改fast_start_parallel_rollback參數

alter system set fast_start_parallel_rollback=false;

（5）啟用smon進程的事務清理(enable transaction recovery)

oradebug setorapid 'SMON's Oracle PID'; 
oradebug event 10513 trace name context off

（6）獲得tracefile name

oradebug tracefile_name

（7）驗證

4、業務驗證

修改后去業務驗證，到高峰期還是有卡頓現象，不過頻率減少了很多，報錯之類的也沒有了，同時觀察新的報告可以發現并發回滾之類的等待事件已經沒有了。

責任編輯：華軒來源：今日頭條

成人免费xxxxx在线视频软件_久久精品久久久_亚洲国产精品久久久_天天色天天色_亚洲人成一区_欧美一级欧美三级在线观看

記一次生產環境卡頓優化過程：大事務并發回滾