記一次生產環境卡頓優化過程:大事務并發回滾
概述
最近生產環境有這么個現象,平時的訂單調度只需要2s內可以出結果,但是多個人調度就會卡住,超過15分鐘都沒有結果出來,有時還會失敗然后導致數據不準確。

下面記錄一下生產環境卡頓時排查的過程。
1、獲取ASH報告
- SQL> @?/rdbms/admin/ashrpt.sql
- --To specify absolute begin time:
- --[MM/DD/YY]] HH24:MI[:SS]
- --08/09/19 08:40:00




2、ASH分析
(1)Top User Events

(2)相關sql
Top SQL with Top Events

sql明細

(3)存儲過程

(4)TOP sessions

從上面分析可以看到兩個明顯的等待事件:wait for stopper event to be increased 等待事件和wait for a undo record 等待事件,這個應該是批量任務調度的時候產生了大量的大事務,產生了一些回滾造成了嚴重的資源消耗
3、處理大事務并發回滾
一般情況下wait for stopper event to be increased 等待事件是跟wait for a undo record 等待事件聯系起來的。
對于這個等待事件metalink上面有一篇文檔
- 464246.1
- Sometimes Parallel Rollback of Large Transaction may become very slow. After killing a large running transaction
- (either by killing the shadow process or aborting the database) then database seems to hang, or smon and parallel query servers
- taking all the available cpu.
- In fast-start parallel rollback, the background process Smon acts as a coordinator and rolls back a set of transactions in parallel
- using multiple server processes. Fast start parallel rollback is mainly useful when a system has transactions that run a long time
- before comitting, especially parallel Inserts, Updates, Deletes operations. When Smon discovers that the amount of recovery work is
- above a certain threshold, it automatically begins parallel rollback by dispersing the work among several parallel processes.
- There are cases where parallel transaction recovery is not as fast as serial transaction recovery, because the pq slaves are interfering
- with each other. It looks like the changes made by this transaction cannot be recovered in parallel without causing a performance problem.
- The parallel rollback slave processes are most likely contending for the same resource, which results in even worse rollback performance
- compared to a serial rollback.
解決的辦法:
- --關掉并發回滾,變成串行回滾(直接重啟解決)
- sql> alter system set fast_start_parallel_rollback = false scope=spfile;

通常,如果有很多并發進程,可以根據v$px_session視圖去查看,查看v$px_session視圖,發現所有的并發進程都是由smon進程導致(即qcsid列為smon進程的session id)
而smon進程的等待事件為wait for stopper event to be increased
即smon進程在做大事務的回滾,默認參數fast_start_parallel_rollback參數為low,即回滾時會啟動2*CPU個數 個并發進程。而由于是使用并發,所以可能由于并發之間相互使用共同的資源,導致回滾速度更慢。因為是生產環境,不能隨便重啟,所以我用了下面的方法來修改這個參數:
(1)查找smon進程ID
- select pid,spid,pname,username,tracefile from v$process where pname='SMON'

(2)禁用smon進程的事務清理(Disable SMON transaction cleanup)
- oradebug setorapid 'SMON's Oracle PID';
- oradebug event 10513 trace name context forever, level 2

(3)查詢V$FAST_START_SERVERS視圖,將所有smon啟用的并發進程殺掉

(4)修改fast_start_parallel_rollback參數
- alter system set fast_start_parallel_rollback=false;
(5)啟用smon進程的事務清理(enable transaction recovery)
- oradebug setorapid 'SMON's Oracle PID';
- oradebug event 10513 trace name context off
(6)獲得tracefile name
- oradebug tracefile_name

(7)驗證

4、業務驗證
修改后去業務驗證,到高峰期還是有卡頓現象,不過頻率減少了很多,報錯之類的也沒有了,同時觀察新的報告可以發現并發回滾之類的等待事件已經沒有了。