还是年前的时候,绩效系统一直告警,查看了下,库上一直报ora-4031,很常见的错误,当时下意识的认为是内存不足了。
应该绩效手上在exadata一体机上,该一体机上还装了其他几套系统,内存资源比较紧张。
当时flush了下shared_pool,但没啥效果,因为是rac环境,便依次重启了实例,故障就消失了。
没过多久,绩效数据库又出现了这样的情况,共享池不足,就感觉不对劲了,但当时是年底了,马上就过年了,没啥心思探究。
结果过年期间,同属于exadata上crm系统也出现这样的情况,基本确定确实有问题了。
年后回来事情一直比较多,今天下午得空排查了一下,发现share pool一直在增长。大概可以推算出,由于共享池无节制的增长,导致buffer cache一直在被调缩,
直至share pool占据了绝大多数的sga,无法再增长,继而报错,导致数据库无法访问。
数据库版本及补丁
12.2.0.1181016
在MOS上搜了下,相关bug还挺多,到处瞧到处看,找到如下文档:
Bug 27824540 - ORA-4031 Error In Shared Pool Due To Leakage Of 'ges resource dynamic' Chunk In RAC Env (Doc ID 27824540.8)
Too many objects "ges resource dynamic" were allocated in the shared pool eventually failing with errors like: ORA-04031
Monitoring "ges resource dynamic" growth via the following shows a general upward trend:
select inst_id, name, round(bytes/(1024*1024*1024),1) in_gb from gv$sgastat where name = 'ges resource dynamic';
- example after an instance restart:
INST_ID NAME IN_GB
---------- -------------------------- ----------
1 ges resource dynamic .2
2 ges resource dynamic .4
3 ges resource dynamic .7
- example after several days of uptime shows upward trend :
INST_ID NAME IN_GB
---------- -------------------------- ----------
1 ges resource dynamic 8.6
2 ges resource dynamic 13.2
3 ges resource dynamic 16.1
Note:
When not getting 4031, but proactive monitoring DB performance using AWR diff reports, just check in section : "SGA Breakdown Difference",
and notice the %Diff values that will show the memory leak
Rediscovery Notes
-- instance alert log shows :
Errors in file /<path>/diag/rdbms/<db_name>/<oracle_sid>/trace/<oracle_sid>_lmd1_<pid>.trc (incident=nnnnnnn):
ORA-04031: unable to allocate 13840 bytes of shared memory ("shared pool","unknown object","sga heap(1,0)","ges resource dynamic")
Incident details in: /<path>/diag/rdbms/<db_name>/<oracle_sid>/incident/incdir_NNNNNN/<oracle_sid>_lmd1_<pid>_<iNNNNNN>.trc
-- incident trc file shows :
ORA-04031: unable to allocate 13840 bytes of shared memory ("shared pool","unknown object","sga heap(1,0)","ges resource dynamic")
=============================================
TOP 10 MEMORY USES FOR SGA HEAP SUB POOL 1
----------------------------------------------
"ges resource dynamic " 10 GB 55%
"free memory " 4426 MB 25%
"ges enqueues " 1204 MB 7%
"gcs resources " 371 MB 2%
============================================
TOP 10 MAXIMUM MEMORY USES FOR SGA HEAP SUB POOL 1
----------------------------------------------
"ges resource dynamic " 10 GB
"SQLA " 7254 MB
"free memory " 5676 MB
-- If a problem fulfills all of the conditions below, it is a duplicate of this problem.
1. Many objects "ges resource dynamic" are allocated in the shared pool.
2. Trace file of LMHB process traced Action 11 (kjgcr_GrowResourceCache) was executed, and was not resetted.
============================================================
kjgcr_GrowResourceCache: LCP waits high, allowing res cache growth
============================================================
3. Trace file of LMHB process keep tracing failure of metric 7 (check lck heartbeat).
============================================================
kjgcr_ChkGlobalMetric: metric 7 (check lck heartbeat) failed for dbname ISPP01, inst 2, node 2
============================================================
Workaround
There are 2 possible workarounds.
1) Disable the action 11.
SQL> oradebug setorapname LMHB
SQL> oradebug dyn_gcr -a 11 -disable
Note:
- This oradebug command is available on 12.2 and later.
- To keep it persistent across instance restarts one can do as below :
$ sqlplus "/ as sysdba"
SQL> alter system set "_oradebug_cmds_at_startup"='dyn_gcr -a 11 -disable' scope=spfile sid='*';
Then restart database to make change take effect.
2) Disable the GES resource cache
Set the initialization parameter "_ges_direct_free" to TRUE.
Note:
- Completely disabling GES resource cache may lead to some other side affects like contention on TM lock for insert statements, so use this workaround with
caution if there is still a need.
-----------------
在库上执行了这条sql
select inst_id, name, round(bytes/(1024*1024*1024),1) in_gb from gv$sgastat where name = 'ges resource dynamic';
发现ges resource dynamic已经占了5,6g的内存,但其他几套12.2环境,补丁集是200414,不存在这样的问题。
打算采用workaround的方式来处理下,即
alter system set "_oradebug_cmds_at_startup"='dyn_gcr -a 11 -disable' scope=spfile sid='*';
--------------------------------------
年后回来,遇到了好几个oracle rac方面的bug,脑壳疼,后面有时间就整理记录下来。