MongoDb 中 serverStatus was very slow 的原因分析

MongoDb 中 serverStatus was very slow 的原因分析

在同一个集群中,两个从节点 db117,db118 都各出现了1次 “serverStatus was very slow”


db117出现 very slow:


2017-09-14T18:55:49.599+0800 [conn13113916] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after cursors: 0, after dur: 0, after extra_info: 0, after globalLock: 0, after indexCounters: 0, after locks: 0, after network: 0, after opcounters: 0, after opcountersRepl: 0, after recordStats: 1214, after repl: 1214, at end: 1214 }

前后没有出现大的查询(>10秒以上)。其它性能指标也正常。

只是前面看到空间增加了30个 连接。

2017-09-14T18:55:48.157+0800 [initandlisten] connection accepted from 10.116.207.64:57026 #13113915 (264 connections now open)

...................

2017-09-14T18:55:49.551+0800 [initandlisten] connection accepted from 10.174.177.10:64840 #13113946 (295 connections now open)

2017-09-14T18:55:49.599+0800 [conn13113916] serverStatus was very slow:

应该是突然查询数量上来,虽然每个查询的时间不长,但IO,CPU 有个波动的原因。

(但在监控数据中也没看到有突增)


此现象在DB118 中昨天也出现一次,但不是在同时。


2017-09-14T14:12:16.102+0800 [initandlisten] connection accepted from 10.116.207.112:10800 #13301443 (286 connections now open)

..............................................................

2017-09-14T14:12:19.780+0800 [initandlisten] connection accepted from 10.168.168.212:51264 #13301552 (395 connections now open)

2017-09-14T14:12:19.819+0800 [conn13300638] query Mallcoo.MallCard query: { UID: 17433632, Status: 1, MallID: 273 } planSummary: IXSCAN { UID: 1.0, MallID: 1.0 } ntoreturn:1 ntoskip:0 nscanned:1 nscannedObjects:1 keyUpdates:0 numYields:0 locks(micros) r:4079097 nreturned:1 reslen:99 4079ms

2017-09-14T14:12:19.844+0800 [conn13301520] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after cursors: 0, after dur: 0, after extra_info: 0, after globalLock: 0, after indexCounters: 0, after locks: 0, after network: 0, after opcounters: 0, after opcountersRepl: 0, after recordStats: 1516, after repl: 1516, at end: 1526 }

2017-09-14T14:12:19.844+0800 [conn13301520] command admin.$cmd command: serverStatus { serverStatus: 1.0 } keyUpdates:0 numYields:0 locks(micros) r:83 reslen:7456 1557ms


DB118 14:12 时间,IO有一个高峰点,从 2%-->12%。而db117 却没有,问题原因比较难找。

从两次的slow 日志来看,

2017-09-14T14:12:19.844+0800 [conn13301520] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after cursors: 0, after dur: 0, after extra_info: 0, after globalLock: 0, after indexCounters: 0, after locks: 0, after network: 0, after opcounters: 0, after opcountersRepl: 0, after recordStats: 1516, after repl: 1516, at end: 1526 }


慢的原因都在:after recordStats: 1516 ,

也就是说有查询,而内存里没数据,出现了大量的 page faults.

再就是IO ,CPU 有个暂时的繁忙状态。


这两个从节点内存现在为64G,虽然不是很高。但暂时因为业务相对增长比较慢,不会增加内存。

如果有大量的此类预警日志出现,现场可用 mongostat 查看 faults 状态值情况,如果一直都高于 5以上。可以认为是内存不足了。

推荐阅读