MongoDB Basics - Monitoring

Monitoring is critical to keeping a MongoDB database running smoothly. By tracking the right metrics, you can identify performance issues quickly and take corrective action. This post covers the key metrics to monitor when running MongoDB, along with suggested reference thresholds.

Capacity

The db.stats() command returns storage space information for the current database; run it against each database in turn.

Category	Indicator Name	Monitoring Item	Reference Threshold
Capacity	Index size	dbstats.indexSize	<= cacheSize
Capacity	Data size	dbstats.dataSize	<= 2T × 80%
Capacity	Storage size	dbstats.storageSize	<= diskSize × 60%

The WiredTiger cache (cacheSize) needs enough space to hold the indexes; otherwise performance suffers.
Disk space demand is roughly the sum of storageSize (the size of the WiredTiger-compressed data set) and indexSize, and this sum should stay below a watermark of around 80% of the disk.
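
The capacity thresholds in the table above can be checked mechanically. The following is a minimal sketch, assuming the db.stats() output has already been fetched as a plain dictionary (for example through a driver). The function name check_capacity and the cache_size and disk_size parameters are hypothetical and not part of MongoDB.

```python
def check_capacity(dbstats, cache_size, disk_size):
    """Return a list of warnings for a db.stats()-style document.

    All sizes are in bytes. cache_size and disk_size are supplied by the
    caller (hypothetical parameters, not fields returned by MongoDB).
    """
    warnings = []
    # Indexes should fit in the WiredTiger cache.
    if dbstats["indexSize"] > cache_size:
        warnings.append("indexSize exceeds cacheSize")
    # Compressed on-disk data should stay below 60% of the disk.
    if dbstats["storageSize"] > disk_size * 0.6:
        warnings.append("storageSize exceeds 60% of diskSize")
    # Total disk demand (data plus indexes) against an 80% watermark.
    if dbstats["storageSize"] + dbstats["indexSize"] > disk_size * 0.8:
        warnings.append("storageSize + indexSize exceeds 80% of diskSize")
    return warnings
```

An empty list means every check passed; the thresholds are the reference values from the table and should be tuned for the deployment.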

Resource Usage

Number of Connections

The db.serverStatus() command returns a complete set of database status indicators.

Category	Indicator Name	Monitoring Item	Reference Threshold
Connection	Available connections	connections.available	> 0
Connection	Current connections	connections.current	<= 8000

The number of incoming connections that a single process accepts can be limited by setting maxIncomingConnections, which defaults to 65536.
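
To turn the two connection counters into a single utilization figure, something like the sketch below may help. It assumes the connections sub-document of serverStatus is available as a dictionary and that current plus available approximates the effective limit of the process; connection_usage is a hypothetical helper name.

```python
def connection_usage(connections):
    """Return the fraction of the connection limit currently in use.

    `connections` is the `connections` sub-document of serverStatus, which
    reports `current` (open) and `available` (remaining) connections.
    """
    current = connections["current"]
    available = connections["available"]
    # Assumption: current + available approximates the effective limit.
    limit = current + available
    return current / limit if limit else 0.0
```

Alerting on this ratio (for example above 0.8) gives earlier warning than waiting for connections.available to reach 0.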

Concurrent Queue

Category	Indicator Name	Monitoring Item	Reference Threshold
Concurrency	Ticket read usage	wiredTiger.concurrentTransactions.read.out	< 128
Concurrency	Ticket write usage	wiredTiger.concurrentTransactions.write.out	< 128
Concurrency	Ticket read remaining	wiredTiger.concurrentTransactions.read.available	> 0
Concurrency	Ticket write remaining	wiredTiger.concurrentTransactions.write.available	> 0

The WiredTiger engine uses a ticket mechanism to limit concurrency. The number of tickets in use corresponds to the number of read and write operations executing simultaneously. When no tickets remain, new read and write requests are blocked and wait in a queue.
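
A ticket check along these lines can be sketched as follows, assuming the wiredTiger.concurrentTransactions sub-document of serverStatus is passed in as a dictionary with out, available, and totalTickets for both read and write. The helper name ticket_pressure is hypothetical.

```python
def ticket_pressure(concurrent_transactions):
    """Summarize read and write ticket usage.

    Expects the shape of serverStatus().wiredTiger.concurrentTransactions:
    {"read": {"out", "available", "totalTickets"}, "write": {...}}.
    """
    result = {}
    for kind in ("read", "write"):
        tickets = concurrent_transactions[kind]
        total = tickets["totalTickets"]
        result[kind] = {
            # Share of tickets currently handed out to running operations.
            "used_ratio": tickets["out"] / total if total else 0.0,
            # True when new operations of this kind would have to queue.
            "exhausted": tickets["available"] == 0,
        }
    return result
```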

Memory and Cache Usage

Category	Indicator Name	Monitoring Item	Reference Threshold
Memory	Physical memory	mem.resident	< OS.TotalMemory × 85%
Memory	Virtual memory	mem.virtual	< OS.TotalMemory
Cache	Cache usage size	wiredTiger.cache."bytes currently in the cache"	< maximum × 95%
Cache	Maximum cache size	wiredTiger.cache."maximum bytes configured"	None
Cache	Dirty cache size	wiredTiger.cache."tracked dirty bytes in the cache"	< maximum × 20%
Cache	Pages read into cache	wiredTiger.cache."pages read into cache"	Observe fluctuations
Cache	Unmodified eviction pages	wiredTiger.cache."unmodified pages evicted"	Observe fluctuations

WiredTiger uses both the file system cache and its own storage engine cache. By default, the engine cache is 50% of (RAM minus 1 GB) or 256 MB, whichever is larger. mem.resident is the physical memory occupied by MongoDB; poor schema design and redundant indexes can lead to excessive memory usage.
Dirty cache is data that has been modified in the cache but not yet flushed to disk. When the proportion of dirty data exceeds 20%, cache eviction is under heavy pressure and the response time of business requests rises accordingly. If write pressure is too high for the disk's write performance, the dirty ratio may stay high; improving disk performance or scaling horizontally can help.
For read-heavy workloads, it is best to reserve sufficient cache space.
Checkpoints and the TTL monitor produce writes in bursts. If disk performance is poor, I/O utilization will spike. If business latency jitters as a result, consider a shorter trigger interval so that writes are spread more evenly.
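
The 95% and 20% cache thresholds above are ratios against the configured maximum, so they have to be derived from the raw byte counters. A minimal sketch, assuming the wiredTiger.cache sub-document of serverStatus is available as a dictionary; cache_ratios and cache_alerts are hypothetical helper names, and the limits are the reference values from the table.

```python
def cache_ratios(cache):
    """Return used and dirty cache as fractions of the configured maximum."""
    maximum = cache["maximum bytes configured"]
    used = cache["bytes currently in the cache"]
    dirty = cache["tracked dirty bytes in the cache"]
    return {"used": used / maximum, "dirty": dirty / maximum}


def cache_alerts(cache, used_limit=0.95, dirty_limit=0.20):
    """Return warnings when a ratio reaches its reference threshold."""
    ratios = cache_ratios(cache)
    alerts = []
    if ratios["used"] >= used_limit:
        alerts.append("cache usage above 95%")
    if ratios["dirty"] >= dirty_limit:
        alerts.append("dirty cache above 20%")
    return alerts
```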

Throughput

Access Class Indicators

Category	Indicator Name	Monitoring Item	Reference Threshold
Access	Insert	opcounters.insert (growth rate)	Calculated by merging write operations
Access	Query	opcounters.query (growth rate)	Calculated by merging read operations
Access	Update	opcounters.update (growth rate)	Calculated by merging write operations
Access	Delete	opcounters.delete (growth rate)	Calculated by merging write operations
Access	Getmore	opcounters.getmore (growth rate)	Calculated by merging read operations
Access	Command	opcounters.command (growth rate)	<= 10000
Traffic	netIn	network.bytesIn (growth rate)	<= 100MB
Traffic	netOut	network.bytesOut (growth rate)	<= 100MB
Queue	Active read clients	globalLock.activeClients.readers	< 128
Queue	Active write clients	globalLock.activeClients.writers	< 128
Queue	Blocked read clients	globalLock.currentQueue.readers	< 32
Queue	Blocked write clients	globalLock.currentQueue.writers	< 32

opcounters count the operations the server has received since it started, so the growth rate of each operation type indicates the current access throughput.
Monitoring read and write request rates makes it possible to spot potential load bottlenecks early and expand capacity before problems occur.
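
Because opcounters are cumulative, the growth rate has to be computed from two samples taken some time apart. The sketch below assumes two opcounters documents captured from serverStatus and the number of seconds between them; opcounter_rates and merged_rates are hypothetical helper names, and treating a negative delta as a restart is an assumption of this sketch.

```python
def opcounter_rates(previous, current, interval_seconds):
    """Per-second rate of each opcounter between two samples."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    rates = {}
    for op, value in current.items():
        delta = value - previous.get(op, 0)
        # Counters restart from zero when the server restarts; treat a
        # negative delta as a restart and count from zero.
        if delta < 0:
            delta = value
        rates[op] = delta / interval_seconds
    return rates


def merged_rates(rates):
    """Merge per-operation rates into read and write totals."""
    reads = rates.get("query", 0) + rates.get("getmore", 0)
    writes = rates.get("insert", 0) + rates.get("update", 0) + rates.get("delete", 0)
    return {"reads": reads, "writes": writes}
```

The same two-sample approach applies to network.bytesIn and network.bytesOut.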

Cursor

Category	Indicator Name	Monitoring Item	Reference Threshold
Cursor	Number of open cursors	metrics.cursor.open.total	None
Cursor	Number of timed-out cursors	metrics.cursor.timedOut	None
Cursor	Number of never-time-out cursors	metrics.cursor.open.noTimeout	None

MongoDB opens a cursor for each query; the cursor points to the query's result set.
When a connection drops abnormally, its cursor may not be closed, and the server keeps it until it times out: a cursor with no activity for 10 minutes (the default of the cursorTimeoutMillis server parameter) is destroyed. If the application does not close cursors promptly, a large backlog of cursors builds up and consumes a lot of memory.

Replica Set

Category	Indicator Name	Monitoring Item	Reference Threshold
Replication	Node status	members.stateStr	= PRIMARY/SECONDARY/ARBITER
Replication	Node replication lag	members.optimeDate[primary] - members.optimeDate[secondary]	< 60s
Replication	Replication window	getReplicationInfo().timeDiff	> 5h
Replication	Replication net value	oplog.window - oplog.lag	> 0

Replication lag is the gap between a secondary node and the primary node; the smaller the value, the better.
The replication window is the time interval between the newest and oldest records in the oplog collection.
The replication net value is the difference between the replication window and the replication lag.
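
The relationship between lag, window, and net value can be written out as a short calculation. This is a minimal sketch, assuming the optimeDate values of the primary and one secondary are available as datetime objects and the oplog window is given in seconds (as timeDiff is); replication_headroom is a hypothetical helper name.

```python
from datetime import datetime


def replication_headroom(primary_optime, secondary_optime, oplog_window_seconds):
    """Return replication lag and net value, both in seconds."""
    lag = (primary_optime - secondary_optime).total_seconds()
    # Clock skew can make the difference slightly negative; clamp to zero.
    lag = max(lag, 0.0)
    return {
        "lag_seconds": lag,
        # Net value: how much of the oplog window remains beyond the lag.
        "net_seconds": oplog_window_seconds - lag,
    }
```

A net value at or below zero means the secondary has fallen behind the oldest oplog entry and can no longer catch up by replaying the oplog.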

Reference

“MongoDB Advanced and Practical: Microservice Integration, Performance Optimization, and Architecture Management” (Tang Zhuozhang)
