
MongoDB Basics - Monitoring


Monitoring is critical to keeping a MongoDB deployment running smoothly. Tracking the right metrics lets you identify performance issues quickly and take corrective action. This post covers the key metrics to watch when running a MongoDB database, along with a reference threshold for each.

Capacity

The db.stats() command returns storage space information for each database.

Category | Indicator Name | Monitoring Item | Reference Threshold
Capacity | Index size | dbstats.indexSize | <= cacheSize
Capacity | Data size | dbstats.dataSize | <= 2 TB × 80%
Capacity | Storage size | dbstats.storageSize | <= diskSize × 60%
  • The configured cacheSize must be large enough to hold the indexes; otherwise performance suffers.
  • Disk space demand is roughly storageSize (the size of the WiredTiger-compressed data set) plus indexSize, and it should stay under a usage watermark of about 80% of the disk.
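
The capacity thresholds above can be sketched as a small check. This is a minimal illustration, not a definitive implementation: the function name `check_capacity` and its two size parameters are assumptions introduced here, and `dbstats` is assumed to be a dict shaped like the output of db.stats() (for example, as returned by a driver).

```python
def check_capacity(dbstats, cache_size_bytes, disk_size_bytes):
    """Evaluate the capacity thresholds from the table above.

    dbstats: dict with "indexSize" and "storageSize" in bytes.
    cache_size_bytes, disk_size_bytes: supplied by the caller; they are
    not part of the db.stats() output (hypothetical parameters).
    """
    index_size = dbstats["indexSize"]
    storage_size = dbstats["storageSize"]
    return {
        # Indexes should fit in the storage engine cache.
        "index_fits_in_cache": index_size <= cache_size_bytes,
        # Compressed data alone should stay under 60% of the disk.
        "storage_within_60pct": storage_size <= disk_size_bytes * 0.60,
        # Data plus indexes should stay under the ~80% watermark.
        "demand_within_80pct": storage_size + index_size <= disk_size_bytes * 0.80,
    }
```

Keeping the three checks separate makes it clear which threshold was crossed, which is usually more useful in an alert than a single pass/fail flag.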

Resource Usage

Connection Number

The db.serverStatus() command returns the complete set of database status indicators.

Category | Indicator Name | Monitoring Item | Reference Threshold
Connection | Available connections | connections.available | > 0
Connection | Current connections | connections.current | <= 8000
  • The maxIncomingConnections setting limits the number of incoming connections that a single process accepts; it defaults to 65536.

Concurrent Queue

Category | Indicator Name | Monitoring Item | Reference Threshold
Concurrency | Read tickets in use | wiredTiger.concurrentTransactions.read.out | < 128
Concurrency | Write tickets in use | wiredTiger.concurrentTransactions.write.out | < 128
Concurrency | Read tickets remaining | wiredTiger.concurrentTransactions.read.available | > 0
Concurrency | Write tickets remaining | wiredTiger.concurrentTransactions.write.available | > 0
  • The WiredTiger engine uses a ticket mechanism to manage concurrent threads. The number of tickets in use generally corresponds to the number of read and write operations running at the same time. When no tickets remain, new read and write requests are blocked and wait in a queue.
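
As a rough sketch of how ticket exhaustion could be detected, the function below reads the fields named in the table from a dict shaped like the db.serverStatus() output. The function name `ticket_status` is an assumption introduced here, and the total is derived as `out + available` rather than read from the server.

```python
def ticket_status(server_status):
    """Summarize WiredTiger read/write ticket usage.

    server_status: dict shaped like db.serverStatus() output, using the
    wiredTiger.concurrentTransactions layout from the table above.
    """
    tickets = server_status["wiredTiger"]["concurrentTransactions"]
    summary = {}
    for kind in ("read", "write"):
        in_use = tickets[kind]["out"]
        available = tickets[kind]["available"]
        total = in_use + available  # derived, not read from the server
        summary[kind] = {
            "in_use": in_use,
            "available": available,
            # No tickets left means new requests of this kind will queue.
            "exhausted": available == 0,
            "usage_ratio": in_use / total if total else 0.0,
        }
    return summary
```

Alerting on `usage_ratio` approaching 1.0 gives earlier warning than waiting for `exhausted` to become true.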

Memory and Cache Usage

Category | Indicator Name | Monitoring Item | Reference Threshold
Memory | Physical memory | mem.resident | < OS.TotalMemory × 85%
Memory | Virtual memory | mem.virtual | < OS.TotalMemory
Cache | Cache usage size | wiredTiger.cache."bytes currently in the cache" | < maximum × 95%
Cache | Maximum cache size | wiredTiger.cache."maximum bytes configured" | None
Cache | Dirty cache size | wiredTiger.cache."tracked dirty bytes in the cache" | < maximum × 20%
Cache | Pages read into cache | wiredTiger.cache."pages read into cache" | Observe fluctuations
Cache | Unmodified eviction pages | wiredTiger.cache."unmodified pages evicted" | Observe fluctuations
  • WiredTiger uses both the file system cache and its own storage engine cache (by default about half of physical memory). mem.resident is the physical memory occupied by the MongoDB process; poorly designed schemas and redundant indexes can lead to excessive memory usage.
  • Dirty cache is data that has been modified in the cache but not yet flushed to disk. When the proportion of dirty data exceeds 20%, cache eviction is under heavy pressure and the response time of business requests increases accordingly. If write pressure is too high and disk write performance is insufficient, the dirty ratio may stay high; improving disk performance or scaling horizontally can help.
  • For read-heavy workloads, reserve sufficient cache space.
  • Checkpoints and the TTL timer produce bursty writes to some extent. If disk performance is poor, I/O utilization spikes. If business latency jitters as a result, consider a shorter trigger interval to smooth out the writes.
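
The cache thresholds in the table are ratios against the configured maximum, so they need a small calculation. The sketch below assumes a dict shaped like the db.serverStatus() output; the function name `cache_ratios` is introduced here for illustration only.

```python
def cache_ratios(server_status):
    """Compute WiredTiger cache usage and dirty ratios.

    server_status: dict shaped like db.serverStatus() output.
    """
    cache = server_status["wiredTiger"]["cache"]
    maximum = cache["maximum bytes configured"]
    used = cache["bytes currently in the cache"]
    dirty = cache["tracked dirty bytes in the cache"]

    used_ratio = used / maximum
    dirty_ratio = dirty / maximum
    return {
        "used_ratio": used_ratio,
        "dirty_ratio": dirty_ratio,
        # Reference thresholds from the table: 95% used, 20% dirty.
        "used_ok": used_ratio < 0.95,
        "dirty_ok": dirty_ratio < 0.20,
    }
```

A dirty ratio that stays above the threshold across several samples is a stronger signal than a single spike, so in practice the result would typically be tracked over time.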

Throughput

Access Class Indicators

Category | Indicator Name | Monitoring Item | Reference Threshold
Access | Insert | opcounters.insert (growth rate) | Merged into the write total
Access | Query | opcounters.query (growth rate) | Merged into the read total
Access | Update | opcounters.update (growth rate) | Merged into the write total
Access | Delete | opcounters.delete (growth rate) | Merged into the write total
Access | Getmore | opcounters.getmore (growth rate) | Merged into the read total
Access | Command | opcounters.command (growth rate) | <= 10000
Traffic | netIn | network.bytesIn (growth rate) | <= 100 MB
Traffic | netOut | network.bytesOut (growth rate) | <= 100 MB
Queue | Active read clients | globalLock.activeClients.readers | < 128
Queue | Active write clients | globalLock.activeClients.writers | < 128
Queue | Blocked read clients | globalLock.currentQueue.readers | < 32
Queue | Blocked write clients | globalLock.currentQueue.writers | < 32
  • opcounters are cumulative counters of request operations. The growth rate of each operation type shows the current access throughput.
  • Monitoring read and write request rates helps you spot potential load bottlenecks early and expand capacity before problems occur.
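
Because opcounters are cumulative, a growth rate has to be derived from two samples. The following sketch assumes two `opcounters` dicts taken `interval_seconds` apart; the function name `op_rates` and the `reads`/`writes` keys are assumptions introduced here. It merges read and write operations as the table suggests.

```python
def op_rates(prev, curr, interval_seconds):
    """Per-second growth rate of each opcounter between two samples.

    prev, curr: dicts shaped like serverStatus().opcounters.
    Note: counters reset when the server restarts, which would show up
    here as negative rates; this sketch does not handle that case.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")

    rates = {
        op: (curr[op] - prev[op]) / interval_seconds
        for op in curr
        if op in prev
    }
    # Merge operations into read and write totals, as in the table.
    rates["reads"] = sum(rates.get(op, 0.0) for op in ("query", "getmore"))
    rates["writes"] = sum(rates.get(op, 0.0) for op in ("insert", "update", "delete"))
    return rates
```

The same two-sample approach applies to network.bytesIn and network.bytesOut, which are also cumulative.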

Cursor

Category | Indicator Name | Monitoring Item | Reference Threshold
Cursor | Number of open cursors | metrics.cursor.open.total | None
Cursor | Number of timed-out cursors | metrics.cursor.timedOut | None
Cursor | Number of never-time-out cursors | metrics.cursor.open.noTimeout | None
  • MongoDB opens a cursor for each query and points it at the query's result set.
  • When a connection drops abnormally, the cursor may not be closed, and the database falls back on an idle timeout: if the cursor sees no activity for 10 minutes (cursor.timeOut), it is destroyed. If the application does not close cursors promptly, they pile up and consume a large amount of memory.

Replica Set

Category | Indicator Name | Monitoring Item | Reference Threshold
Replication | Node status | members.stateStr | = PRIMARY/SECONDARY/ARBITER
Replication | Node replication lag | members.optimeDate[primary] - members.optimeDate[secondary] | < 60s
Replication | Replication window | getReplicationInfo().timeDiff | > 5h
Replication | Replication net value | oplog.window - oplog.lag (replication window minus replication lag) | > 0
  • The replication lag is how far a secondary node is behind the primary node. The smaller the value, the better.
  • The replication window is the time span between the oldest and newest records in the oplog collection.
  • The replication net value is the replication window minus the replication lag. If it drops to zero or below, the secondary has fallen further behind than the oplog covers.
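
The lag and net value described above can be computed from the member list of a replica set status document. This is a minimal sketch under stated assumptions: `members` is a list of dicts with `name`, `stateStr`, and `optimeDate` (a datetime), and `oplog_window_seconds` is supplied by the caller, for example from getReplicationInfo().timeDiff. The function names are introduced here for illustration.

```python
from datetime import datetime


def lag_seconds(primary_optime: datetime, secondary_optime: datetime) -> float:
    """Seconds by which a secondary's last applied op trails the primary's."""
    return (primary_optime - secondary_optime).total_seconds()


def replication_health(members, oplog_window_seconds):
    """Report lag and net value (window minus lag) for each secondary."""
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    report = {}
    for member in members:
        if member["stateStr"] != "SECONDARY":
            continue  # arbiters and other states hold no data to compare
        lag = lag_seconds(primary["optimeDate"], member["optimeDate"])
        net = oplog_window_seconds - lag
        report[member["name"]] = {
            "lag_seconds": lag,
            "net_seconds": net,
            "lag_ok": lag < 60,   # reference threshold from the table
            "net_ok": net > 0,
        }
    return report
```

Note that `next(...)` raises StopIteration when no primary is present, for example during an election; a production check would need to treat that case as an alert in its own right.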

Reference

  • “MongoDB Advanced and Practical: Microservice Integration, Performance Optimization, and Architecture Management” (Tang Zhuozhang)