Last time we mentioned two fundamental principles while monitoring any object:
1. The monitor should collect as much important information as possible that will allow to accurately evaluate the health state of an object.
2. The monitor should have little to no effect on the activity of the object.
Normally, it is necessary to add couple lines in your existing Node.js server code to activate the monitor-plugin:
var monitor = require('monitor');// insert monitor module-plugin
var server = … // the definition of current Node.js server
monitor.Monitor(server); //add server to monitor
Now the monitor will begin collecting and measuring data. The monitor-plugin has an embedded simple HTTP server that sends accumulated data by request and should correspond (in current implementation) to the following pattern
10010 – the listen port of monitor-plugin (configurable)
‘node_monitor’ – the pathname keyword
‘action=getdata’ – command for getting collected data
‘access_code’ – the specially generated access code that changes for every session
Please notice that monitis-plugin server (in current implementation) listens the localhost only. This and usage of the security access code for every session gives enough security while monitoring. More detailed information can be found along with implemented code in the github repository.
Server monitoring metrics
There are a largely standard set of metrics which can be used to monitor the underlying health of any server.
* CPU Usage describes the level of utilisation of the system CPU(s) and is usually broken down into three states.
o IO Wait – indicates the proportion of CPU cycles spent waiting for IO (disk or network) events. If you experience large IO Wait proportions, it can indicate that your disks are causing a performance bottleneck.
o System – indicates the proportion of CPU cycles spent performing kernel-level processing. Generally you will find only a small proportion of your CPU cycles are spent on system tasks, Hence if you see spikes it could indicate a problem.
o User – indicates the proportion of CPU cycles spent performing user instigated processing. This is where you should see the bulk of your CPU cycles consumed; it includes activities such as web serving, application execution, and every other process not owned by the kernel.
o Idle – indicates the spare CPU capacity you have – all the cycles where the CPU is, quite literally, doing nothing.
* Load Average is a metric that indicates the level of load that a server is under at a given point in time. Usually evaluated as number of requests per second.
* RAM usage by server is broken down usually into the following parts.
o Free – the amount of unallocated RAM available. Linux systems tend to keep this as low as possible; and do not free up the system’s physical RAM until it is requested by another process.
o Inactive – RAM that is in-use for buffers and page caching, but hasn’t been used recently so will be reclaimed first for use by a running process.
o Active – RAM that has been used recently and will not be reclaimed unless we have insufficient Inactive RAM to claim from. In Linux systems this is generally the one to keep an eye on. Sudden, rapid increases signal a memory hungry process that will soon cause VM swapping to occur.
* The server uptime is a metric showing the elapsed time since the last reboot. Non-linear behavior of the server uptime line indicates that the server was rebooted somehow.
* Throughput – the amount of data traffic passing through the servers’ network interface is fundamentally important. It is usually broken down into inbound and outbound throughput and normally measured as average values for some period by kbit or kbyte per sec.
* Server Response time is defined as the duration from receiving a request to sending a response. Normally, it should not exceed reasonable timeout (usually this depends on the complexity of processing) but should be as little as possible. Usually, the average and peak response time is evaluated for some time period.
* The count of successfully processed requests is evaluated as the percentage of responses with 2xx status codes (success) to requests during observing time. This value should be as close as possible to 100%.
We have used part of these metrics and added some specific statistics that are important for our task (e.g. client platform, detailed info for response codes, etc.)
The results below were obtained on a Node.js server equipped with the monitor described above on a Debian6-x64. The server listens on HTTP (81) and HTTPS (443) ports and does not have a large load.
The data can be shown in graphical view too.