Today I want to share some ideas about monitoring Web Oriented Storage Systems (WOSS), using Riak as an example. I won’t discuss what is meant by “web oriented” or “storage system”; I assume you are already familiar with these concepts. There are several such systems, but today I will deal only with Riak.
So, what is Riak?
As mentioned above, Riak is a WOSS.
It provides the main features of a storage system, such as:
- Data Fetching
- Secondary Indexes
It comes in two flavors: Open Source and Enterprise.
A more in-depth description and full details on Riak can be found in our other post and, of course, on Basho’s official wiki page.
Now, let’s talk about monitoring.
In my opinion, if a system exists, no matter what system it is, it needs to be monitored.
It’s very important to be informed about your system’s health at any given moment; otherwise you risk losing your data, money and possibly even your life (e.g. if your system controls a nuclear power station or a ballistic missile launch) 😉
In the computer world, many parameters influence the normal operation of a system. In this discussion I will consider only storage systems, and only Riak.
Let’s assume we have a Riak cluster and it’s working without any trouble.
But the following questions arise: How is the stability of Riak maintained? What’s going on inside the cluster?
To answer these questions without any extra tools from third-party developers, we need the riak-admin tool, which ships with the Riak binary utilities.
It’s easy to use.
riak-admin does a lot of things, but here we only need to look at the status action.
Below are the most important (IMHO) metrics that can be monitored via the riak-admin tool.
Just type riak-admin status <Return> on the command line, or run something like watch -n <seconds> riak-admin status.
FSM_Time Counters represent the amount of time in microseconds required to traverse the GET or PUT Finite State Machine code, offering a picture of general node health. From your application’s perspective, FSM_Time effectively represents experienced latency. Mean, Median, and 95th-, 99th-, and 100th-percentile (Max) counters are displayed. These are one-minute stats.
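Since FSM times are reported in microseconds, it can help to convert them to milliseconds before comparing them with a latency budget. The following is a minimal sketch, not part of Riak itself: the function name slow_fsm_stats, the threshold and the sample values are all hypothetical, and the stat names are assumed to follow the node_get_fsm_time_* pattern seen in the status output.

```python
# Hypothetical sample values in microseconds; not taken from a real cluster.
sample_stats = {
    "node_get_fsm_time_mean": 1500,
    "node_get_fsm_time_95": 4200,
    "node_get_fsm_time_100": 250000,
}


def slow_fsm_stats(stats, threshold_ms):
    """Return {name: milliseconds} for FSM time stats above threshold_ms."""
    slow = {}
    for name, micros in stats.items():
        if "_fsm_time_" not in name:
            continue  # ignore counters that are not FSM timings
        millis = micros / 1000.0  # FSM times are reported in microseconds
        if millis > threshold_ms:
            slow[name] = millis
    return slow


print(slow_fsm_stats(sample_stats, threshold_ms=100))
```

With these made-up numbers only the 100th-percentile (max) value exceeds the 100 ms threshold; a sensible threshold depends entirely on your application.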
GET_FSM_Sibling Stats offer a count of the number of siblings encountered by this node on the occasion of a GET request. These are one-minute stats.
Total Counters are data points that represent the total number of times a particular activity has occurred since this node was started.
Sample One-minute Counters
One-minute Counters are data points delineating the number of times a particular activity has occurred within the last minute on this particular node.
node_puts – Number of PUTs coordinated by this node, including PUTs to non-local vnodes
node_gets – Number of GETs coordinated by this node, including GETs to non-local vnodes
vnode_puts – Number of PUTs coordinated by vnodes local to this node
vnode_gets – Number of GETs coordinated by vnodes local to this node
read_repairs – Number of Read Repairs this node has coordinated
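To feed counters like these into your own checks, the text printed by the status action can be turned into a dictionary. This is only a sketch under the assumption that every stat is printed as "name : value" on its own line; parse_status is a hypothetical helper, and lines containing '<<' (version strings) are skipped.

```python
def parse_status(text):
    """Parse 'name : value' lines from status output into a dict."""
    stats = {}
    for line in text.splitlines():
        # Skip version lines (they contain '<<') and anything without ' : '.
        if "<<" in line or " : " not in line:
            continue
        name, _, value = line.partition(" : ")
        value = value.strip()
        try:
            stats[name.strip()] = int(value)
        except ValueError:
            stats[name.strip()] = value  # keep non-integer values as text
    return stats


# Sample lines in the assumed format; the header line is ignored.
sample = """1-minute stats for 'riak@192.168.10.205'
vnode_gets : 0
vnode_puts : 0
vnode_index_reads : 0
"""
stats = parse_status(sample)
print(stats["vnode_gets"], stats["vnode_puts"])
```

In practice the text would come from running the status command yourself, for example through Python's subprocess module, instead of from the inline sample.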
Important metrics that can’t be monitored via riak-admin status
File system cache
An active Riak node will have most of its free RAM consumed by the file system cache, which on Linux can be found as the “Cached” line in /proc/meminfo or the “cached” column when running the free -m command. A healthy size for this metric is 20-30% of available RAM.
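The cache share can be computed from the contents of /proc/meminfo. A rough sketch follows, assuming that "available RAM" means the MemTotal line (that interpretation is mine) and using a hypothetical helper name and invented sample numbers.

```python
def cached_fraction(meminfo_text):
    """Return Cached / MemTotal from /proc/meminfo-style text."""
    values = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])  # value in kB
    # Assumption: "available RAM" is taken to be MemTotal.
    return values["Cached"] / values["MemTotal"]


# Invented numbers for illustration only.
sample = """MemTotal:        8000000 kB
MemFree:         1000000 kB
Cached:          2000000 kB
SwapCached:            0 kB
"""
print(f"{cached_fraction(sample):.0%}")
```

On a real Linux host you would pass in open("/proc/meminfo").read() instead of the sample text.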
Virtual memory size of the Riak process
Also known as VSZ. When this metric approaches the amount of available RAM, the Riak node may be unable to allocate more memory (depending on whether you have swap enabled). On Linux you can read this metric, measured in KB, from the VSZ column of the ps aux output.
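Picking the VSZ column out of ps aux output can be sketched as below. The assumptions: VSZ is the fifth column of ps aux, and the Riak process can be recognised by "beam" in its command line (Riak runs inside the Erlang VM). The helper name, the sample rows and the path in them are hypothetical.

```python
def riak_vsz_kb(ps_aux_text, command_hint="beam"):
    """Return VSZ values (KB) of processes whose command contains command_hint."""
    sizes = []
    for line in ps_aux_text.splitlines()[1:]:  # skip the header row
        # COMMAND is the 11th column and may itself contain spaces.
        fields = line.split(None, 10)
        if len(fields) == 11 and command_hint in fields[10]:
            sizes.append(int(fields[4]))  # VSZ is the 5th column
    return sizes


# Invented sample rows for illustration only.
sample = """USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.1  24000  2000 ?        Ss   10:00   0:01 /sbin/init
riak      2345  3.5 12.0 1500000 980000 ?      Ssl  10:05   2:11 /usr/lib/riak/erts-5.9.1/bin/beam.smp -K true
"""
print(riak_vsz_kb(sample))
```

The returned values can then be compared with the total RAM of the machine to see how close the node is to its limit.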
More useful arguments for the riak-admin tool:
cluster_info – Outputs system information from a Riak cluster. This command will collect information from all nodes or a subset of nodes and output the data to a single text file.
The following information is collected:
· Current time and date
· VM statistics
· erlang:memory() summary
· Top 50 process memory hogs
· Registered process names
· Registered process name via regs()
· Non-zero mailbox sizes
· Timer status
· ETS summary
· Nodes summary
· net_kernel summary
· inet_db summary
· Alarm summary
· Global summary
· erlang:system_info() summary
· Loaded modules
· Riak Core config files
· Riak Core vnode modules
· Riak Core ring
· Riak Core latest ring file
· Riak Core active partitions
· Riak KV status
· Riak KV ringready
· Riak KV transfers
member_status – Prints the current status of all cluster members.
ring_status – Outputs the current claimant, its status, ringready, pending ownership handoffs, and a list of unreachable nodes.
vnode-status – Outputs the status of all vnodes that are running on the local node.
top – Top provides information about what the Erlang processes inside of Riak are doing. Top reports process reductions (an indicator of CPU utilization), memory used and message queue sizes.
With grep -v '<<' we skip the information about library and application versions, which does not interest us right now.
root@Camelot:/home/freeman# riak-admin status | grep -v '<<'
Attempting to restart script through sudo -u riak
1-minute stats for 'riak@192.168.10.205'
vnode_gets : 0
vnode_puts : 0
vnode_index_reads : 0
… <long list>
root@Camelot:/home/freeman# riak-admin member_status
Attempting to restart script through sudo -u riak
============= Membership =============
Status Ring Pending Node
valid 100.0% -- 'riak@192.168.10.205'
Valid:1 / Leaving:0 / Exiting:0 / Joining:0 / Down:0
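The summary line at the end of the membership output is easy to check automatically. Here is a small sketch, assuming the line keeps the "Name:count / Name:count" shape shown above; parse_member_summary is a hypothetical helper name.

```python
import re


def parse_member_summary(line):
    """Turn 'Valid:1 / Leaving:0 / ...' into {'Valid': 1, 'Leaving': 0, ...}."""
    return {name: int(count) for name, count in re.findall(r"(\w+):(\d+)", line)}


summary = parse_member_summary("Valid:1 / Leaving:0 / Exiting:0 / Joining:0 / Down:0")
# Any non-zero count other than Valid deserves a closer look.
unhealthy = {k: v for k, v in summary.items() if k != "Valid" and v > 0}
print(summary["Valid"], unhealthy)
```

An empty unhealthy dictionary means every member is in the valid state; a non-zero Down count is the most obvious thing to alert on.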
I hope you have found this information useful. As I said previously, full details on Riak can be found at wiki.basho.com.