Gosh, there are so many NoSQL database storage tools out there. It’s almost as bad as brands of sports drinks or water. Have you noticed that some mega-supermarkets have whole aisles dedicated to what we drink?
As an IT system administrator or manager, you may find it very hard to compare the various NoSQL tools. It involves considering your specific computing needs, matching them to what is out there, working out what’s right for your organization and then making the right decision!
That’s why Monitis, the first hosted all-in-one network and systems performance monitoring service for sysadmins, is publishing a series of blogs that are meant to offer a comprehensive guide to NoSQL technology and brands. We want to help you make the right choice that fits the particular needs of your company.
Why should we care, you may ask yourself? Increasingly, our clients, who depend on our ability to monitor servers and networks and a host of other key metrics 24/7 from the cloud, want our advice, too, on what kind of scalable and robust database technology to use. So, we’re obliging!
Here, in a series of blog posts, we’ll present research on popular NoSQL data storage tools, which are generally intended to store unprecedentedly large amounts of data, offer flexible horizontal scalability and provide blazing-fast query processing. We’ll also get down to the nitty-gritty and compare several well-known NoSQL databases, such as Cassandra, MongoDB, CouchDB, Redis, Riak, HBase and others.
In this first post, let’s discuss the reason why NoSQL technology is important.
Generally, NoSQL isn’t relational, and it is designed for distributed data stores with very large-scale data needs (e.g. Facebook and Twitter accumulate terabytes of data every day for millions of their users). There is no fixed schema and there are no joins. Relational database management systems (RDBMSs), meanwhile, “scale up” by getting faster and faster hardware and adding memory. NoSQL, on the other hand, can take advantage of “scaling out” – which means spreading the load over many commodity systems.
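To make “scaling out” a little more concrete, here is a minimal sketch of one common way to spread keys over several machines: hash each key and use the result to pick a node. The node names and helper functions are hypothetical, invented for this illustration, and real NoSQL products typically use more refined schemes (such as consistent hashing) so that adding a node moves fewer keys.

```python
import hashlib

# Hypothetical node names standing in for commodity servers.
NODES = ["node-a", "node-b", "node-c"]


def node_for(key, nodes=NODES):
    """Pick the node responsible for a key using simple modulo sharding."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def distribute(keys, nodes=NODES):
    """Group keys by the node that would store them."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[node_for(key, nodes)].append(key)
    return placement


# Spread 1,000 illustrative user keys over the three nodes.
placement = distribute(f"user:{i}" for i in range(1000))
for node, keys in placement.items():
    print(node, len(keys))
```

Each node ends up holding only a share of the keys, so capacity grows by adding machines rather than by buying a bigger one.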
The term NoSQL was coined in 1998, and while many think NoSQL is a derogatory term created to poke fun at SQL, in reality it means “Not Only SQL” rather than “No SQL at all.” The idea is that both technologies (NoSQL and RDBMSs) can co-exist and each has its place. Companies like Facebook, Twitter, Digg, Amazon, LinkedIn and Google all use NoSQL in some way, so the term has been in the news often over the past few years.
So what’s wrong with RDBMSs? Well, nothing, really. They just have their limitations. Consider these three problems:
RDBMSs use a table-based normalization approach to data, and that’s a limited model. Certain data structures cannot be represented without tampering with the data, programs, or both.
They allow versioning or activities like Create, Read, Update and Delete. Under a strict view, updates to a database should never be allowed, because they destroy information. Rather, when data changes, the database should just add another record and duly note the previous value for that record.
Performance falls off as RDBMSs normalize data. The reason: normalization requires more tables, table joins, keys and indexes, and thus more internal database operations to implement queries. Pretty soon, the database starts to grow into the terabytes, and that’s when things slow down.
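The “never update, only add” idea from the second problem can be sketched in a few lines. This is an illustrative toy, not how any particular database implements versioning; the `AppendOnlyStore` class and its method names are assumptions made up for this example.

```python
from collections import defaultdict


class AppendOnlyStore:
    """Toy store that never overwrites: every write appends a new version."""

    def __init__(self):
        self._versions = defaultdict(list)

    def put(self, key, value):
        """Append a new version and return its 1-based version number."""
        self._versions[key].append(value)
        return len(self._versions[key])

    def get(self, key, version=None):
        """Return the latest value, or a specific earlier version."""
        history = self._versions.get(key)
        if not history:
            raise KeyError(key)
        if version is None:
            return history[-1]
        if not 1 <= version <= len(history):
            raise KeyError((key, version))
        return history[version - 1]

    def history(self, key):
        """Return every value ever written for the key, oldest first."""
        return list(self._versions.get(key, []))


store = AppendOnlyStore()
store.put("order:7", {"status": "received"})
store.put("order:7", {"status": "shipped"})
print(store.get("order:7"))      # latest version
print(store.history("order:7"))  # nothing was destroyed
```

The trade-off is storage: keeping every previous value costs space, which is one reason such designs tend to pair with cheap, horizontally scaled hardware.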
1. Key-Value Stores. The main idea here is using a hash table in which a unique key points to a particular item of data. The key-value model is the simplest and easiest to implement. But it is inefficient when you are only interested in querying or updating part of a value, among other disadvantages.
| Examples | Tokyo Cabinet/Tyrant, Redis, Voldemort, Oracle BDB |
| Typical applications | Content caching (focus on scaling to huge amounts of data, designed to handle massive load), logging, etc. |
| Data model | Collection of key-value pairs |
| Weaknesses | Stored data has no schema |
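A minimal sketch of the key-value idea, and of the partial-update weakness mentioned above, might look like this. A plain Python dict plays the role of the hash table, and values are stored as opaque JSON strings; the `put`/`get` helpers and the sample key are hypothetical and do not correspond to any specific product’s API.

```python
import json

# A plain dict stands in for the store's hash table.
store = {}


def put(key, value):
    """Serialize the whole value and store it under the key."""
    store[key] = json.dumps(value)


def get(key):
    """Fetch and deserialize the whole value for the key."""
    return json.loads(store[key])


put("user:42", {"name": "Ada", "email": "ada@example.com", "visits": 1})

# The store only sees an opaque blob, so changing one field means
# reading the entire value, modifying it, and writing it all back.
profile = get("user:42")
profile["visits"] += 1
put("user:42", profile)

print(get("user:42")["visits"])
```

Lookups by key are a single hash-table access, but there is no way to ask the store for “all users with more than ten visits” without scanning and decoding every value.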
2. Column Family Stores. These were created to store and process very large amounts of data distributed over many machines. There are still keys but they point to multiple columns. The columns are arranged by column family.
| Examples | Cassandra, HBase |
| Typical applications | Distributed file systems |
| Data model | Columns → column families |
| Strengths | Fast lookups, good distributed storage of data |
| Weaknesses | Very low-level API |
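The “keys point to multiple columns, arranged by column family” model can be pictured as nested maps: row key, then column family, then column name. The sketch below is only a mental model built on in-memory dicts; the `put` and `get_family` helpers and the sample data are assumptions for illustration, not the API of Cassandra or HBase.

```python
from collections import defaultdict

# row key -> column family -> column name -> value
table = defaultdict(lambda: defaultdict(dict))


def put(row_key, family, column, value):
    """Write one column value inside a column family of a row."""
    table[row_key][family][column] = value


def get_family(row_key, family):
    """Read all columns of one family for a row (empty if absent)."""
    return dict(table.get(row_key, {}).get(family, {}))


put("user:42", "profile", "name", "Ada")
put("user:42", "profile", "email", "ada@example.com")
put("user:42", "activity", "last_login", "2011-05-01")

# Rows do not need to share the same columns.
put("user:43", "profile", "name", "Bob")

print(get_family("user:42", "profile"))
print(get_family("user:43", "activity"))
```

Grouping columns into families lets a store keep related columns together, so reading one family does not require touching the others.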
3. Document Databases. These were inspired by Lotus Notes and are similar to key-value stores. The model is basically versioned documents that are collections of other key-value collections. The semi-structured documents are stored in formats like JSON. Document databases are essentially the next level of key-value stores, allowing nested values associated with each key, and they support querying more efficiently than plain key-value stores.
| Typical applications | Web applications (similar to key-value stores, but the DB knows what the value is) |
| Data model | Collections of key-value collections |
| Strengths | Tolerant of incomplete data |
| Weaknesses | Query performance, no standard query syntax |
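Because a document database “knows what the value is”, it can query inside nested documents and tolerate documents that lack some fields. The following sketch imitates that with a list of dicts and a dotted-path lookup; the sample documents and the `lookup`/`find` helpers are hypothetical, and real products each have their own query syntax.

```python
# Illustrative documents; note the second one has no "tags" field.
documents = [
    {"_id": 1, "title": "Intro to NoSQL", "author": {"name": "Ada"}, "tags": ["nosql", "intro"]},
    {"_id": 2, "title": "Scaling out", "author": {"name": "Bob"}},
    {"_id": 3, "title": "Graph basics", "author": {"name": "Ada"}, "tags": ["graph"]},
]

_MISSING = object()  # sentinel for "field not present"


def lookup(doc, path):
    """Follow a dotted path such as 'author.name' into a nested document."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return _MISSING
        current = current[part]
    return current


def find(docs, path, expected):
    """Return documents whose value at the path equals the expected value."""
    return [doc for doc in docs if lookup(doc, path) == expected]


print([doc["_id"] for doc in find(documents, "author.name", "Ada")])
```

Documents with missing fields simply do not match, rather than causing an error, which is what “tolerant of incomplete data” amounts to in practice.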
4. Graph Databases. Instead of tables of rows and columns and the rigid structure of SQL, a flexible graph model is used which, again, can scale across multiple machines. NoSQL databases generally do not provide a high-level declarative query language like SQL, in order to avoid processing overhead. Rather, querying these databases is data-model specific. Many of the NoSQL platforms allow for RESTful interfaces to the data, while others offer query APIs.
| Examples | Neo4J, InfoGrid, Infinite Graph |
| Typical applications | Social networking, recommendations (focus on modeling the structure of data – interconnectivity) |
| Data model | “Property Graph” – Nodes |
| Strengths | Graph algorithms, e.g. shortest path, connectedness, n-degree relationships, etc. |
| Weaknesses | Has to traverse the entire graph to achieve a definitive answer. Not easy to cluster. |
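As a rough illustration of the graph model and of the shortest-path strength listed above, the sketch below stores nodes with properties plus “follows” relationships, and finds a shortest chain between two people with a breadth-first search. The sample people, the `follows` mapping and the `shortest_path` function are invented for this example; graph databases expose their own traversal APIs.

```python
from collections import deque

# Nodes carry properties; relationships are kept as adjacency lists.
nodes = {
    "alice": {"city": "Boston"},
    "bob": {"city": "Austin"},
    "carol": {"city": "Boston"},
    "dave": {"city": "Denver"},
    "erin": {"city": "Miami"},
}
follows = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": ["erin"],
    "erin": [],
}


def shortest_path(graph, start, goal):
    """Breadth-first search: return a shortest path as a list, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


print(shortest_path(follows, "alice", "erin"))
```

The search follows relationships directly from node to node, which is the kind of question (“how is Alice connected to Erin?”) that would need repeated self-joins in a relational schema.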
Generally, the best places to use NoSQL technology are where the data model is simple; where flexibility is more important than strict control over defined data structures; where high performance is a must; where strict data consistency is not required; and where it is easy to map complex values to known keys.
Logging/Archiving. Log-mining tools are handy because they can access logs across servers, relate them and analyze them.
Social Computing Insight. Many enterprises today have provided their users with the ability to do social computing through message forums, blogs etc.
External Data Feed Integration. Many companies need to integrate data coming from business partners. Even if the two parties conduct numerous discussions and negotiations, enterprises have little control over the format of the data coming to them. Also, there are many situations where those formats change very frequently – based on the changes in the business needs of partners.
Front-end order processing systems. Today, the volume of orders, applications and service requests flowing through different channels to retailers, bankers, insurance providers, entertainment service providers, logistics providers, etc. is enormous. These requests need to be captured without any interruption whenever an end user makes a transaction from anywhere in the world. Afterward, a reconciliation system typically writes them to back-end systems and updates the end user on his/her order status.
Enterprise Content Management Service. Content management is now used across companies’ different functional groups, for instance, HR or Sales. The challenge is bringing together different groups using different metadata structures in a common content management service.
Real-time stats/analytics. Sometimes it is necessary to use the database as a way to track real-time performance metrics for websites (page views, unique visits, etc.). Tools like Google Analytics are great but not real-time, so sometimes it is useful to build a secondary system that provides basic real-time stats. Other alternatives, such as 24/7 monitoring of web traffic, are a good way to go, too.
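For the real-time stats case, the data model is about as simple as it gets: counters keyed by page. The sketch below keeps them in memory purely for illustration; the `record_hit` and `stats` helpers and the sample pages are hypothetical, and a production setup would keep these counters in a NoSQL store so that many web servers can update them concurrently.

```python
from collections import Counter, defaultdict

page_views = Counter()               # page -> total hits
unique_visitors = defaultdict(set)   # page -> set of visitor ids


def record_hit(page, visitor_id):
    """Record one page view as it happens."""
    page_views[page] += 1
    unique_visitors[page].add(visitor_id)


def stats(page):
    """Return the current counters for a page."""
    return {
        "page_views": page_views[page],
        "unique_visits": len(unique_visitors.get(page, set())),
    }


record_hit("/pricing", "visitor-1")
record_hit("/pricing", "visitor-2")
record_hit("/pricing", "visitor-1")

print(stats("/pricing"))
```

Each hit touches only one or two keys, which is why this workload maps so easily onto key-value style storage.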
Here’s a short summary that might help you make your decision:
In our next series of posts, Monitis will walk you through seven popular NoSQL database tools and discuss their merits and – from our point of view – drawbacks. Stay tuned!