NoSQL Databases – A Look at Apache Cassandra

In a series of blogs, Monitis has begun providing guidance on picking the right NoSQL database storage tool that meets your company’s needs. In our previous blog, we offered a comprehensive overview of why NoSQL technology is important and how it compares with Relational Database Management Systems (RDBMSs).

Now, we’d like to get a bit more specific and review individual products. We hope this information makes it easier to choose among NoSQL databases such as Apache Cassandra, MongoDB, CouchDB, Redis, Riak, HBase and others. After all, you want to make sure that your data is being stored safely. Aren’t there enough worries out there about data security, whether the data is stored in the cloud or behind your internal, private firewall?

NoSQL Brand Overview

There are many NoSQL storage tools and solutions designed to store very large amounts of data. Currently, there are more than 100 popular NoSQL solutions, and most of them are open source and free of charge. Across this series of blog posts, we’ll discuss the most popular and widespread NoSQL tools. In this post, we’ll concentrate on Apache Cassandra.

Apache Cassandra

Apache Cassandra, like so many other NoSQL tools, is an open-source distributed database system. It was originally created at Facebook in 2008, drawing heavily on ideas from Google (BigTable) and Amazon (Dynamo).

The result is that Apache Cassandra has an extremely scalable and fault-tolerant data infrastructure. Cassandra solves both real-time and analytical big data problems, from write-intensive workloads, to sub-millisecond caching layer reads, to analytical workloads involving petabytes of data using MapReduce.

Cassandra is a fully distributed column-oriented data store that supports MapReduce through integration with Hadoop. All the nodes in the cluster play the same role. Data, both existing and new, is distributed automatically among the nodes. Here’s some quick reference material about Apache Cassandra:

  • Orientation: Columnar
  • Created: Cassandra was created at Facebook in 2008, based upon work at Google (BigTable) and Amazon (Dynamo), and was later donated to Apache, where it became a top-level project.
  • Implementation language: Java
  • Distributed: Yes – with the ability to span multiple machines, multiple racks, and multiple data centers.
  • Storage: Decentralized Structured Storage System (DSSS).
  • Schema: Cassandra has a very flexible schema. Following the model originally described in the Google “BigTable” paper, Cassandra combines the organization of a traditional RDBMS table layout with the flexibility of having no stringent structure requirements. This allows you to store your data as you need to, without a performance penalty for changes. That’s important because your storage needs evolve over time.
  • CAP: Cassandra sits mostly on the Availability and Partition tolerance (AP) side of the CAP theorem. The tradeoff between consistency and latency is tunable in Cassandra.
  • Client: You can interact with Cassandra via the embedded CLI, a RESTful service gateway (cassui), a Java GUI client (cassandra-gui), a JDBC-like connection over Thrift (jassandra), and others.
  • Open source: Yes (Apache License)
  • Production use: Cassandra has been used at Facebook (up to Nov, 2010), Digg, Twitter (analytics), Rackspace (cloud service, monitoring, logging), Mahalo (primary near-time data store), Reddit (persistent cache), Monitis (monitoring service), Cisco and more.
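To make the “tunable” consistency tradeoff from the CAP entry above a little more concrete, here is a minimal sketch in Python. It is not Cassandra client code. It only illustrates the general replica-overlap rule (reads see the latest write when R + W > N), and the function names and the replication factor of 3 are our own assumptions for illustration.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def reads_see_latest_write(n: int, w: int, r: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    return r + w > n


n = 3  # replicas per record (assumed replication factor)

print(quorum(n))                             # replicas needed for a majority
print(reads_see_latest_write(n, w=2, r=2))   # quorum writes and quorum reads
print(reads_see_latest_write(n, w=1, r=1))   # lowest latency, weakest guarantee
```

Waiting for fewer replicas lowers latency but gives up the overlap guarantee, which is the tradeoff being tuned.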

Cassandra can integrate with Hadoop to provide a single solution for both analytics and real-time needs. Hadoop MapReduce offers the ability to run massive analytical queries against terabytes of data. Cassandra offers caching on each of its nodes. Thanks to Cassandra’s scalability, you can incrementally add nodes to the cluster to keep as much of your data in memory as you need. Thus, there’s no need for a separate caching layer.

The Cassandra ring is composed of identical nodes, and data is automatically replicated to multiple nodes. Any node can therefore be added or removed easily, without manual migration of data. And because every node within the cluster is identical, there is no single point of failure and no bottleneck.
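The ring idea can be sketched in a few lines of Python. This is a simplified illustration, not Cassandra’s actual implementation: the `Ring` class, the node names, and the choice of one token per node are our assumptions, and MD5 is used only as a convenient stable hash. Each key is hashed onto the ring and stored on the next few nodes clockwise.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical ring of identical nodes with one token per node."""

    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        self.tokens = sorted((token(n), n) for n in nodes)

    def add(self, node):
        bisect.insort(self.tokens, (token(node), node))

    def remove(self, node):
        self.tokens.remove((token(node), node))

    def owners(self, key):
        """Walk clockwise from the key's token and collect replica nodes."""
        positions = [t for t, _ in self.tokens]
        start = bisect.bisect_right(positions, token(key))
        count = min(self.replicas, len(self.tokens))
        return [self.tokens[(start + i) % len(self.tokens)][1]
                for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replicas=3)
before = ring.owners("user:42")
ring.remove(before[0])           # simulate losing the first replica
after = ring.owners("user:42")
print(before, after)
```

When a node leaves, the remaining replicas keep their place and the next node clockwise joins them, so only the data that lived on the departed node needs to move.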

In addition, a Cassandra cluster can be reconfigured quickly and simply, without losing data. Cassandra clusters can grow into the hundreds or thousands of nodes, and naturally, there are machine failures. Cassandra utilizes gossip protocols to detect machine failure and to recover when a machine is brought back into the cluster, all without your application being affected.
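Here is a rough Python sketch of how gossip-based failure detection can work in principle. It is a deliberate simplification under our own assumptions: the `Node` class, the fixed timeout, and the round-robin gossip pattern are hypothetical, and Cassandra’s real failure detector is more sophisticated than a fixed timeout.

```python
class Node:
    """Hypothetical cluster member that gossips heartbeat counters."""

    def __init__(self, name):
        self.name = name
        self.heartbeat = 0
        # peer name -> (highest heartbeat seen, local time it was first seen)
        self.view = {}

    def tick(self, now):
        """Advance this node's own heartbeat."""
        self.heartbeat += 1
        self.view[self.name] = (self.heartbeat, now)

    def gossip_to(self, other, now):
        """Send our view to another node, which keeps the freshest entries."""
        for peer, (beat, _) in self.view.items():
            known_beat = other.view.get(peer, (-1, now))[0]
            if beat > known_beat:
                other.view[peer] = (beat, now)

    def suspected(self, now, timeout):
        """Peers whose heartbeat has not advanced within the timeout."""
        return sorted(p for p, (_, seen) in self.view.items()
                      if p != self.name and now - seen > timeout)


a, b, c = Node("a"), Node("b"), Node("c")

for now in range(10):
    alive = [a, b, c] if now < 3 else [a, b]   # node c goes down at time 3
    for node in alive:
        node.tick(now)
    for i, node in enumerate(alive):
        # Each live node gossips to the next live node, wrapping around.
        node.gossip_to(alive[(i + 1) % len(alive)], now)

print(a.suspected(now=9, timeout=3))
```

No node acts as a coordinator here: each one learns about the others second-hand, which matches the idea that every node in the cluster plays the same role.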

Cassandra’s Data Model

Cassandra is based on a key-value model. A database consists of column families. A column family is a set of key-value pairs. Drawing an analogy with relational databases, you can think of a column family as a table, and of each key-value pair as a record in that table. A table in Cassandra is a distributed multi-dimensional map indexed by a key. Cassandra can handle maps with four or five dimensions:

Map with 4 dimensions:
  1. Keyspace -> Column Family
  2. Column Family -> Column Family Row
  3. Column Family Row -> Columns
  4. Column -> Data value

Map with 5 dimensions:
  1. Keyspace -> Super Column Family
  2. Super Column Family -> Super Column Family Row
  3. Super Column Family Row -> Super Columns
  4. Super Column -> Columns
  5. Column -> Data value
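One way to picture these two layouts is as nested maps. The Python sketch below uses plain dictionaries, with made-up keyspace, column family, and row names, purely to show the shape of the data. It says nothing about how Cassandra stores it on disk.

```python
# 4 dimensions: keyspace -> column family -> row key -> column name -> value
standard = {
    "Blog": {                                  # keyspace (hypothetical name)
        "Users": {                             # column family
            "alice": {                         # row key
                "Email": "alice@example.com",  # column name -> data value
                "City": "Yerevan",
            },
        },
    },
}

# 5 dimensions: the row holds super columns, each of which holds columns
super_cf = {
    "Blog": {                                  # keyspace
        "Posts": {                             # super column family
            "alice": {                         # row key
                "post-001": {                  # super column
                    "Title": "Hello",          # column name -> data value
                    "Body": "First post",
                },
            },
        },
    },
}


def depth(mapping):
    """Count how many nested map levels sit above the data values."""
    levels = 0
    while isinstance(mapping, dict):
        levels += 1
        mapping = next(iter(mapping.values()))
    return levels


print(depth(standard), depth(super_cf))
print(standard["Blog"]["Users"]["alice"]["Email"])
```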

Cassandra is built with a basic key-value model – but with two levels of nesting. At the first level, the value of a record is in turn a sequence of key-value pairs. These nested key-value pairs are called columns, where key is the name of the column. In other words you can say that a record in a column family has a key and consists of columns.

This level of nesting is mandatory: a record must contain at least one column. At the second level, which is optional, the value of a nested key-value pair can be a sequence of key-value pairs as well. When the second level of nesting is present, the outer key-value pairs are called super columns, with the key being the name of the super column, and the inner key-value pairs are called columns.

The names of both columns and super columns can be used in two ways: as names or as values (usually reference values). First, names can play the role of attribute names. For example, the name of a column in a record about a User can be “Email.” That is how we are used to thinking about columns in relational databases.

Second, names can also be used to store values! For example, the column names in a record that represents a “Blog” can be the identifiers of that blog’s posts, and the corresponding column values are the posts themselves. You can use column (or super column) names to store values because:

  • Theoretically, there is no limitation on the number of columns (or super columns) for any given record;
  • Names are byte arrays, so you can encode any value in them.
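As a small illustration of the “names as values” idea, the Python sketch below models one blog record whose column names are encoded post identifiers. The timestamps, the 8-byte big-endian encoding, and the helper names are our own assumptions. The point is simply that a byte-array name can carry a real value, and that a well-chosen encoding keeps the names sortable.

```python
import struct


def encode_post_id(timestamp: int) -> bytes:
    """Pack a non-negative integer as 8 big-endian bytes.

    With this encoding, byte-wise order matches numeric order.
    """
    return struct.pack(">Q", timestamp)


def decode_post_id(name: bytes) -> int:
    """Recover the integer stored in a column name."""
    return struct.unpack(">Q", name)[0]


# One record per blog: column name = encoded post id, column value = post body.
blog_row = {
    encode_post_id(1300000200): b"Second post",
    encode_post_id(1300000100): b"First post",
    encode_post_id(1300000300): b"Third post",
}

# Sorting by the raw column-name bytes yields the posts in chronological order.
for name in sorted(blog_row):
    print(decode_post_id(name), blog_row[name].decode("utf-8"))
```

Adding a new post is just adding one more column to the same record, which is why the lack of a fixed column limit matters.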

We hope that this look at Apache Cassandra gives you the detail you need to pick a database that will serve your needs. Additionally, we highly recommend protecting your database with 24/7 monitoring of servers and networks and cloud platforms where your data may be housed. That’s sound advice that thousands of Monitis users have followed and benefited from!

Next: look for an upcoming post on the benefits and tools built into Apache HBase.
