Hadoop 101

Hadoop, which was named after a toy elephant, is a serious framework for distributed data processing across thousands of nodes and petabytes of data. It’s loosely based on several big-name Google projects: MapReduce, Bigtable, and the Google File System. A key thing to understand about Hadoop is that it’s not one monolithic project. Hadoop has a number of different pieces that work together to help you crunch some very big data.

Hadoop is all about processing data, and Hadoop’s MapReduce engine is what makes all this processing possible. MapReduce is a process that Google came up with for churning through petabytes of data on commodity hardware. Here’s how it works.

There is a master node that takes a problem and divides it into smaller problems that can be handled by worker nodes. This is the “Map” step. As the worker nodes finish their smaller problems, they hand back intermediate answers, which are grouped by key and combined to get the initial problem’s solution. This is called the “Reduce” step. In Hadoop, the master coordinates this combining, but the reduce work itself is also spread across worker nodes.
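To make the two steps concrete, here is a minimal single-process sketch of the idea in plain Python. It is not Hadoop's API; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are made up for illustration, and a real cluster would run the map and reduce calls on different machines.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group intermediate values by key, between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values seen for one key into a single answer."""
    return key, sum(values)

def word_count(chunks):
    """Run map over every chunk, then reduce each group of values."""
    intermediate = []
    for chunk in chunks:  # on a cluster, each chunk would go to a different node
        intermediate.extend(map_phase(chunk))
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())

# Hypothetical input: two small "chunks" standing in for pieces of a huge dataset.
chunks = ["big data big cluster", "big elephant"]
counts = word_count(chunks)
print(counts)
```

Because each map call only looks at its own chunk and each reduce call only looks at one key, both steps can be spread over as many machines as you have.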

To support all of this data processing, you have to store your data somewhere, and Hadoop is ready to help in that department. If you are processing a large number of files, you would use HDFS (the Hadoop Distributed File System). HDFS is an official Hadoop subproject, and it provides a highly fault-tolerant filesystem that is distributed over many nodes, with replication built in. Now you have a distributed, fault-tolerant way to convert 50PB of images from their originals to thumbnails, or to some other file format entirely!
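The reason replication buys you fault tolerance can be shown with a toy model. The sketch below is an assumption-laden illustration, not how HDFS actually places data: the round-robin placement, the node names, and the tiny block size are all invented here, and the replication factor of 3 is only a commonly used setting.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin (toy policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }

def readable_blocks(placement, failed):
    """Return the blocks that still have at least one copy on a healthy node."""
    return [
        block for block, holders in placement.items()
        if any(node not in failed for node in holders)
    ]

# Hypothetical four-node cluster and a 10-byte "file" split into 4-byte blocks.
nodes = ["node1", "node2", "node3", "node4"]
blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_replicas(len(blocks), nodes)
survivors = readable_blocks(placement, failed={"node2"})
print(placement)
print(survivors)
```

With three copies of every block, losing a single node never makes a block unreadable in this model, which is the property that lets you build a big cluster out of hardware that is expected to fail.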

Not all your data is files, so Hadoop also has an official database called HBase. You can use HBase to store data like your user comments and to answer questions such as how many likes your users have gotten. This is easy to do with a SQL database when the data is relatively small, but when you are Twitter or Facebook, SQL just isn’t going to cut it.
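To give a feel for the kind of data layout this implies, here is a small in-memory sketch of a table addressed by row key and column, with a counter for likes. This is not the HBase client API; ToyTable, its methods, and the "user:alice" and "stats:likes" naming are hypothetical and exist only to illustrate the idea.

```python
from collections import defaultdict

class ToyTable:
    """In-memory stand-in for a table laid out as row key -> {column: value}."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Store one value under a row key and column name."""
        self.rows[row_key][column] = value

    def increment(self, row_key, column, amount=1):
        """Add to a counter column, creating it at zero if it is missing."""
        current = self.rows[row_key].get(column, 0)
        self.rows[row_key][column] = current + amount
        return self.rows[row_key][column]

    def get(self, row_key, column, default=None):
        """Read one value without creating the row as a side effect."""
        return self.rows.get(row_key, {}).get(column, default)

# Hypothetical usage: one comment for alice, plus a few likes for two users.
table = ToyTable()
table.put("user:alice", "comments:c1", "Nice post")
for _ in range(3):
    table.increment("user:alice", "stats:likes")
table.increment("user:bob", "stats:likes")
print(table.get("user:alice", "stats:likes"))
```

Looking up a user's like count is a single read by row key, with no joins involved, which is the kind of access pattern that is easier to spread across many machines.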

One last thing to note is that, as I said earlier, Hadoop is not monolithic. You have a lot of options. For instance, you could swap out HDFS for Amazon’s S3, which could be a pretty smart idea considering that Amazon has already built one of these fault-tolerant distributed filesystems for you. Also, let’s say that HBase is not really doing it for you. You could swap out HBase for Cassandra. There are over 10 subprojects listed on Hadoop’s main page, and there is also a very active developer community.

Once you have set up your Hadoop cluster, you are going to need to monitor it. As you can imagine, running a large cluster is not going to be cheap, so keeping the performance of your cluster up and identifying problems early is critical. In a future article we will explore some of the ways that you can better monitor your Hadoop clusters.