In the previous article we introduced Hadoop as the most popular Big Data toolset on the market today. We had just started talking about MapReduce as the major framework that makes Hadoop distinctive. So let’s continue the discussion where we left off.
MapReduce is the key to understanding Hadoop’s parallel processing capability: it enables data in many formats (XML, text, binary, log, SQL, etc.) to be divided up, mapped out to many compute nodes, and then recombined to produce a final data set.
The framework is organized around a “map” function, which transforms a piece of data into some number of key/value pairs. The pairs are then sorted by key, and all pairs sharing the same key are sent to the same node, where a “reduce” function merges their values into a single result.
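The map/shuffle/reduce flow described above can be sketched with a toy word count, the classic MapReduce example. This is a minimal in-memory simulation, not Hadoop itself: the function names and the two sample documents are made up for illustration, and the `shuffle` step stands in for the sort-and-route-by-key work a real cluster does across nodes.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) key/value pair for every word."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all values that share a key, mimicking how
    Hadoop routes pairs with the same key to the same node."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: merge the values for one key into a single result."""
    return (key, sum(values))

documents = ["big data is big", "hadoop processes big data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["big"])  # occurrences of "big" across both documents → 3
```

Because every key is reduced independently, each reduce call could run on a different node, which is exactly what makes the model parallelizable.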
So instead of processing data in tables, as a relational database like Oracle or MySQL does, the Hadoop ecosystem takes a “divide and conquer” approach to Big Data. A simple way to think about this is to imagine a machine with 4 hard drives that each read data at 100MB/s. At that combined rate of 400MB/s, the machine scans a terabyte in roughly 42 minutes. If that same terabyte is divided across 10 machines, each with 4 hard drives, the scan time drops to roughly 4.2 minutes. Hadoop can be scaled from 2 nodes up to thousands of nodes, which translates into orders-of-magnitude gains in the processing of Big Data.
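The back-of-the-envelope figures above follow directly from the per-drive numbers. A quick sketch of the arithmetic (using the decimal convention of 1TB = 1,000,000MB; the constant and function names are just illustrative):

```python
DRIVE_MBPS = 100          # per-drive read speed from the example
DRIVES_PER_MACHINE = 4
TERABYTE_MB = 1_000_000   # 1 TB in MB (decimal convention)

def scan_minutes(machines):
    """Minutes to scan 1 TB when it is striped across `machines`."""
    total_mbps = machines * DRIVES_PER_MACHINE * DRIVE_MBPS
    return TERABYTE_MB / total_mbps / 60

print(round(scan_minutes(1), 1))   # one machine: ~41.7 minutes
print(round(scan_minutes(10), 2))  # ten machines: ~4.17 minutes
```

The speedup is linear in the number of machines because each node scans only its own slice of the data, which is the essence of the divide-and-conquer approach.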
In the last several years Hadoop has grown into the most popular Big Data processing ecosystem on the market, with more vendors emerging all the time. Since its early development at Yahoo, the toolset has demonstrated an impressive ability to scale rapidly and handle enormous processing workloads. Hadoop is the basis for massive operations like Facebook’s, which uses the platform to store and process 100 petabytes of data online. As we saw in the first posts on Big Data, the immense amounts of data available today are going to seem minuscule compared to what’s next! Researchers predict that by 2020 over 30 billion objects will be wirelessly connected to the internet, and the growth of the Internet of Things is only beginning. All signs indicate that Hadoop will continue to grow in popularity as companies require increasingly sophisticated tools to meet their Big Data needs.
In the next post we’ll continue this journey by looking at another Big Data toolset called NoSQL.