In the last series we introduced the concept of ‘Big Data’ as a big deal that has captured global interest in recent years and provides a gold mine of opportunities for businesses that can leverage it effectively. Now we want to focus our attention on two of the major toolsets that can help companies capitalize on the immense amounts of structured and unstructured data available in today’s digital universe. These tools are Hadoop and NoSQL.
The Origins and Growth of Hadoop
If you’ve heard anything about Big Data chances are good that you’ve also heard some buzz around a platform called Hadoop. Hadoop was developed in 2005 by Doug Cutting and Mike Cafarella. Cutting, who worked for Yahoo at the time, actually named this tool after his son’s toy elephant. Hadoop really came to light as an outgrowth of efforts by Google, Yahoo, and other companies to provide faster methods for indexing web pages and handling the growing data bottleneck. By 2008 Yahoo announced that a 10,000 core Hadoop cluster was being used to run its production index search and since then there’s been no looking back. We’re at the point now where, as one publication put it, Hadoop is “the focal point of an immense big data movement.” A fast-growing ecosystem of commercial vendors has emerged in recent years with companies like Cloudera, HortonWorks, and MapR developing customizable and accessible out-of-the-box solutions for scaling up Hadoop. In 2012 the research firm IDC estimated the Hadoop market at $77M with projected annual growth of 60% to $813M by 2016.
How does Hadoop Work?
In a nutshell, Hadoop is an open source data processing framework for storing and analyzing massive amounts of unstructured data across many servers. Unlike a centrally located data warehouse, Hadoop works on an entirely different principle by leveraging the power of distributed computing and parallel processing to divide up data across many nodes (servers). The Hadoop infrastructure is complex and is comprised of its own distributed file system (HDSF) and a programming model called MapReduce. As the Apache site defines it, “Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.”
Please join us tomorrow as we continue our discussion on Hadoop.