Few people in the software industry have not come across the terms “Big Data” and “Hadoop”. These buzzwords come up quite frequently nowadays. Though sometimes over-hyped, Big Data is a big deal for analytics companies and policy makers. So let’s see what the buzz is all about.
Ever since the advent of the Internet, massive amounts of user data have been generated. In the last few years in particular, social media such as Facebook and Twitter, along with blogging websites, have created enormous amounts of user data. According to Gartner, Big Data is high-volume, high-velocity, high-variety data that originates from a multitude of sources. Because it is created in an ad hoc fashion, much of this data lacks structure. It can be analysed to support smarter and more efficient decision making. Big data differs from traditional data in two significant ways. First, it is too large to be stored on a single machine. Second, it lacks the structure that traditional data has. Because of these characteristics, handling and processing big data requires special tools and techniques. This is where Hadoop comes in.
Hadoop is an open source framework built around the Map-Reduce programming paradigm. Map-Reduce is a programming paradigm introduced by Google for processing and analyzing very large data-sets. Programs written in this paradigm process the data-sets in parallel, so they can be scaled out across many servers without much effort. The paradigm scales because the solution is inherently distributed: a big task is divided into many small tasks, which run in parallel on different machines, and their results are then combined to give the solution to the original task. One example use of Hadoop is analyzing user behaviour on e-commerce websites in order to suggest new products for users to buy.
This is traditionally called a recommendation system and can be found on all of the major e-commerce websites. Hadoop can also be used for processing large graphs, such as the Facebook social graph. Hadoop simplifies parallel processing because the developer doesn’t have to deal with the usual concerns of parallel programming. The developer only writes functions that describe how the data should be processed.
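To make the idea of "only writing functions" concrete, here is a minimal sketch of a word count in plain Python. It does not use Hadoop at all: `run_job` is a hypothetical single-process stand-in for the framework, and the names `map_words` and `reduce_counts` are made up for illustration. On a real cluster the framework would run the map and reduce functions on many machines and do the grouping in between.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce step: combine all the values collected for one key."""
    return word, sum(counts)


def run_job(lines, mapper, reducer):
    """Toy driver (assumption: stands in for the framework, runs in one process)."""
    grouped = defaultdict(list)
    for line in lines:                      # a real cluster would spread lines over machines
        for key, value in mapper(line):
            grouped[key].append(value)      # group values by key before reducing
    return dict(reducer(key, values) for key, values in grouped.items())


lines = ["big data needs big tools", "hadoop handles big data"]
result = run_job(lines, map_words, reduce_counts)
print(result)
```

The developer-facing part is just the two small functions; everything inside `run_job` is what a framework like Hadoop takes care of.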
Apache Hadoop Components
The Hadoop framework consists of two major components: storage and processing. First, HDFS (Hadoop Distributed File System) handles data storage across all the machines on which the Hadoop cluster is running. Second, Map-Reduce handles the processing part of the framework. Let’s look at them individually.
HDFS (Hadoop Distributed File System)
HDFS is a distributed, scalable file system whose design draws heavily on GFS (Google File System), which is also a distributed file system. Distributed file systems are needed when data becomes too large to store on a single machine. All the complexities and uncertainties of the network then come into the picture, which makes distributed file systems more complex than ordinary file systems. HDFS stores files as blocks. The default block size is 64 MB in Hadoop 1.x (128 MB from Hadoop 2 onwards). Each block is replicated on multiple machines, which provides fault tolerance and helps parallel processing. An HDFS cluster has two types of nodes: a namenode, which is the master node, and multiple datanodes, which are the slave nodes. Apart from these two, it can also have a secondary namenode.
- Namenode: – It manages the namespace of the file system, that is, all the files and directories. The namenode holds the mapping between each file and the blocks in which it is stored. Clients access a file by asking the namenode where its blocks are and then reading those blocks from the datanodes.
- Datanode: – It actually stores the data in the form of blocks. Each datanode keeps reporting to the namenode about the blocks it has stored, so that the namenode knows where the data is and it can be accessed. This makes the namenode the most crucial node and a single point of failure: without it, the data can’t be accessed.
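- Secondary Namenode: – This node is responsible for checkpointing the information from the namenode. If the namenode fails, this checkpoint can be used to help restart the system. Note that the secondary namenode is not a live standby that takes over automatically.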
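The bookkeeping described above can be sketched in a few lines of Python. This is a toy model, not the HDFS API: `ToyNameNode`, `split_into_blocks` and the round-robin replica placement are all made up for illustration (real HDFS placement is more involved), and the replication factor of 3 is an assumption that the text above does not state.

```python
BLOCK_SIZE_MB = 64   # Hadoop 1.x default block size mentioned above
REPLICATION = 3      # assumption: replication factor, not stated in the text


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file would be cut into."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])


class ToyNameNode:
    """Keeps only metadata: which blocks make up a file and where they live."""

    def __init__(self, datanodes):
        self.datanodes = list(datanodes)
        self.file_to_blocks = {}   # file name -> list of block ids
        self.block_to_nodes = {}   # block id -> datanodes holding a replica

    def add_file(self, name, size_mb):
        n = len(self.datanodes)
        block_ids = []
        for index, _size in enumerate(split_into_blocks(size_mb)):
            block_id = f"{name}#blk{index}"
            # Hypothetical round-robin placement of the replicas.
            replicas = [self.datanodes[(index + r) % n]
                        for r in range(min(REPLICATION, n))]
            self.block_to_nodes[block_id] = replicas
            block_ids.append(block_id)
        self.file_to_blocks[name] = block_ids

    def locate(self, name):
        """What a client asks for: the blocks of a file and where to read them."""
        return [(b, self.block_to_nodes[b]) for b in self.file_to_blocks[name]]


namenode = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
namenode.add_file("logs.txt", 200)          # a 200 MB file
for block_id, nodes in namenode.locate("logs.txt"):
    print(block_id, nodes)
```

A 200 MB file ends up as three full 64 MB blocks plus one 8 MB block. Losing the `ToyNameNode` object would leave the blocks on the datanodes but no way to tell which blocks belong to which file, which is why the namenode is a single point of failure.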
Map-Reduce
Map-Reduce is a programming paradigm in which every job is specified in terms of a map function and a reduce function. Both kinds of task run in parallel on the cluster. The storage they need is provided by HDFS. The main components of Map-Reduce are the following:
- Job Tracker: – Map-Reduce jobs are submitted to the Job Tracker. It talks to the namenode to find out where the data is located, then hands out tasks to Task Tracker nodes. These Task Tracker nodes have to report to the Job Tracker at regular intervals to signal that they are alive and working on their tasks. If a Task Tracker doesn’t report, it is assumed to be dead and its work is reassigned to another Task Tracker. The Job Tracker is, again, a single point of failure: if it fails, we can no longer track the tasks.
- Task Tracker: – The Task Tracker takes tasks from the Job Tracker. These tasks are map, reduce or shuffle operations. The Task Tracker creates a separate JVM process for each task, so that a failure in a task process doesn’t bring down the Task Tracker itself. It also reports to the Job Tracker continuously, so that the Job Tracker can keep track of which Task Trackers are alive and which have failed.
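The heartbeat-and-reassign behaviour described in the two bullets above can be illustrated with a small simulation. This is only a sketch under stated assumptions: `ToyJobTracker`, its method names and the 10-second timeout are hypothetical and not taken from Hadoop, and time is passed in explicitly so the example stays deterministic.

```python
class ToyJobTracker:
    """Toy model: tracks heartbeats and moves tasks off silent Task Trackers."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s     # hypothetical timeout, chosen for the example
        self.last_heartbeat = {}       # tracker name -> time of last heartbeat
        self.assignments = {}          # task id -> tracker name

    def heartbeat(self, tracker, now):
        self.last_heartbeat[tracker] = now

    def assign(self, task, tracker):
        self.assignments[task] = tracker

    def live_trackers(self, now):
        return [t for t, seen in self.last_heartbeat.items()
                if now - seen <= self.timeout_s]

    def reassign_dead(self, now):
        """Move every task held by a silent tracker to a live one."""
        live = self.live_trackers(now)
        moved = {}
        for task, tracker in list(self.assignments.items()):
            if tracker not in live and live:
                new_tracker = live[len(moved) % len(live)]   # spread tasks over live trackers
                self.assignments[task] = new_tracker
                moved[task] = new_tracker
        return moved


jt = ToyJobTracker(timeout_s=10)
jt.heartbeat("tt1", now=0)
jt.heartbeat("tt2", now=0)
jt.assign("map-0", "tt1")
jt.assign("map-1", "tt2")
jt.heartbeat("tt1", now=8)           # tt2 goes silent after its first heartbeat
moved = jt.reassign_dead(now=15)
print(moved)
```

Here `tt2` has not reported for 15 seconds, so its task is handed to `tt1`. All of this state lives in the single `ToyJobTracker` object, which mirrors why the Job Tracker itself is a single point of failure.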