Wednesday, February 17, 2016

An Introduction to Hadoop

Hadoop: Introduction
Apache Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software’s ability to detect and handle failures at the application layer. The Apache Hadoop framework is composed of the following modules:
·         Hadoop Common - the libraries and utilities needed by other Hadoop modules.
·         Hadoop Distributed File System (HDFS) - a distributed file-system that stores data on the commodity machines, providing very high aggregate bandwidth across the cluster.
·         Hadoop YARN - a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users' applications.
·         Hadoop MapReduce - a programming model for large scale data processing.
Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant. It was originally developed to support distribution for the Nutch search engine project.





Hadoop Key Characteristics
·         Reliable: When a node stops working, the system redirects work to another copy of the same data and processing continues without missing a beat.
·         Economical: Hadoop brings massively parallel computing to commodity servers (systems with average configurations). The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data.
·         Scalable: New nodes can be added whenever the need arises, without requiring any change to the data formats, how data is loaded, how jobs are written, or the applications on top.
·         Flexible: Hadoop is schema-less and can absorb any type of data, structured or unstructured, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide.
Difference between Hadoop and RDBMS
Difference on    RDBMS                              Hadoop
Data Types       Structured                         Multi and unstructured
Processing       Limited                            Processing coupled with data
Schema           Required on write                  Required on read
Speed            Reads are fast                     Writes are fast
Cost             Software license                   Support only
Resources        Known entity                       Growing, complex, wide
Best Fit Use     Interactive OLAP analytics,        Data discovery, processing of
                 complex ACID transactions,         unstructured data, massive
                 operational data store             storage/processing
Hadoop Ecosystem
·         Apache Oozie: Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
·         Hive: A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (which the runtime engine translates into MapReduce jobs) for querying the data.
·         Pig: A data flow language (Pig Latin) and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
·         Mahout: Apache Mahout is a library of scalable machine-learning algorithms, implemented on top of Apache Hadoop and using the MapReduce paradigm.




·         MapReduce: A distributed data processing model and execution environment that runs on large clusters of commodity machines.
·         HBase: A distributed, column-oriented non-relational database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
·         HDFS: A distributed filesystem that runs on large clusters of commodity machines.
·         Flume: Flume is a framework for populating Hadoop with data. Agents are deployed throughout one's IT infrastructure – inside web servers, application servers and mobile devices, for example – to collect data and integrate it into Hadoop.
·         Sqoop: Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop. It allows users to specify the target location inside of Hadoop and instruct Sqoop to move data from Oracle, Teradata or other relational databases to the target.
Hadoop Core Components
·         Hadoop Distributed File System: HDFS, the storage layer of Hadoop, is a distributed, scalable, Java-based file system adept at storing large volumes of unstructured data.
·         Distributed across “nodes”
·         Natively redundant
·         The NameNode tracks block locations.
·         MapReduce: MapReduce is a software framework that serves as the compute layer of Hadoop. MapReduce jobs are divided into two (obviously named) parts. The “Map” function divides a query into multiple parts and processes data at the node level. The “Reduce” function aggregates the results of the “Map” function to determine the “answer” to the query.
·         Splits a task across processors “near” the data and assembles the results
·         Self-healing, high-bandwidth, clustered storage



Hadoop Distributed File System (HDFS) Architecture:
HDFS is a fault tolerant and self-healing distributed file system designed to turn a cluster of industry standard servers into a massively scalable pool of storage. Developed specifically for large-scale data processing workloads where scalability, flexibility and throughput are critical, HDFS accepts data in any format regardless of schema, optimizes for high bandwidth streaming, and scales to proven deployments of 100PB and beyond.
Main HDFS features:
·         Scale-Out Architecture: Servers can be easily added to increase capacity.
·         Fault Tolerance: Automatically and seamlessly recovers from failures.
·         High Availability: Serves mission-critical workflows and applications.
·         Load Balancing: Places data intelligently to keep the load balanced for maximum efficiency and utilization.
·         Flexible Access: Multiple and open frameworks for serialization and file system mounts.
·         Tunable Replication: Multiple copies (three by default) of each file provide data protection and computational performance (illustrated in the sketch after this list).
·         Security: POSIX-based file permissions for users and groups, with optional LDAP integration.
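As a small illustration of tunable replication, the sketch below uses Hadoop's Java FileSystem API to write a file with an explicit replication factor and change it afterwards. The path and factor values are arbitrary examples, and the cluster location comes from whatever core-site.xml/hdfs-site.xml are on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Sketch: writing an HDFS file and tuning its replication factor. */
public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();           // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);                // client handle backed by the NameNode
        Path file = new Path("/tmp/replication-demo.txt");   // hypothetical path

        // Create the file with an explicit replication factor of 3 (the default).
        try (FSDataOutputStream out = fs.create(file, (short) 3)) {
            out.writeBytes("hello hdfs\n");
        }

        // The replication factor of an existing file can be changed later.
        fs.setReplication(file, (short) 2);
        fs.close();
    }
}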
Assumptions and Goals
·         Hardware Failure: Hadoop runs on clusters of low-cost, less reliable commodity hardware. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. The fact that there is a huge number of components, and that each component has a non-trivial probability of failure, means that some component of HDFS is always non-functional. Thus, detection of these faults and quick, automatic recovery from them is an important architectural goal of HDFS.
·         Continuous Data Access: Applications that run on HDFS need continuous access to their data sets. HDFS is designed more for batch processing than for interactive use by users. The emphasis in HDFS is on high throughput of data access rather than low latency of data access; to increase data throughput rates, some POSIX requirements are relaxed in a few key areas.
·         Large Data Sets: Applications with large data sets (files) run on HDFS. A typical file in HDFS is gigabytes to terabytes in size, so HDFS is tuned to support large files. It should be capable of supporting tens of millions of files in a single instance.
·         Coherency Model: HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high-throughput data access. This is true of Hadoop 1.x versions; Hadoop 2.x versions support appending writes to files (see the sketch after this list).
·         “Moving Computation is Cheaper than Moving Data”: A computation requested by an application is much more efficient if it is executed near the data it operates on, especially when the data set is really huge. This approach minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located than to move the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.
·         Portability across Heterogeneous Hardware and Software Platforms: HDFS is designed to be easily portable from one platform to another.
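To illustrate the write-once-read-many coherency model (and the append support added in Hadoop 2.x) mentioned above, here is a minimal client-side sketch; the file path is hypothetical and error handling is omitted.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

/** Sketch: write once, read many times, and (in Hadoop 2.x) append. */
public class CoherencyDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/tmp/coherency-demo.txt");    // hypothetical path

        // Write once: the file has a single writer until it is closed.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("written once\n");
        }

        // Read many: any number of clients can now stream the file.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        // Hadoop 2.x additionally allows appending to an already-closed file.
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("appended later\n");
        }
        fs.close();
    }
}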




Main Components of HDFS
·         NameNode: It is the centerpiece (master) of an HDFS file system. It keeps the directory tree of all the files present in the file system and tracks where across the cluster each file's data is kept; in other words, it stores the metadata of the cluster. It does not store the data of these files itself. It is a master server that manages the file system namespace and regulates access to files by clients. Whenever client applications wish to locate a file, or want to add/copy/move/delete a file, they talk to the NameNode. The NameNode responds to successful requests by returning a list of the relevant DataNode servers where the data lives (a client-side sketch of this interaction appears after this list).
·         The NameNode is a single point of failure for the HDFS cluster. In Hadoop 1.x, HDFS is not a high-availability system: when the NameNode goes down, the file system goes offline.
·         DataNode: A DataNode stores data in the Hadoop File System. DataNodes are slaves which are deployed on each machine and provide the actual storage. A functional filesystem has more than one DataNode, with data replicated across them. DataNodes are responsible for serving read and write requests for the clients. DataNode instances can talk to each other, which is what they do when they are replicating data. They can also perform block creation, deletion, and replication upon instruction from the NameNode.
·         Relationship between NameNode and DataNodes: The NameNode and DataNode are pieces of software designed to run on commodity machines across heterogeneous operating systems. HDFS is built using the Java programming language; therefore, any machine that supports the Java programming language can run HDFS. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software.
·         The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.
·         Secondary NameNode: The Secondary NameNode regularly connects to the primary NameNode and builds snapshots (checkpoints) of the primary NameNode's directory information, which the system then saves to local or remote directories. These checkpointed images can be used to restart a failed primary NameNode without replaying the entire journal of file-system actions and then editing the log to create an up-to-date directory structure. It is not a hot standby for the primary/master NameNode.
·         File System Namespace: HDFS supports a traditional hierarchical file organization in which a user or an application can create directories and store files inside them. The file system namespace hierarchy is similar to most other existing file systems; you can create, rename, relocate, and remove files. The NameNode maintains the file system namespace and records any change to the namespace or its properties. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file, and this information is stored by the NameNode.
·         Data Replication: HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file: an application can specify the number of replicas of a file at creation time and change it later. Files in HDFS are write-once and have strictly one writer at any time. The NameNode makes all decisions regarding the replication of blocks. Each DataNode in the cluster periodically sends a Heartbeat and a Blockreport to the NameNode. Receipt of a Heartbeat implies that the DataNode is functioning properly, and a Blockreport contains a list of all blocks on a DataNode.
·         Replica Placement: HDFS uses an intelligent replica placement model for reliability and performance. Optimized replica placement distinguishes HDFS from most other distributed file systems and relies on a rack-aware placement policy that uses network bandwidth efficiently. The purpose of a rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization. Large HDFS clusters typically span multiple racks, and communication between two DataNodes on different racks is typically slower than between DataNodes on the same rack, so the NameNode attempts to optimize communication between DataNodes; it identifies the location of DataNodes by their rack IDs. A simple but non-optimal policy is to place replicas on unique racks. Instead, when the replication factor is 3, HDFS's placement policy is to put one replica on a node in the local rack, another on a node in a different (remote) rack, and the last on a different node in that same remote rack. When a read request comes in, HDFS tries to satisfy it from the replica closest to the reader to minimize global bandwidth consumption and read latency; if a replica exists on the same rack as the reader node, that replica is preferred.



·         File System Metadata: The EditLog is a transaction log used by the NameNode to persistently record every change that occurs to file system metadata. For example, when a new file is created in HDFS, the NameNode inserts a record into the EditLog indicating the creation; likewise, when the replication factor of a file is changed, a new record is inserted into the EditLog. The NameNode uses a file in its local host OS file system to store the EditLog. The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage, which is also kept in the NameNode's local file system. The NameNode can maintain multiple copies of the FsImage and EditLog files; with multiple copies in place, any change to either file propagates synchronously to all copies. When the NameNode restarts, it uses the latest consistent version of the FsImage and EditLog to initialize itself.
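The client-side sketch referenced above shows how an application asks the NameNode for a file's metadata and for the DataNodes holding each block. Note that only metadata comes back from the NameNode, never the file data itself; the path shown is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Sketch: querying the NameNode for file metadata and block locations. */
public class BlockLocationDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/tmp/replication-demo.txt");   // hypothetical path

        // Namespace metadata (size, replication factor, ...) comes from the NameNode.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("length=" + status.getLen()
                + " replication=" + status.getReplication());

        // For each block, the NameNode reports which DataNodes hold a replica.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + block.getOffset()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}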
MapReduce
MapReduce is a software framework for easily writing applications that process multi-terabyte data-sets in parallel on thousands of nodes of low-cost commodity hardware in a reliable, fault-tolerant manner. A MapReduce job consists of two separate kinds of tasks: Map tasks and Reduce tasks. A MapReduce job usually splits the input data-set into independent chunks, which are then processed by the map tasks in a completely parallel manner. The outputs of the maps are then sorted and given as input to the reduce tasks. The framework takes care of scheduling and monitoring these tasks, and re-executes any failed tasks.
With MapReduce and Hadoop, instead of moving data to the compute location, the computation happens at the location of the data; the storage and the processing coexist on the same physical nodes in the cluster, which results in very high aggregate bandwidth across the cluster. The MapReduce framework consists of:
·         A single master JobTracker and
·         one slave TaskTracker per cluster-node.
The responsibilities of the master include scheduling the jobs' component tasks on the slaves, monitoring them, and re-executing any failed tasks. The slaves are responsible for executing the tasks as directed by the master.
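As a concrete sketch of how a job is handed to the framework, the driver below uses the Hadoop 2.x org.apache.hadoop.mapreduce API. The class and path names are illustrative, and the WordCountMapper/WordCountReducer classes it references are sketched after the Logical View section below.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Sketch: submitting a word-count job; the framework schedules and retries the tasks. */
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);     // sketched after the Logical View section
        job.setReducerClass(WordCountReducer.class);   // sketched after the Logical View section
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input is split into map tasks
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // reduce output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}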



Logical View of MapReduce:
·         The Map and Reduce functions of MapReduce are both defined with respect to data structured in (key, value) pairs. Map takes one pair of data with a type in one data domain, and returns a list of pairs in a different domain:
Map(k1, v1) → list(k2, v2)
Every key-value pair in the input dataset is processed by the Map function which produces a list of pairs for each call. After that, the MapReduce framework collects all pairs with the same key from all lists and groups them together, creating one group for each key.
·         After that, the Reduce function is applied in parallel to each group created by the Map function, which in turn produces a collection of values in the same domain:
Reduce(k2, list(v2)) → list(v3)
Each Reduce call typically produces either one value v3 or an empty return, though one call is allowed to return more than one value. The returns of all calls are then collected as the desired result list.
Therefore, the MapReduce framework transforms a list of (key, value) pairs into a list of values.
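As a concrete instance of these signatures, here is a sketch of the canonical word-count Mapper and Reducer (the classes referenced by the driver sketch above): Map turns each input line (k1 = byte offset, v1 = line text) into (word, 1) pairs, and Reduce sums the list of 1s grouped under each word.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Map(k1, v1) -> list(k2, v2): emits (word, 1) for every word on the input line. */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);                 // emit (word, 1)
        }
    }
}

/** Reduce(k2, list(v2)) -> list(v3): sums the counts grouped under each word. */
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));     // emit (word, total count)
    }
}

In practice each class would live in its own source file; they are shown together here only for brevity.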



The two biggest advantages of MapReduce are:
·         Taking processing to the data.
·         Processing data in parallel.
Hadoop Configuration Files
The main files for configuring Hadoop are:
·         hadoop-env.sh: This file contains the environment variables that are used in the scripts to run Hadoop.
·         core-site.xml: It contains the configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce (see the sketch after this list).
·         hdfs-site.xml: Configuration settings for the HDFS daemons: the namenode, the secondary namenode and the datanodes.
·         mapred-site.xml: Configuration settings for the MapReduce daemons: the job-tracker and the task-trackers.
·         masters: It contains a list of machines (one per line) that each run a secondary namenode.
·         slaves: It contains a list of machines (one per line) that each run a datanode and a task-tracker.
·         hadoop-metrics.properties: All the properties for controlling how metrics are published in Hadoop can be specified in this file.
·         log4j.properties: It contains the properties for system log files, the namenode audit log and the task log for the task-tracker child process.
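Below is a minimal sketch of how these XML files surface in client code, assuming Hadoop's standard Configuration class; the property names shown (fs.defaultFS from core-site.xml, dfs.replication from hdfs-site.xml) are the Hadoop 2.x names.

import org.apache.hadoop.conf.Configuration;

/** Sketch: how client code picks up values from the Hadoop configuration files. */
public class ConfigDemo {
    public static void main(String[] args) {
        // core-default.xml and core-site.xml are loaded automatically from the classpath;
        // the HDFS and MapReduce clients add hdfs-site.xml and mapred-site.xml respectively.
        Configuration conf = new Configuration();

        // Properties defined in the XML files can be read (or overridden) programmatically.
        System.out.println("fs.defaultFS    = " + conf.get("fs.defaultFS", "file:///"));
        System.out.println("dfs.replication = " + conf.getInt("dfs.replication", 3));
    }
}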
Difference between Hadoop 1.x and Hadoop 2.2
Limitations of Hadoop 1.x:
·         Limited to about 4,000 nodes per cluster
·         JobTracker work scales as O(# of tasks in a cluster)
·         JobTracker bottleneck – resource management, job scheduling and monitoring
·         Only one namespace for managing HDFS
·         Map and Reduce slots are static
·         The only type of job that can run is MapReduce



Apache Hadoop 2.2.0 includes significant improvements over the previous stable release (hadoop-1.x). Below is a short overview of the improvements to both HDFS and MapReduce in Hadoop 2.2:
·         HDFS Federation: Multiple independent NameNodes/namespaces are used in federation to scale the name service horizontally. The NameNodes are federated, which means they are independent and do not require any coordination with each other. All the NameNodes use the DataNodes as common storage for blocks: each DataNode registers with every NameNode in the cluster, sends periodic heartbeats and block reports, and handles commands from the NameNodes.
·         MapReduce NextGen (YARN/MRv2): Up through Hadoop 1.x, the two major functions of the JobTracker were resource management and job life-cycle management. The new architecture, introduced in hadoop-0.23, divides these tasks into separate components.
The newly introduced ResourceManager manages the global assignment of compute resources to applications, while a per-application ApplicationMaster manages the application's scheduling and coordination. An application is either a single job in the sense of classic MapReduce jobs or a DAG (Directed Acyclic Graph) of such jobs. The ResourceManager, together with the NodeManager daemon that manages the user processes on each machine, forms the computation fabric. The per-application ApplicationMaster is a framework-specific library responsible for negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
Hadoop 2.2 features:
·         Potentially up to 10,000 nodes per cluster
·         Scheduling work scales as O(cluster size)
·         Supports multiple namespaces for managing HDFS
·         Efficient cluster utilization (YARN)
·         MRv1 backward and forward compatible
·         Any application can integrate with Hadoop