An Introduction to Hadoop
Introduction
Apache Hadoop is an open source software
project that enables the distributed processing of large data sets across
clusters of commodity servers. It is designed to scale up from a single server
to thousands of machines, with a very high degree of fault tolerance. Rather
than relying on high-end hardware, the resiliency of these clusters comes from
the software’s ability to detect and handle failures at the application layer.
The Apache Hadoop framework is composed of the following modules:
· Hadoop Common: contains the libraries and utilities needed by other Hadoop modules.
· Hadoop Distributed File System (HDFS): a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
· Hadoop YARN: a resource-management platform responsible for managing compute resources in clusters and scheduling users' applications.
· Hadoop MapReduce: a programming model for large-scale data processing.
Hadoop
was created by Doug Cutting and Mike Cafarella in 2005. Cutting, who was
working at Yahoo! at the time, named it after his
son's toy elephant. It was originally developed to support distribution for the Nutch search
engine project.
Hadoop Key Characteristics
· Reliable: When a node stops working, the system redirects work to another node that holds a copy of the same data, and processing continues without missing a beat.
· Economical: Hadoop brings massively parallel computing to commodity servers (systems with average configurations). The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data.
· Scalable: New nodes can be added whenever the need arises, without requiring any change to the data formats, how data is loaded, how jobs are written, or the applications on top.
· Flexible: Hadoop is schema-less and can absorb any type of data, structured or unstructured, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide.
Difference between Hadoop and RDBMS

| Difference on | RDBMS | Hadoop |
|---------------|-------|--------|
| Data Types | Structured | Multi and unstructured |
| Processing | Limited | Processing coupled with data |
| Schema | Required on write | Required on read |
| Speed | Reads are fast | Writes are fast |
| Cost | Software license | Support only |
| Resources | Known entity | Growing, complex, wide |
| Best Fit Use | Interactive OLAP analytics; complex ACID transactions; operational data store | Data discovery; processing unstructured data; massive storage/processing |
Hadoop Ecosystem
· Apache Oozie: Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
· Hive: A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (which is translated by the runtime engine to MapReduce jobs) for querying the data.
· Pig: A data-flow language (Pig Latin) and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
· Mahout: Apache Mahout is a library of scalable machine-learning algorithms, implemented on top of Apache Hadoop and using the MapReduce paradigm.
· MapReduce: A distributed data processing model and execution environment that runs on large clusters of commodity machines.
· HBase: A distributed, column-oriented, non-relational database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
· HDFS: A distributed filesystem that runs on large clusters of commodity machines.
· Flume: Flume is a framework for populating Hadoop with data. Agents are placed throughout one's IT infrastructure – inside web servers, application servers and mobile devices, for example – to collect data and integrate it into Hadoop.
· Sqoop: Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop. It allows users to specify the target location inside Hadoop and instruct Sqoop to move data from Oracle, Teradata or other relational databases to the target.
Hadoop Core Components
· Hadoop Distributed File System (HDFS): HDFS, the storage layer of Hadoop, is a distributed, scalable, Java-based file system adept at storing large volumes of unstructured data.
   · Distributed across "nodes"
   · Natively redundant
   · NameNode tracks locations
· MapReduce: MapReduce is a software framework that serves as the compute layer of Hadoop. MapReduce jobs are divided into two (obviously named) parts. The "Map" function divides a query into multiple parts and processes data at the node level. The "Reduce" function aggregates the results of the "Map" function to determine the "answer" to the query.
   · Splits a task across processors "near" the data and assembles the results
   · Self-healing, high bandwidth
   · Clustered storage
Hadoop Distributed File System (HDFS) Architecture:
HDFS is a fault-tolerant and self-healing distributed file system designed to turn a cluster of industry-standard servers into a massively scalable pool of storage. Developed specifically for
large-scale data processing workloads where scalability, flexibility and
throughput are critical, HDFS accepts data in any format regardless of schema,
optimizes for high bandwidth streaming, and scales to proven deployments of
100PB and beyond.
Main HDFS Features:
· Scale-Out Architecture: Servers can easily be added to increase capacity.
· Fault Tolerance: Automatically and seamlessly recovers from failures.
· High Availability: Serves mission-critical workflows and applications.
· Load Balancing: Places data intelligently, keeping the load balanced for maximum efficiency and utilization.
· Flexible Access: Multiple and open frameworks for serialization and file system mounts.
· Tunable Replication: Multiple copies (three by default) of each file provide data protection and computational performance.
· Security: POSIX-based file permissions for users and groups, with optional LDAP integration.
Assumptions and Goals
· Hardware Failure: Hadoop runs on clusters of commodity hardware, which is low-cost and not very reliable. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. The fact that there are a huge number of components, and that each component has a non-trivial probability of failure, means that some component of HDFS is always non-functional. Thus, detection of these faults and quick, automatic recovery from them is an important architectural goal of HDFS.
· Continuous Data Access: Applications that run on HDFS need streaming access to their data sets. HDFS is designed more for batch processing than for interactive use by users. The emphasis in HDFS is on high throughput of data access rather than low latency of data access. Thus, to increase data throughput rates, some POSIX requirements are relaxed in a few key areas.
· Large Data Sets: Applications with large data sets (files) run on HDFS. A typical file in HDFS is gigabytes to terabytes in size, so HDFS is tuned to support large files. It should be capable of supporting tens of millions of files in a single instance.
· Coherency Model: HDFS applications need a write-once-read-many access model for files. A file, once created, written, and closed, need not be changed. This assumption simplifies data coherency issues and enables high-throughput data access. This is true for Hadoop 1.x versions; Hadoop 2.x versions support appending writes to files.
· "Moving Computation is Cheaper than Moving Data": A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the data set is very large. This approach minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located than to move the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.
· Portability across Heterogeneous Hardware and Software Platforms: HDFS is designed to be easily portable from one platform to another.
Main Components of HDFS
· NameNode: The NameNode is the centerpiece (master) of an HDFS file system. It keeps the directory tree of all files present in the file system, and tracks where across the cluster each file's data is kept; in other words, it stores the metadata about the cluster. It does not store the data of these files itself. It is a master server that manages the file system namespace and regulates access to files by clients. Whenever client applications wish to locate a file, or want to add/copy/move/delete a file, they talk to the NameNode. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.
· The NameNode is a Single Point of Failure for the HDFS cluster. In Hadoop 1.x, HDFS is not a High Availability system: when the NameNode goes down, the file system goes offline.
· DataNode: A DataNode stores data in the Hadoop file system. DataNodes are slaves, deployed on each machine, that provide the actual storage. A functional filesystem has more than one DataNode, with data replicated across them. DataNodes are responsible for serving read and write requests from clients. DataNode instances can talk to each other, which is what they do when they are replicating data. They also perform block creation, deletion, and replication upon instruction from the NameNode.
· Relationship between NameNode and DataNodes: The NameNode and DataNode are pieces of software designed to run on commodity machines across heterogeneous operating systems. HDFS is built using the Java programming language; therefore, any machine that supports Java can run HDFS. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software.
· The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.
· Secondary NameNode: The Secondary NameNode regularly connects to the primary NameNode and builds snapshots of the primary NameNode's directory information, which the system then saves to local or remote directories. These checkpointed images can be used to restart a failed primary NameNode without having to replay the entire journal of file-system actions and then edit the log to create an up-to-date directory structure. Note that the Secondary NameNode is not a hot standby for the primary/master NameNode.
· File System Namespace: HDFS supports a traditional hierarchical file organization in which a user or an application can create directories and store files inside them. The file system namespace hierarchy is similar to most other existing file systems; you can create, rename, relocate, and remove files. The NameNode maintains the file system namespace and records any change to the namespace or its properties. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file, and this information is stored by the NameNode.
· Data Replication: HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file; the replication factor can be specified at file creation time and changed later (see the client API sketch after this list). Files in HDFS are write-once and have strictly one writer at any time. The NameNode makes all decisions regarding the replication of blocks. Each DataNode in the cluster periodically sends a Heartbeat and a Blockreport to the NameNode. Receipt of a Heartbeat implies that the DataNode is functioning properly; a Blockreport contains a list of all blocks on a DataNode.
· Replica Placement: HDFS uses an intelligent replica placement model for reliability and performance. Optimized replica placement distinguishes HDFS from most other distributed file systems, and is facilitated by a rack-aware replica placement policy that uses network bandwidth efficiently. The purpose of a rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization. Large HDFS environments typically operate across multiple racks of computers. Communication between two DataNodes on different racks is typically slower than between DataNodes on the same rack, so the NameNode attempts to optimize communication between DataNodes. The NameNode identifies the location of DataNodes by their rack IDs. A simple but non-optimal policy is to place replicas on unique racks. In the common case, when the replication factor is 3, HDFS's placement policy is to put one replica on one node in the local rack, another on a node in a different (remote) rack, and the last on a different node in that same remote rack. When a read request comes in, HDFS tries to satisfy it from the replica closest to the reader, to minimize global bandwidth consumption and read latency. If a replica exists on the same rack as the reader node, that replica is preferred to satisfy the read request.
· File System Metadata: The EditLog is a transaction log used by the NameNode to persistently record every change that occurs to file system metadata. For example, when a new file is created in HDFS, the NameNode inserts a record into the EditLog indicating the creation. In the same manner, if the replication factor of a file is changed, a new record is inserted into the EditLog. The NameNode uses a file in its local host OS file system to store the EditLog. The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage. The FsImage is likewise stored as a file in the NameNode's local file system. A NameNode supports multiple copies of the FsImage and EditLog files; with multiple copies in place, any change to either file propagates synchronously to all of the copies. When a NameNode restarts, it uses the latest consistent version of the FsImage and EditLog to initialize itself.
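
To make the namespace and replication discussion concrete, here is a minimal sketch using Hadoop's org.apache.hadoop.fs.FileSystem client API. The NameNode address hdfs://namenode:8020 and the paths are hypothetical placeholders, not part of the text above; in practice fs.defaultFS would come from core-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address; normally read from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode:8020");
    FileSystem fs = FileSystem.get(conf);

    // Namespace operations (create/rename/delete) go through the NameNode.
    Path dir = new Path("/user/demo");
    Path file = new Path(dir, "sample.txt");
    fs.mkdirs(dir);

    // Write a file; the actual bytes flow to DataNodes, never through the NameNode.
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeBytes("hello hdfs\n");
    }

    // The replication factor can be changed after creation; the NameNode records it.
    fs.setReplication(file, (short) 2);
    FileStatus status = fs.getFileStatus(file);
    System.out.println("replication = " + status.getReplication());

    fs.delete(dir, true); // recursive delete
    fs.close();
  }
}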
MapReduce
MapReduce is a software framework for easily writing applications which process multi-terabyte data sets in parallel on thousands of nodes of low-cost commodity hardware in a reliable, fault-tolerant manner. A MapReduce job consists of two separate kinds of task: Map tasks and Reduce tasks. A MapReduce job usually splits the input data set into independent chunks, which are then processed by the Map tasks in a completely parallel manner. The outputs of the maps are then sorted and given as input to the Reduce tasks. The framework takes care of scheduling and monitoring these tasks, and re-executes any failed tasks.
With MapReduce and Hadoop, instead of moving data to the compute location, the computation happens at the location of the data; storage and processing coexist on the same physical nodes in the cluster, which results in very high aggregate bandwidth across the cluster. The MapReduce framework consists of:
· a single master JobTracker, and
· one slave TaskTracker per cluster node.
The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them, and re-executing failed tasks. The slaves execute the tasks as directed by the master.
Logical View of MapReduce:
· The Map and Reduce functions of MapReduce are both defined with respect to data structured in (key, value) pairs. Map takes one pair of data with a type in one data domain, and returns a list of pairs in a different domain:

Map(k1, v1) → list(k2, v2)

Every key-value pair in the input dataset is processed by the Map function, which produces a list of pairs for each call. After that, the MapReduce framework collects all pairs with the same key from all lists and groups them together, creating one group for each key.
· After that, the Reduce function is applied in parallel to each group created by the Map function, which in turn produces a collection of values in the same domain:

Reduce(k2, list(v2)) → list(v3)

Each Reduce call typically produces either one value v3 or an empty return, though one call is allowed to return more than one value. The returns of all calls are then collected as the desired result list.
Therefore, MapReduce transforms a list of (key, value) pairs into a list of values. The classic word-count job sketched below illustrates both functions.
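
Here is a minimal sketch of word count written against the org.apache.hadoop.mapreduce API: the Mapper emits a (word, 1) pair for every word it sees, and the Reducer sums the counts for each word. The input and output paths are passed on the command line and are assumed to be HDFS directories.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map(k1, v1) → list(k2, v2): for each input line, emit (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce(k2, list(v2)) → list(v3): sum the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Such a job would typically be packaged into a jar and submitted with a command along the lines of: hadoop jar wordcount.jar WordCount /input /output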
The two biggest advantages of MapReduce are:
· taking processing to the data, and
· processing data in parallel.
Hadoop Configuration Files
The main files for configuring Hadoop are:
· hadoop-env.sh: This file contains the environment variables that are used in the scripts that run Hadoop.
· core-site.xml: Contains the configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce (see the sketch after this list).
· hdfs-site.xml: All the configuration settings for the HDFS daemons (the namenode, the secondary namenode, and the datanodes) can be specified in this file.
· mapred-site.xml: Configuration settings for the MapReduce daemons (the jobtracker and the tasktrackers) go here.
· masters: Contains a list of machines (one per line) that each run a secondary namenode.
· slaves: Contains a list of machines (one per line) that each run a datanode and a tasktracker.
· hadoop-metrics.properties: All the properties controlling how metrics are published in Hadoop can be specified in this file.
· log4j.properties: Contains the properties for system log files, the namenode audit log, and the task log for the tasktracker child process.
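
As an illustration, a minimal core-site.xml for a single-node setup might look like the following; fs.defaultFS is the Hadoop 2.x property name (Hadoop 1.x used fs.default.name), and the host and port here are placeholders.

<?xml version="1.0"?>
<!-- core-site.xml: settings common to HDFS and MapReduce -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <!-- URI of the default file system, i.e. the NameNode address -->
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>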
Difference between Hadoop 1.x and Hadoop 2.2
Limitations of Hadoop 1.x:
· Limited to about 4,000 nodes per cluster.
· JobTracker work scales as O(number of tasks in a cluster).
· The JobTracker is a bottleneck: it handles resource management, job scheduling, and monitoring.
· Only one namespace for managing HDFS.
· Map and Reduce slots are static.
· MapReduce is the only kind of job that can run.
Apache Hadoop 2.2.0 includes significant improvements over the previous stable release (hadoop-1.x). Below is a short overview of the improvements to both HDFS and MapReduce in Hadoop 2.2.
· HDFS Federation: Multiple independent NameNodes/namespaces are used in federation to scale the name service horizontally. The NameNodes are federated, which means they are independent and do not require any coordination with each other. All the NameNodes use the DataNodes as common storage for blocks. Each DataNode registers with all the NameNodes in the cluster. DataNodes send periodic heartbeats and block reports, and handle commands from the NameNodes.
· MapReduce NextGen (YARN/MRv2): Up to Hadoop 1.x, the two major functions of the JobTracker were resource management and job life-cycle management. The new architecture, introduced in hadoop-0.23, divides these tasks into separate components.
Now, the newly introduced ResourceManager manages the global assignment of compute resources to applications, and the per-application ApplicationMaster manages the application's scheduling and coordination. An application is either a single job in the sense of classic MapReduce jobs or a DAG (Directed Acyclic Graph) of such jobs. The computation fabric is formed by the ResourceManager and the NodeManager daemon, which manages the user processes on each machine. The per-application ApplicationMaster is, in effect, a framework-specific library tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
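
To run MapReduce jobs on YARN, a Hadoop 2.x cluster is typically configured with properties along the following lines. This is a minimal sketch rather than a complete configuration, and the ResourceManager hostname is a placeholder.

<!-- mapred-site.xml: tell the MapReduce runtime to use YARN -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

<!-- yarn-site.xml: where the ResourceManager runs, plus the shuffle service -->
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>resourcemanager.example.com</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>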
Hadoop 2.2 Features:
· Potentially up to 10,000 nodes per cluster.
· ResourceManager work scales as O(cluster size).
· Supports multiple namespaces for managing HDFS.
· Efficient cluster utilization (YARN).
· Backward and forward compatible with MRv1.
· Any application can integrate with Hadoop.