Thursday, May 21, 2015

Hortonworks HDP installation using LocalRepo

Hadoop.HDP.Installation.LocalRepo
1.     Introduction
This document describes how to create a local YUM repository to install HDP on the machines in your cluster.

2.     Installation
1.     Log in to your local YUM repository server
2.     Download the repo files and save them in /etc/yum.repos.d
·         RHEL/CentOS/Oracle 6
·         Ambari 2.2.2.0
·         HDP 2.4.2.0
·         HDP Utility 1.1.0.20
·    RHEL/CentOS/Oracle 7
·         Ambari 2.2.2.0
·         HDP 2.4.2.0
·         HDP Utility 1.1.0.20
3.     Install the httpd package to set up the web server (the commands below also stop and disable the iptables firewall so that cluster nodes can reach the repository)
yum install httpd
service httpd start
service iptables stop
chkconfig httpd on
chkconfig iptables off
4.     Install the yum-utils and createrepo packages
yum install yum-utils
yum install createrepo
5.     Synchronize the remote YUM repositories to your local repository
mkdir -p /var/www/html/repo/hdp
reposync -d -g -r "Updates-ambari-2.2.2.0"   ## this form (with -g) had issues in our environment; use the commands below instead
reposync -d -r "Updates-ambari-2.2.2.0"
reposync -d -r "HDP-2.4.2.0"
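Once the sync completes, generate the repository metadata with createrepo and point the cluster nodes at the local web server. The sketch below assumes reposync was run from /var/www/html/repo/hdp (so each repository landed in a sub-directory named after its repo id) and uses the placeholder hostname repo.example.com; adjust both to match your environment.
cd /var/www/html/repo/hdp
createrepo Updates-ambari-2.2.2.0        ## build metadata for the synced Ambari repository
createrepo HDP-2.4.2.0                   ## build metadata for the synced HDP repository
## on every cluster node, point a repo file at the local server
cat > /etc/yum.repos.d/ambari.repo <<'EOF'
[Updates-ambari-2.2.2.0]
name=Ambari 2.2.2.0 (local)
baseurl=http://repo.example.com/repo/hdp/Updates-ambari-2.2.2.0
enabled=1
gpgcheck=0
EOF
yum clean all && yum repolist            ## verify the local repository is visible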



3.     Reference
·         Creating a Local Yum Repository
·         Creating a Local Parcel Repository
·         Cloudera Manager RPM Repo (Section: Establish Your Cloudera Manager Repository Strategy)
·         Cloudera CDH 5.7.1 RPM Repo
·         Cloudera CDH 5.7.1 Parcel Repo


Monday, May 18, 2015

Hadoop Multi Node Installation Guide




Introduction

This document describes the steps needed to set up a distributed, multi-node Apache Hadoop cluster. The recommended approach has two steps: first, install two single-node Hadoop machines, then configure and test them as local Hadoop systems; second, merge the two single-node systems into a multi-node cluster, with one Ubuntu box designated as the master and the other as a slave only. The master will also act as a slave for data storage and processing.


Pre-requisites

There are certain required steps to be completed before starting the Hadoop installation. They are listed below, and each step is described with the relevant commands for additional understanding.

Configure single-node Hadoop cluster

Hadoop requires an installation of Java 1.6 or higher; it is generally better to go with the latest Java version. In our case we have installed Java 7u25.
  • Please refer to the Installation document for Single Node Hadoop cluster setup, to set up the Hadoop cluster on each Ubuntu box.
  • Use the same settings (installation paths, etc.) on both machines so that there are no issues while merging the 2 boxes into a multi-node cluster setup.

Networking

Once the single-node Hadoop clusters are up and running, we need to make configuration changes to designate one Ubuntu box as the “master” (this box will also act as a slave) and the other Ubuntu box as the “slave”.
Networking plays a major role here. Before merging the servers into a multi-node cluster, make sure that both Ubuntu boxes can ping each other (a quick connectivity check is shown at the end of this section). Both boxes should be connected to the same network/hub so that they can talk to each other.
  • Select one of the Ubuntu boxes to serve as the Master and the other to serve as the Slave. Make a note of their IP addresses. In our case, we have selected 172.16.17.68 as the master (hostname of the master machine is Hadoopmaster) and 172.16.17.61 as the slave (hostname of the slave machine is hadoopnode).
  • Make sure that the Hadoop services are shut down before proceeding with the network setup. Execute the following command on each box to shut down the services
    bin/stop-all.sh
  • Give both machines their respective hostnames in their networking setup, i.e. the /etc/hosts file. Execute the following command
    sudo vi /etc/hosts
  • Add the following lines to the file.
    172.16.17.68 Hadoopmaster
    172.16.17.61 hadoopnode
  • If more nodes (slaves) are added to the cluster, the /etc/hosts file on each machine should be updated accordingly, using unique names (e.g. 172.16.17.xx hadoopnode1, 172.16.17.xy hadoopnode2, and so on).
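As a quick sanity check (using the hostnames configured above), verify connectivity in both directions before proceeding:
    # on the master (Hadoopmaster)
    ping -c 3 hadoopnode
    # on the slave (hadoopnode)
    ping -c 3 Hadoopmaster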

Enabling SSH Access

The Hadoop user hduser on the master (i.e. hduser@Hadoopmaster) should be able to connect via SSH to:
i) its own user account, i.e. ssh Hadoopmaster
ii) the Hadoop user account hduser on the slave (i.e. hduser@hadoopnode)
  • Add hduser@Hadoopmaster’s public SSH key to the authorized_keys file of hduser@hadoopnode, by executing the following command.
    ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@hadoopnode
  • The above command will prompt for the password of hduser on the slave; once the password is provided, it will copy the SSH key for you.
  • Next, test the SSH setup by logging in as hduser on the master and connecting to the hduser account on the slave. This will add the slave’s host key fingerprint to the master’s known_hosts file. Also, test the connection from master to master. Execute the following commands
    hduser@Hadoopmaster:~$ ssh Hadoopmaster
    hduser@Hadoopmaster:~$ ssh hadoopnode




Hadoop Multi-Node Setup

There are certain required steps to be completed for the Hadoop multi-node setup. They are listed below, and each step is described with the relevant commands for additional understanding.

Configuration

  • Configure the conf/masters file (on Hadoopmaster)
    The conf/masters file contains the list of machines on which Hadoop will start a SecondaryNameNode. On the master, edit this file and add the line “Hadoopmaster” as below
    vi conf/masters
    Hadoopmaster

  • Configure the conf/slaves file (on Hadoopmaster)
    The conf/slaves file contains the list of machines on which Hadoop will run its DataNodes and TaskTrackers. On the master, edit this file and add the lines “Hadoopmaster” and “hadoopnode” as below
    vi conf/slaves
    Hadoopmaster
    hadoopnode


    If additional nodes (slaves) are added, list them in this file as well, one per line.
  • Configure the conf/core-site.xml.
    The following configuration files, conf/core-site.xml, conf/mapred-site.xml and conf/hdfs-site.xml, should be updated on ALL machines to reflect the same values (in our case, on both Hadoopmaster and hadoopnode).
    In core-site.xml (conf/core-site.xml), change the fs.default.name parameter to specify the hostname of the master. In our case it is Hadoopmaster.
    vi conf/core-site.xml
    Make the changes as below
    <property>
    <name>fs.default.name</name>
    <value>hdfs://Hadoopmaster:54310</value>
    <description>The name of the default file system. A URI whose scheme and authority determine the 
    FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming
    the FileSystem implementation class. The uri's authority is used to determine the host, port, 
    etc. for a filesystem.</description>
    </property>

  • Configure the conf/hdfs-site.xml.
    In hdfs-site.xml (conf/hdfs-site.xml), change the dfs.replication parameter to specify the default block replication. This defines how many nodes each block is replicated to before it is considered available. In our case, this value is set to 2.
    vi conf/hdfs-site.xml
    Make the changes as below 
    <property>
    <name>dfs.replication</name>
    <value>2</value>
    <description>Default block replication.
    The actual number of replications can be specified when the file is created.
    The default is used if replication is not specified in create time.
    </description>
    </property>
  • Configure the conf/mapred-site.xml.
    In mapred-site.xml (conf/mapred-site.xml), change the mapred.job.tracker parameter to specify the JobTracker host and port. In our case, this is the master, Hadoopmaster.
    vi conf/mapred-site.xml
    Make the changes as below
    <property>
    <name>mapred.job.tracker</name>
    <value>Hadoopmaster:54311</value>
    <description>The host and port that the MapReduce job tracker runs at. 
    If "local", then jobs are run in-process as a single map and reduce task.
    </description>
    </property>

Additional Settings

There are some other configuration options worth studying. The following information is taken from the Hadoop API Overview.
In file conf/mapred-site.xml:
  • “mapred.local.dir”
    Determines where temporary MapReduce data is written. It also may be a list of directories.
  • “mapred.map.tasks”
    As a rule of thumb, use 10x the number of slaves (i.e., number of TaskTrackers).
  • “mapred.reduce.tasks”
    As a rule of thumb, use num_tasktrackers * num_reduce_slots_per_tasktracker * 0.99. If num_tasktrackers is small (as in the case of this tutorial), use (num_tasktrackers - 1) * num_reduce_slots_per_tasktracker.
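For illustration only, the snippet below shows how these options would sit in conf/mapred-site.xml alongside mapred.job.tracker. The directory path and task counts are assumptions for our small two-node cluster (2 TaskTrackers, assuming the default of 2 reduce slots each), not recommendations.
    <property>
    <name>mapred.local.dir</name>
    <value>/app/hadoop/tmp/mapred/local</value>
    <description>Directory (or comma-separated list of directories) for temporary MapReduce data.</description>
    </property>
    <property>
    <name>mapred.map.tasks</name>
    <value>20</value>
    <description>Roughly 10x the number of TaskTrackers (2 in this cluster).</description>
    </property>
    <property>
    <name>mapred.reduce.tasks</name>
    <value>2</value>
    <description>(num_tasktrackers - 1) * num_reduce_slots_per_tasktracker = (2 - 1) * 2.</description>
    </property>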

Formatting the HDFS filesystem via NameNode

The first step to starting your Hadoop installation is formatting the Hadoop filesystem (HDFS), which is implemented on top of the local filesystems of your cluster nodes. This step is required only the first time you set up a Hadoop cluster. Do not format a running Hadoop filesystem, as you will lose all the data currently in the cluster (in HDFS)!
To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), execute the following command in the $HADOOP_HOME/bin directory.
hadoop namenode -format

Starting the multi-node cluster

Starting the cluster is performed in 2 steps.
i) Starting the HDFS Daemons – the NameNode daemon is started on the master (Hadoopmaster) and the DataNode daemons are started on all slaves (in our case Hadoopmaster and hadoopnode).
ii) Starting the MapReduce Daemons - the JobTracker is started on the master (Hadoopmaster), and the TaskTracker daemons are started on all slaves (in our case Hadoopmaster and hadoopnode).
  • HDFS Daemon.
    Start HDFS by executing the following command on the master (Hadoopmaster). This will bring up HDFS with the NameNode on the master and the DataNodes on the machines listed in the conf/slaves file.
    bin/start-dfs.sh

    Run the following command to list the Java processes running on the master.
     jps
  • Mapred Daemon.
    Start MapReduce by executing the following command on the master (Hadoopmaster). This will bring up the MapReduce cluster with the JobTracker on the master and the TaskTrackers on the machines listed in the conf/slaves file.
    bin/start-mapred.sh

    Run the following command to list the Java processes running on the slave (the expected output for both boxes is sketched below).
     jps
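For reference, with the master also acting as a slave, jps on the master should list the NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker daemons, while the slave shows only DataNode and TaskTracker. A rough sketch of the output on the master (process IDs will differ) looks like:
     hduser@Hadoopmaster:~$ jps
     14799 NameNode
     14880 SecondaryNameNode
     14977 DataNode
     15183 JobTracker
     15897 TaskTracker
     16284 Jps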

Stopping the multi-node cluster

Like starting the cluster, stopping the cluster is also performed in 2 steps.
i) Stopping the MapReduce Daemons – the JobTracker daemon is stopped on the master (Hadoopmaster) and the TaskTracker daemons are stopped on all slaves (in our case Hadoopmaster and hadoopnode).
ii) Stopping the HDFS Daemons - the NameNode daemon is stopped on the master (Hadoopmaster) and the DataNode daemons are stopped on all slaves (in our case Hadoopmaster and hadoopnode).
  • Stop MapReduce Daemon.
    Stop MapReduce by executing the following command on the master (Hadoopmaster). This will shut down the JobTracker daemon and the TaskTrackers on the machines listed in the conf/slaves file.
     bin/stop-mapred.sh
  • Stop HDFS Daemon.
    Stop HDFS by executing the following command on the master (Hadoopmaster). This will shut down HDFS by stopping the NameNode on the master and the DataNodes on the machines listed in the conf/slaves file.
     bin/stop-dfs.sh


An Introduction to HBase

Apache HBase is an open source, non-relational database that runs on top of the Hadoop Distributed File System (HDFS) and is written in Java. It is columnar and provides fault-tolerant storage and quick access to large quantities of sparse data. Sparse data means small amounts of useful information caught within a large collection of unimportant data, such as finding the 50 largest items in a group of 2 billion records. HBase features compression, in-memory operation, and Bloom filters on a per-column basis. It also adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts and deletes. HBase is useful when you need random, real-time read/write access to your Big Data. HBase was created for hosting very large tables with billions of rows and millions of columns on clusters of commodity hardware.
Features of HBase include:
·         Linear and modular scalability
·         Atomic and strongly consistent row-level operations.
·         Fault tolerant storage for large quantities of data.
·         Flexible data model.
·         Automatic sharding: HBase tables are distributed on the cluster via regions, and regions are automatically split and re-distributed as your data grows.
·         Easy to use Java API for client access.
·         Hadoop/HDFS Integration: HBase supports HDFS out of the box as its distributed file system.
·         MapReduce: HBase supports massively parallelized processing via MapReduce for using HBase as both source and sink.
·         Near real-time lookups.
·         High availability through automatic failover.
·         Block Cache and Bloom Filters: HBase supports a Block Cache and Bloom Filters for high volume query optimization.
·         Server side processing via filters and co-processors.
·         Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX
·         Replication across the data center.
History of HBase
Google published its Bigtable paper in 2006, and HBase development started at the end of that year. An initial HBase prototype was created as a Hadoop contrib module in 2007, and the first usable HBase was released at the end of 2007. In 2008, Hadoop became an Apache top-level project and HBase became one of its subprojects; HBase 0.18 and 0.19 were released in October 2008. In 2010, HBase became an Apache top-level project in its own right. HBase 0.92 was released in 2011. The latest release is 0.96.



When to use HBase?
HBase is useful only if:
·         You have high volume of data to store.
·         You have a large volume of unstructured data.
·         You want to have high scalability.
·         You can live without all the extra features that an RDBMS provides (e.g., typed columns, secondary indexes, transactions, advanced query languages, etc.) An application built against an RDBMS cannot be "ported" to HBase by simply changing a JDBC driver, for example. Consider moving from an RDBMS to HBase as a complete redesign as opposed to a port.
·         You have lots of versioned data and you want to store all of it.
·         You want column-oriented data.
·         You have enough hardware. Even HDFS doesn't do well with anything less than 5 DataNodes (due to things such as HDFS block replication which has a default of 3), plus a NameNode.
When Not to use HBase?
HBase isn't suitable for every problem. It is not useful if:
·         You have only a few thousand/million rows; in that case a traditional RDBMS might be a better choice, because all of your data might wind up on a single node (or two) and the rest of the cluster may be sitting idle.
·         You cannot live without RDBMS commands.
·         You have fewer than 5 DataNodes (given the default HDFS replication factor of 3).
What is a NoSQL Database?
HBase is a type of "NoSQL" database. "NoSQL" is a general term meaning that the database isn't an RDBMS which supports SQL as its primary access language; there are many types of NoSQL databases. Berkeley DB is an example of a local NoSQL database, whereas HBase is very much a distributed database. HBase is really more a "Data Store" than a "Data Base" because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages. In return, HBase offers the following:
·         Strongly consistent reads/writes: HBase is not an "eventually consistent" DataStore. This makes it very suitable for tasks such as high-speed counter aggregation.
·         Automatic sharding: HBase tables are distributed on the cluster via regions, and regions are automatically split and re-distributed as your data grows.
·         Automatic RegionServer failover
·         Hadoop/HDFS Integration: HBase supports HDFS out of the box as its distributed file system.
·         MapReduce: HBase supports massively parallelized processing via MapReduce for using HBase as both source and sink.
·         Java Client API: HBase supports an easy to use Java API for programmatic access.
·         Thrift/REST API: HBase also supports Thrift and REST for non-Java front-ends.
·         Block Cache and Bloom Filters: HBase supports a Block Cache and Bloom Filters for high volume query optimization.
·         Operational Management: HBase provides built-in web pages for operational insight as well as JMX metrics.
Difference between HBase and HDFS
HDFS:
·         Is a distributed file system that is well suited for the storage of large files. It is not, however, a general purpose file system, and does not provide fast individual record lookups in files.
·         Is suited for high-latency batch processing operations.
·         Data is primarily accessed through MapReduce.
·         Is designed for batch processing and hence doesn’t have a concept of random reads/writes.
HBase:
·         Is built on top of HDFS and provides fast record lookups (and updates) for large tables. HBase internally puts the data in indexed "StoreFiles" that exist on HDFS for high-speed lookups.
·         Is built for Low Latency operations
·         Provides access to single rows from billions of records
·         Data is accessed through shell commands, Client APIs in Java, REST, Avro or Thrift
Difference between HBase and RDBMS
·         HBase is column-oriented; an RDBMS is row-oriented (mostly).
·         HBase has a flexible schema, where columns can be added on the fly; an RDBMS has a fixed schema.
·         HBase is designed to store denormalized data; an RDBMS is designed to store normalized data.
·         HBase is good with sparse tables; an RDBMS is not optimized for sparse tables.
·         HBase joins are done via MapReduce and are not optimized; an RDBMS is optimized for joins.
·         HBase has tight integration with MapReduce; an RDBMS has no integration with MapReduce.
·         HBase scales horizontally (just add hardware); an RDBMS is hard to shard and scale.
·         HBase is good for semi-structured as well as structured data; an RDBMS is good for structured data.
HBase Run Modes
HBase has 2 run modes: Standalone HBase and Distributed. Whichever mode you use, you will need to configure HBase by editing files in the HBase conf directory. At a minimum, you must edit conf/hbase-env.sh to tell HBase which Java to use. In this file you set HBase environment variables such as the heap size and other options for the JVM, the preferred location for log files, etc. Set JAVA_HOME to point at the root of your Java install (a minimal configuration sketch follows the mode descriptions below).
·         Standalone HBase: This is the default mode. In standalone mode, HBase does not use HDFS -- it uses the local filesystem instead -- and it runs all HBase daemons and a local ZooKeeper all up in the same Java Virtual Machine. Zookeeper binds to a well known port so clients may talk to HBase.
·         Distributed: Distributed mode can be subdivided into:
·         Pseudo-distributed mode: In this mode all daemons run on a single node. A pseudo-distributed mode is simply a fully-distributed mode run on a single host. You can use this configuration for testing and prototyping on HBase. Do not use this configuration for production or for evaluating HBase performance. This mode can run against the local filesystem or against an instance of the Hadoop Distributed File System (HDFS).
·         Fully-distributed mode: In this mode the daemons are spread across all nodes in the cluster. Fully-distributed mode can ONLY run on HDFS.
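As a minimal configuration sketch (the Java path and data directories below are assumptions, not required values): in conf/hbase-env.sh, point HBase at your Java install
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
In conf/hbase-site.xml, standalone mode only needs a local hbase.rootdir (or none at all, in which case HBase writes under /tmp)
<property>
<name>hbase.rootdir</name>
<value>file:///home/hduser/hbase</value>
</property>
For (pseudo-)distributed mode, hbase.rootdir instead points at HDFS (hostname and port taken from the cluster configured earlier) and hbase.cluster.distributed is set to true
<property>
<name>hbase.rootdir</name>
<value>hdfs://Hadoopmaster:54310/hbase</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>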
HBase Architecture
·         Splits
·         Auto-Sharding
·         Master
·         Region Servers
·         HFile
The HBase Physical Architecture consists of servers in a Master-Slave relationship. Typically, the HBase cluster has one Master node, called HMaster, and multiple Region Servers, called HRegionServer. Each Region Server contains multiple Regions, called HRegions.



Logical Architecture
Data in HBase is stored in Tables, and Tables are stored in Regions. A Table is partitioned into multiple Regions whenever it becomes too large. These Regions are assigned to Region Servers across the cluster. Each Region Server hosts roughly the same number of Regions.
The responsibilities of HMaster include:
·         Performing Administration
·         Managing and Monitoring the Cluster
·         Assigning Regions to the Region Servers
·         Controlling the Load Balancing and Failover
The HRegionServers, in turn, perform the following:
·         Hosting and managing Regions
·         Handling the read/write requests
·         Splitting the Regions automatically
·         Communicating with the Clients directly

Physical Architecture
An HBase RegionServer is typically collocated with an HDFS DataNode, and HBase clients communicate directly with the Region Servers to send and receive data. HMaster handles Region assignment and DDL operations, and ZooKeeper maintains the online configuration state. HMaster and ZooKeeper are not involved in the data path.

Anatomy of a RegionServer
A RegionServer contains a single Write-Ahead Log (WAL, called the HLog), a single BlockCache, and multiple Regions. A Region in turn contains multiple Stores, one for each Column Family, and each Store is made up of a MemStore and multiple StoreFiles. A StoreFile corresponds to a single HFile. HFiles and the HLog are persisted on HDFS.
.META. is a system table which contains the mapping of Regions to Region Servers. When trying to read or write data, clients first read the required Region information from the .META. table and then communicate directly with the appropriate Region Server. Each Region is identified by its start key (inclusive) and end key (exclusive).

HBase Data Model
HBase Data Model is designed to accommodate semi-structured data that could vary in data type, field size, and columns. The layout of the data model is in such a way that it becomes easier to partition the data and distribute the data across the cluster. The Data Model in HBase is made of different logical components such as Tables, Rows, Column Families, Column Qualifiers, Columns, Cells and Versions.
·         Tables: HBase organizes data into tables. Table names are Strings and composed of characters that are safe for use in a file system path. HBase Tables are more like a logical collection of rows stored in separate partitions called Regions.
·         Rows: Within a table, data is stored according to its row. Rows are identified uniquely by their row key. Row keys do not have a data type and are always treated as a byte[] (byte array).
·         Column Families: Data within a row is grouped by column family. Each Column Family has one or more Columns, and these Columns in a family are stored together in a low-level storage file known as an HFile. Column families also impact the physical arrangement of data stored in HBase. For this reason, they must be defined up front and are not easily modified. Every row in a table has the same column families, although a row need not store data in all of its families. Column family names are Strings and composed of characters that are safe for use in a file system path.
·         Column Qualifier: Data within a column family is addressed via its column qualifier, or simply, column. Column qualifiers need not be specified in advance. Column Qualifier consists of the Column Family name concatenated with the Column name using a colon – example: columnfamily:columnname. Column qualifiers need not be consistent between rows. Like row keys, column qualifiers do not have a data type and are always treated as a byte[ ].
·         Cell: A combination of row key, column family, and column qualifier uniquely identifies a cell. The data stored in a cell is referred to as that cell’s value. Values also do not have a data type and are always treated as a byte[ ].
·         Version/Timestamp: Values within a cell are versioned. Versions are identified by their version number, which by default is the timestamp of when the cell was written. The current timestamp is used if a timestamp is not specified during a write; if the timestamp is not specified for a read, the latest version is returned. The number of cell value versions retained by HBase is configured per column family, and by default is three. (Row, Family:Column, Timestamp) → Value. A short HBase shell example is shown below.
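To make these terms concrete, here is a short hypothetical session in the HBase shell (the 'employee' table, 'personal' column family and the column names are invented for illustration):
create 'employee', {NAME => 'personal', VERSIONS => 3}                # table with one column family, keeping 3 versions
put 'employee', 'row1', 'personal:name', 'Alice'                      # cell addressed as family:qualifier
put 'employee', 'row1', 'personal:city', 'Pune'
get 'employee', 'row1'                                                # latest version of each cell in the row
get 'employee', 'row1', {COLUMN => 'personal:name', VERSIONS => 3}    # include older versions of one cell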
Storage Model
·         Column-oriented database (column families)
·         A table consists of Rows, each of which has a primary key (row key)
·         Each Row may have any number of Columns
·         The table schema only defines Column Families (a column family can have any number of columns)
·         Each cell value has a timestamp
Access to data
·         Access data through table, Row key, Family, Column, Timestamp
·         API is very simple: Put, Get, Delete, Scan
·         A scan API allows you to efficiently iterate over ranges of rows and to limit which columns are returned or the number of versions of each cell. You can match columns using filters and select versions using time ranges, specifying start and end times. (Example scans are shown below.)
·         Internally, each row is stored as a set of key/value pairs: the key is made up of the row key, column family, column qualifier and timestamp, and the value is the cell data.
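Continuing the hypothetical 'employee' table from the previous example, the commands below sketch how scans and deletes are expressed in the HBase shell, including column selection, row-key ranges and time ranges:
scan 'employee'                                                   # full table scan
scan 'employee', {COLUMNS => ['personal:name'], LIMIT => 10}      # one column, first 10 rows
scan 'employee', {STARTROW => 'row1', STOPROW => 'row5'}          # row-key range [row1, row5)
scan 'employee', {TIMERANGE => [1431900000000, 1431990000000]}    # cells written within a time window (ms epoch)
delete 'employee', 'row1', 'personal:city'                        # delete a single cell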



HBase and Zookeeper:
·         HBase uses ZooKeeper extensively for region assignment. “ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.”
·         HBase can manage the ZooKeeper daemons for you, or you can install and manage them separately.