An Introduction to HBase
Apache HBase is an open-source, non-relational database that runs on top of the Hadoop Distributed File System (HDFS) and is written in Java. It is column-oriented and provides fault-tolerant storage and quick access to large quantities of sparse data. Sparse data means small amounts of meaningful information caught within a large collection of empty or unimportant data, such as finding the 50 largest items in a group of 2 billion records. HBase features compression, in-memory operation, and Bloom filters on a per-column basis. It also adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts, and deletes. HBase is useful when you need random, real-time read/write access to your Big Data. It was created for hosting very large tables, with billions of rows and millions of columns, on clusters of commodity hardware.
Features of HBase include:
· Linear and modular scalability.
· Atomic and strongly consistent row-level operations.
· Fault-tolerant storage for large quantities of data.
· Flexible data model.
· Automatic sharding: HBase tables are distributed on the cluster via regions, and regions are automatically split and re-distributed as your data grows.
· Easy-to-use Java API for client access (a minimal sketch follows this list).
· Hadoop/HDFS integration: HBase supports HDFS out of the box as its distributed file system.
· MapReduce: HBase supports massively parallelized processing via MapReduce, with HBase usable as both source and sink.
· Near-real-time lookups.
· High availability through automatic failover.
· Block Cache and Bloom filters: HBase supports a Block Cache and Bloom filters for high-volume query optimization.
· Server-side processing via filters and coprocessors.
· Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia, or via JMX.
· Replication across data centers.
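To give a feel for the Java client API mentioned above, here is a minimal sketch of writing and reading a single cell using the classic HTable client from the 0.9x era covered in this article. The table name "mytable", family "cf", and qualifier "col" are hypothetical, and a table with that column family is assumed to already exist:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseHelloWorld {
        public static void main(String[] args) throws Exception {
            // Reads hbase-site.xml / hbase-default.xml from the classpath.
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "mytable");  // hypothetical table
            try {
                // Write one cell: row "row1", family "cf", qualifier "col".
                Put put = new Put(Bytes.toBytes("row1"));
                put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value1"));
                table.put(put);

                // Read the same cell back.
                Get get = new Get(Bytes.toBytes("row1"));
                Result result = table.get(get);
                byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"));
                System.out.println(Bytes.toString(value));  // prints "value1"
            } finally {
                table.close();
            }
        }
    }

Note that all keys and values cross the wire as raw byte arrays, which is why every argument is converted through Bytes.toBytes.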
History of HBase
Google published its Bigtable paper in 2006, and HBase development started at the end of that year. An initial HBase prototype was created as a Hadoop contrib module in 2007, and the first usable HBase was released at the end of 2007. In 2008, Hadoop became an Apache top-level project and HBase became its subproject; HBase 0.18 and 0.19 were released in October 2008. In 2010, HBase became an Apache top-level project in its own right. HBase 0.92 was released in 2011, and the latest release as of this writing is 0.96.
When to use HBase?
HBase is useful only if:
· You have a high volume of data to store.
· You have a large volume of unstructured data.
· You need high scalability.
· You can live without all the extra features that an RDBMS provides (e.g., typed columns, secondary indexes, transactions, advanced query languages, etc.). An application built against an RDBMS cannot be "ported" to HBase by simply changing a JDBC driver, for example; consider moving from an RDBMS to HBase as a complete redesign as opposed to a port.
· You have lots of versioned data and you want to store all of it.
· You want column-oriented data.
· You have enough hardware. Even HDFS doesn't do well with anything less than five DataNodes (due to things such as HDFS block replication, which defaults to a factor of 3), plus a NameNode.
When Not to use HBase?
HBase isn't suitable for every problem. It is not useful if:
· You only have a few thousand or a few million rows. In that case a traditional RDBMS might be a better choice, because all of your data might wind up on a single node (or two) while the rest of the cluster sits idle.
· You cannot live without RDBMS features.
· You have fewer than five DataNodes when the replication factor is 3.
What is a NoSQL Database?
HBase is a type of "NoSQL" database. "NoSQL" is a general term meaning that the database isn't an RDBMS that supports SQL as its primary access language, and there are many types of NoSQL databases. Berkeley DB is an example of a local NoSQL database, whereas HBase is very much a distributed database. HBase is really more a "Data Store" than a "Data Base" because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages. Nevertheless, it offers the following capabilities:
· Strongly consistent reads/writes: HBase is not an "eventually consistent" data store. This makes it very suitable for tasks such as high-speed counter aggregation.
· Automatic sharding: HBase tables are distributed on the cluster via regions, and regions are automatically split and re-distributed as your data grows.
· Automatic RegionServer failover.
· Hadoop/HDFS integration: HBase supports HDFS out of the box as its distributed file system.
· MapReduce: HBase supports massively parallelized processing via MapReduce, with HBase usable as both source and sink.
· Java Client API: HBase supports an easy-to-use Java API for programmatic access.
· Thrift/REST API: HBase also supports Thrift and REST for non-Java front-ends.
· Block Cache and Bloom filters: HBase supports a Block Cache and Bloom filters for high-volume query optimization.
· Operational management: HBase provides built-in web pages for operational insight, as well as JMX metrics.
Difference between HBase and HDFS
HDFS:
· Is a distributed file system that is well suited for the storage of large files. It is not, however, a general-purpose file system, and it does not provide fast individual record lookups in files.
· Is suited for high-latency batch-processing operations.
· Is accessed primarily through MapReduce.
· Is designed for batch processing and hence has no concept of random reads/writes.
HBase:
· Is built on top of HDFS and provides fast record lookups (and updates) for large tables. HBase internally puts the data in indexed "StoreFiles" that live on HDFS for high-speed lookups.
· Is built for low-latency operations.
· Provides access to single rows from billions of records.
· Serves data through shell commands, client APIs in Java, REST, Avro, or Thrift.
Difference between HBase and RDBMS

HBase                                              | RDBMS
---------------------------------------------------|---------------------------------------
Column-oriented                                    | Row-oriented (mostly)
Flexible schema; columns can be added on the fly   | Fixed schema
Designed to store denormalized data                | Designed to store normalized data
Good with sparse tables                            | Not optimized for sparse tables
Joins done via MapReduce, which is not optimized   | Optimized for joins
Tight integration with MapReduce                   | No integration with MapReduce
Horizontal scalability: just add hardware          | Hard to shard and scale
Good for semi-structured and structured data       | Good for structured data
HBase Run Modes
HBase has two run modes: standalone and distributed. Whichever mode you use, you will need to configure HBase by editing files in the HBase conf directory. At a minimum, you must edit conf/hbase-env.sh to tell HBase which Java to use. In this file you set HBase environment variables such as the heap size and other options for the JVM, the preferred location for log files, etc. Set JAVA_HOME to point at the root of your Java installation.
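For illustration, a minimal conf/hbase-env.sh might contain only the following. The JDK path is a hypothetical example, and HBASE_HEAPSIZE is optional (in this era it was specified in MB):

    # conf/hbase-env.sh
    # The java implementation to use. Point JAVA_HOME at the root of your JDK.
    export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64   # hypothetical path
    # Optional: maximum heap for the HBase JVMs, in MB.
    export HBASE_HEAPSIZE=1000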
· Standalone HBase: This is the default mode. In standalone mode, HBase does not use HDFS; it uses the local filesystem instead, and it runs all HBase daemons and a local ZooKeeper in the same Java Virtual Machine. ZooKeeper binds to a well-known port so clients may talk to HBase.
· Distributed: Distributed mode can be subdivided into:
· Pseudo-distributed mode: In this mode, all daemons run on a single node. Pseudo-distributed mode is nothing but fully-distributed mode run on a single host. You can use this configuration for testing and prototyping on HBase, but do not use it for production or for evaluating HBase performance. This mode can run against the local filesystem or against an instance of the Hadoop Distributed File System (HDFS); a minimal configuration sketch follows this list.
· Fully-distributed mode: In this mode, the daemons are spread across all nodes in the cluster. Fully-distributed mode can run ONLY on HDFS.
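As a sketch of the pseudo-distributed case backed by HDFS, two properties in conf/hbase-site.xml do most of the work. The NameNode host and port below are assumptions and must match your HDFS setup:

    <!-- conf/hbase-site.xml -->
    <configuration>
      <property>
        <name>hbase.rootdir</name>
        <!-- Where HBase writes its data; host/port are hypothetical. -->
        <value>hdfs://localhost:8020/hbase</value>
      </property>
      <property>
        <name>hbase.cluster.distributed</name>
        <!-- false for standalone mode; true for (pseudo-)distributed. -->
        <value>true</value>
      </property>
    </configuration>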
HBase Architecture
· Splits
· Auto-Sharding
· Master
· Region Servers
· HFile
The HBase physical architecture consists of servers in a master-slave relationship. Typically, an HBase cluster has one Master node, called HMaster, and multiple Region Servers, called HRegionServer. Each Region Server contains multiple Regions, called HRegions.
Logical Architecture
Data in HBase is stored in Tables, and Tables are stored in Regions. A Table is partitioned into multiple Regions whenever it becomes too large. These Regions are assigned to Region Servers across the cluster. Each Region Server hosts roughly the same number of Regions.
The responsibilities of the HMaster include:
· Performing administration
· Managing and monitoring the cluster
· Assigning Regions to the Region Servers
· Controlling load balancing and failover
The HRegionServers, in turn, perform the following:
· Hosting and managing Regions
· Handling the read/write requests
· Splitting the Regions automatically
· Communicating with the clients directly
Physical Architecture
An HBase Region Server is collocated with an HDFS DataNode. HBase clients communicate directly with the Region Servers for sending and receiving data. The HMaster is responsible for Region assignment and for handling DDL operations. ZooKeeper maintains the online configuration state. The HMaster and ZooKeeper are not involved in the data path.
Anatomy of a RegionServer
A RegionServer contains a single Write-Ahead Log (WAL, called the HLog), a single BlockCache, and multiple Regions. A Region in turn contains multiple Stores, one for each Column Family. Each Store is made up of a MemStore and multiple StoreFiles, and a StoreFile corresponds to a single HFile. HFiles and the HLog are persisted on HDFS.
.META is a system table that contains the mapping of Regions to Region Servers. When reading or writing data, HBase clients look up the required Region information in the .META table and then communicate directly with the appropriate Region Server. Each Region is identified by its start key (inclusive) and its end key (exclusive).
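The client library performs this .META lookup (and caches the result) transparently, so application code never touches .META directly. As an illustration with a hypothetical table and row key, the classic Java client can even report which server holds a given row:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HRegionLocation;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WhereIsMyRow {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "mytable");  // hypothetical table
            try {
                // The client consults .META to find which Region Server
                // hosts the Region whose key range contains this row key.
                HRegionLocation loc = table.getRegionLocation(Bytes.toBytes("row1"));
                System.out.println("row1 is served by "
                        + loc.getHostname() + ":" + loc.getPort());
            } finally {
                table.close();
            }
        }
    }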
HBase Data Model
The HBase data model is designed to accommodate semi-structured data that may vary in data type, field size, and number of columns. It is laid out in a way that makes it easy to partition the data and distribute it across the cluster. The data model in HBase is made up of several logical components: Tables, Rows, Column Families, Column Qualifiers, Columns, Cells, and Versions.
· Tables: HBase organizes data into tables. Table names are Strings composed of characters that are safe for use in a file system path. HBase tables are best thought of as logical collections of rows stored in separate partitions called Regions.
· Rows: Within a table, data is stored according to its row. Rows are identified uniquely by their row key. Row keys do not have a data type and are always treated as a byte[] (byte array).
· Column Families: Data within a row is grouped by column family. Each column family has one or more columns, and the columns in a family are stored together in a low-level storage file known as an HFile. Column families also impact the physical arrangement of data stored in HBase; for this reason, they must be defined up front and are not easily modified. Every row in a table has the same column families, although a row need not store data in all of them. Column family names are Strings composed of characters that are safe for use in a file system path.
· Column Qualifiers: Data within a column family is addressed via its column qualifier, or simply, column. Column qualifiers need not be specified in advance and need not be consistent between rows. A fully qualified column consists of the column family name concatenated with the column name using a colon, for example: columnfamily:columnname. Like row keys, column qualifiers do not have a data type and are always treated as a byte[].
· Cells: A combination of row key, column family, and column qualifier uniquely identifies a cell. The data stored in a cell is referred to as that cell's value. Values also do not have a data type and are always treated as a byte[].
· Versions/Timestamps: Values within a cell are versioned. Versions are identified by their version number, which by default is the timestamp of when the cell was written. If a timestamp is not specified during a write, the current timestamp is used; if no timestamp is specified for a read, the latest version is returned. The number of cell value versions retained by HBase is configured per column family, and by default three versions are kept. The full coordinates of a value are thus (Row, Family:Column, Timestamp) → Value. (A sketch of declaring a family's version count follows this list.)
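To make the above concrete, here is a minimal sketch of declaring a table with one column family and a non-default version count, using the classic HBaseAdmin API. The table and family names are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class CreateTable {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            try {
                // Column families must be declared up front...
                HColumnDescriptor family = new HColumnDescriptor("cf");  // hypothetical family
                family.setMaxVersions(5);  // retain 5 versions instead of the default 3
                HTableDescriptor table = new HTableDescriptor(TableName.valueOf("mytable"));
                table.addFamily(family);
                admin.createTable(table);
                // ...but column qualifiers are supplied freely on each Put.
            } finally {
                admin.close();
            }
        }
    }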
Storage Model
· Column-oriented database (column families).
· A table consists of rows, each of which has a primary key (row key).
· Each row may have any number of columns.
· The table schema only defines column families (a column family can contain any number of columns).
· Each cell value has a timestamp.
Access to Data
· Data is addressed by table, row key, family, column, and timestamp.
· The API is very simple: Put, Get, Delete, Scan.
· The scan API lets you efficiently iterate over ranges of rows while limiting which columns are returned or the number of versions of each cell. You can match columns using filters and select versions using time ranges, specifying start and end times. (A sketch follows this list.)
· Internally, each row is stored as a set of key/value pairs, where the key is the (row key, family, qualifier, timestamp) coordinate and the value is the cell's data.
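Here is a minimal sketch of such a scan; the table name, column family, qualifier, and row-key range are all hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ScanExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "mytable");
            try {
                // Scan rows in ["row100", "row200"): start key inclusive, stop key exclusive.
                Scan scan = new Scan(Bytes.toBytes("row100"), Bytes.toBytes("row200"));
                scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"));  // limit returned columns
                scan.setMaxVersions(3);                                     // up to 3 versions per cell
                scan.setTimeRange(0L, System.currentTimeMillis());          // select versions by time
                ResultScanner scanner = table.getScanner(scan);
                try {
                    for (Result row : scanner) {
                        System.out.println(Bytes.toString(row.getRow()));
                    }
                } finally {
                    scanner.close();
                }
            } finally {
                table.close();
            }
        }
    }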
HBase and ZooKeeper
· HBase uses ZooKeeper extensively for region assignment. "ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services."
· HBase can manage the ZooKeeper daemons for you, or you can install and manage them separately (see the snippet below).
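That choice is made in conf/hbase-env.sh via a single switch:

    # conf/hbase-env.sh
    # Tell HBase whether it should manage its own instance of ZooKeeper or not.
    export HBASE_MANAGES_ZK=true   # set to false to run an external ZooKeeper ensemble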