Monday, May 18, 2015

An Introduction to HBase

Apache HBase is an open-source, non-relational database written in Java that runs on top of the Hadoop Distributed File System (HDFS). It is column-oriented and provides fault-tolerant storage and fast access to large quantities of sparse data. Sparse data means data sets in which most cells are empty, or small amounts of useful information buried in a large collection of records, such as finding the 50 largest items in a group of 2 billion. HBase features compression, in-memory operation, and Bloom filters on a per-column-family basis. It also adds record-level update capabilities to Hadoop, allowing users to perform updates, inserts, and deletes. HBase is useful when you need random, real-time read/write access to your Big Data. It was designed to host very large tables, with billions of rows and millions of columns, on clusters of commodity hardware.
Features of HBase include:
·         Linear and modular scalability
·         Atomic and strongly consistent row-level operations.
·         Fault tolerant storage for large quantities of data.
·         Flexible data model.
·         Automatic sharding: HBase tables are distributed on the cluster via regions, and regions are automatically split and re-distributed as your data grows.
·         Easy to use Java API for client access.
·         Hadoop/HDFS Integration: HBase supports HDFS out of the box as its distributed file system.
·         MapReduce: HBase supports massively parallelized processing via MapReduce for using HBase as both source and sink.
·         Near real-time lookups.
·         High availability through automatic failover.
·         Block Cache and Bloom Filters: HBase supports a Block Cache and Bloom Filters for high volume query optimization.
·         Server side processing via filters and co-processors.
·         Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia, or via JMX.
·         Replication across data centers.
History of HBase
Google published its Bigtable paper in 2006, and HBase development started at the end of that year. An initial HBase prototype was created as a Hadoop contrib module in 2007, and the first usable HBase was released at the end of 2007. In 2008, Hadoop became an Apache top-level project and HBase became one of its subprojects; HBase 0.18 and 0.19 followed shortly after. In 2010, HBase itself became an Apache top-level project. HBase 0.92 was released in 2011, and the latest release at the time of writing was 0.96.

When to use HBase?
HBase is useful only if:
·         You have a high volume of data to store.
·         You have a large volume of unstructured data.
·         You need high scalability.
·         You can live without the extra features that an RDBMS provides (e.g., typed columns, secondary indexes, transactions, advanced query languages, etc.). An application built against an RDBMS cannot be "ported" to HBase by simply changing a JDBC driver; consider moving from an RDBMS to HBase a complete redesign rather than a port.
·         You have lots of versioned data and want to store all of the versions.
·         You want column-oriented data access.
·         You have enough hardware. Even HDFS doesn't do well with anything fewer than five DataNodes (due to HDFS block replication, which defaults to three), plus a NameNode.
When Not to use HBase?
HBase isn't suitable for every problem. It is not a good choice if:
·         You only have a few thousand or a few million rows; a traditional RDBMS might be a better choice, because all of your data could wind up on a single node (or two) while the rest of the cluster sits idle.
·         You cannot live without RDBMS features such as SQL queries and transactions.
·         You have fewer than five DataNodes (given the default HDFS replication factor of three).
What is a NoSQL Database?
HBase is a type of "NoSQL" database. "NoSQL" is a general term meaning that the database isn't an RDBMS that supports SQL as its primary access language. There are many types of NoSQL databases: Berkeley DB is an example of a local NoSQL database, whereas HBase is very much a distributed database. HBase is really more a "Data Store" than a "Data Base" because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages. It does, however, provide:
·         Strongly consistent reads/writes: HBase is not an "eventually consistent" DataStore. This makes it very suitable for tasks such as high-speed counter aggregation (see the sketch after this list).
·         Automatic sharding: HBase tables are distributed on the cluster via regions, and regions are automatically split and re-distributed as your data grows.
·         Automatic RegionServer failover
·         Hadoop/HDFS Integration: HBase supports HDFS out of the box as its distributed file system.
·         MapReduce: HBase supports massively parallelized processing via MapReduce for using HBase as both source and sink.
·         Java Client API: HBase supports an easy to use Java API for programmatic access.
·         Thrift/REST API: HBase also supports Thrift and REST for non-Java front-ends.
·         Block Cache and Bloom Filters: HBase supports a Block Cache and Bloom Filters for high volume query optimization.
·         Operational Management: HBase provides built-in web pages for operational insight as well as JMX metrics.
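For example, the strongly consistent counter support mentioned in the first bullet is exposed through atomic increment operations. Below is a minimal sketch, assuming the HBase 1.0-style Java client API; the table, column family, and qualifier names are invented for illustration.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CounterExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("pageviews"))) { // invented table name
            // Atomically add 1 to the counter cell and return the new value.
            // Because HBase rows are strongly consistent, every subsequent
            // reader sees this result immediately.
            long hits = table.incrementColumnValue(
                    Bytes.toBytes("/index.html"), // row key
                    Bytes.toBytes("stats"),       // column family
                    Bytes.toBytes("hits"),        // column qualifier
                    1L);                          // amount to add
            System.out.println("Hits so far: " + hits);
        }
    }
}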
Difference between HBase and HDFS
HDFS:
·         Is a distributed file system that is well suited for the storage of large files. It is not, however, a general purpose file system, and does not provide fast individual record lookups in files.
·         Is suited for high-latency batch processing operations.
·         Data is primarily accessed through MapReduce.
·         Is designed for batch processing and hence has no concept of random reads/writes.
HBase:
·         Is built on top of HDFS and provides fast record lookups (and updates) for large tables. HBase internally puts the data in indexed "StoreFiles" that exist on HDFS for high-speed lookups.
·         Is built for low-latency operations.
·         Provides access to single rows from billions of records.
·         Data is accessed through shell commands, client APIs in Java, REST, Avro, or Thrift.
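To make the contrast concrete, here is a minimal sketch of the kind of single-row point lookup HBase provides and HDFS alone does not. It assumes the HBase 1.0-style Java client API; the table, family, and row key names are invented.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomRead {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("orders"))) { // invented table name
            // Fetch exactly one row by key -- a low-latency random read,
            // regardless of how many billions of rows the table holds.
            Result row = table.get(new Get(Bytes.toBytes("order-000017")));
            byte[] status = row.getValue(Bytes.toBytes("d"), Bytes.toBytes("status"));
            System.out.println(status == null ? "not found" : Bytes.toString(status));
        }
    }
}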
Difference between HBase and RDBMS
·         HBase is column-oriented; an RDBMS is row-oriented (mostly).
·         HBase has a flexible schema where columns can be added on the fly; an RDBMS has a fixed schema.
·         HBase is designed to store denormalized data; an RDBMS is designed to store normalized data.
·         HBase is good with sparse tables; an RDBMS is not optimized for sparse tables.
·         HBase performs joins via MapReduce, which is not optimized for them; an RDBMS is optimized for joins.
·         HBase has tight integration with MapReduce; an RDBMS has no integration with MapReduce.
·         HBase scales horizontally (just add hardware); an RDBMS is hard to shard and scale.
·         HBase is good for semi-structured as well as structured data; an RDBMS is good for structured data.
HBase Run Modes
HBase has two run modes: Standalone HBase and Distributed. Whichever mode you use, you will need to configure HBase by editing files in the HBase conf directory. At a minimum, you must edit conf/hbase-env.sh to tell HBase which Java to use. In this file you set HBase environment variables such as the heap size and other options for the JVM, the preferred location for log files, etc. Set JAVA_HOME to point at the root of your Java install. A sketch of a minimal configuration follows the list below.
·         Standalone HBase: This is the default mode. In standalone mode, HBase does not use HDFS -- it uses the local filesystem instead -- and it runs all HBase daemons and a local ZooKeeper in the same Java Virtual Machine. ZooKeeper binds to a well-known port so clients may talk to HBase.
·         Distributed: Distributed mode can be subdivided into:
·         Pseudo-distributed mode: In this mode all daemons run on a single node; it is simply a fully-distributed mode run on a single host. You can use this configuration for testing and prototyping with HBase. Do not use it for production or for evaluating HBase performance. This mode can run against the local filesystem or against an instance of the Hadoop Distributed File System (HDFS).
·         Fully-distributed mode: In this mode the daemons are spread across all nodes in the cluster. Fully-distributed mode can ONLY run on HDFS.
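As a rough illustration rather than a definitive setup, a pseudo-distributed configuration might combine fragments of conf/hbase-env.sh and conf/hbase-site.xml like the following. The JDK path, heap size, and HDFS address are placeholders for your environment; hbase.rootdir and hbase.cluster.distributed are the standard property names.

# conf/hbase-env.sh (fragment)
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk   # placeholder -- point at the root of your Java install
export HBASE_HEAPSIZE=4000                     # daemon heap size in MB (placeholder value)

<!-- conf/hbase-site.xml (fragment) for pseudo-distributed mode on HDFS -->
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <!-- placeholder NameNode address; this is where HBase persists its data -->
    <value>hdfs://localhost:8020/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <!-- true runs distributed daemons; false means standalone mode -->
    <value>true</value>
  </property>
</configuration>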
HBase Architecture
·         Splits
·         Auto-Sharding
·         Master
·         Region Servers
·         HFile
The HBase Physical Architecture consists of servers in a Master-Slave relationship, as shown in the figure below. Typically, an HBase cluster has one Master node, called HMaster, and multiple Region Servers, called HRegionServers. Each Region Server contains multiple Regions, called HRegions.
[Figure: HBase physical architecture -- one HMaster and multiple HRegionServers, each containing multiple HRegions]
Logical Architecture
Data in HBase is stored in Tables, and Tables are stored in Regions. A Table is partitioned into multiple Regions whenever it becomes too large. These Regions are assigned to Region Servers across the cluster. Each Region Server hosts roughly the same number of Regions.
The responsibilities of HMaster include:
·         Performing Administration
·         Managing and Monitoring the Cluster
·         Assigning Regions to the Region Servers
·         Controlling the Load Balancing and Failover
The HRegionServers, in turn, perform the following:
·         Hosting and managing Regions
·         Handling the read/write requests
·         Splitting the Regions automatically
·         Communicating with the Clients directly

Physical Architecture
An HBase RegionServer is typically collocated with an HDFS DataNode. HBase clients communicate directly with Region Servers to send and receive data. The HMaster is responsible for Region assignment and for handling DDL operations. ZooKeeper maintains the online configuration state. The HMaster and ZooKeeper are not involved in the data path.

Anatomy of a RegionServer
A RegionServer contains a single Write-Ahead Log (WAL, called HLog), a single BlockCache, and multiple Regions. A Region in turn contains multiple Stores, one for each Column Family, and each Store is made up of a MemStore and multiple StoreFiles. A StoreFile corresponds to a single HFile. HFiles and the HLog are persisted on HDFS.
.META is a system table that contains the mapping of Regions to Region Servers. When reading or writing data, clients first look up the required Region information in the .META table and then communicate directly with the appropriate Region Server. Each Region is identified by its start key (inclusive) and end key (exclusive).
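The Java client performs this lookup transparently, but you can inspect the Region-to-server mapping yourself. Here is a minimal sketch, assuming the HBase 1.0-style client API; the table name and row key are invented.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Bytes;

public class RegionLookup {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             RegionLocator locator = connection.getRegionLocator(TableName.valueOf("mytable"))) {
            // Ask which Region (and therefore which Region Server) hosts a given row key.
            HRegionLocation location = locator.getRegionLocation(Bytes.toBytes("row-0042"));
            System.out.println("Region: " + location.getRegionInfo().getRegionNameAsString());
            System.out.println("Server: " + location.getServerName());
        }
    }
}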

HBase Data Model
The HBase Data Model is designed to accommodate semi-structured data that can vary in data type, field size, and number of columns. The data model is laid out in a way that makes it easy to partition the data and distribute it across the cluster. The Data Model in HBase is made of several logical components: Tables, Rows, Column Families, Column Qualifiers, Columns, Cells, and Versions. (A short code sketch after the list shows these components in use.)
·         Tables: HBase organizes data into tables. Table names are Strings composed of characters that are safe for use in a file system path. HBase tables are essentially logical collections of rows stored in separate partitions called Regions.
·         Rows: Within a table, data is stored according to its row. Rows are identified uniquely by their row key. Row keys do not have a data type and are always treated as a byte[ ] (byte array).
·         Column Families: Data within a row is grouped by column family. Each Column Family has one or more Columns, and the Columns in a family are stored together in a low-level storage file known as an HFile. Column families also impact the physical arrangement of data stored in HBase. For this reason, they must be defined up front and are not easily modified. Every row in a table has the same column families, although a row need not store data in all of its families. Column family names are Strings composed of characters that are safe for use in a file system path.
·         Column Qualifier: Data within a column family is addressed via its column qualifier, or simply, column. Column qualifiers need not be specified in advance. Column Qualifier consists of the Column Family name concatenated with the Column name using a colon – example: columnfamily:columnname. Column qualifiers need not be consistent between rows. Like row keys, column qualifiers do not have a data type and are always treated as a byte[ ].
·         Cell: A combination of row key, column family, and column qualifier uniquely identifies a cell. The data stored in a cell is referred to as that cell’s value. Values also do not have a data type and are always treated as a byte[ ].
·         Version/Timestamp: Values within a cell are versioned. Versions are identified by their version number, which by default is the timestamp of when the cell was written. If a timestamp is not specified during a write, the current timestamp is used. If a timestamp is not specified for a read, the latest version is returned. The number of cell value versions retained by HBase is configured per column family; the default is three. Conceptually: (Row, Family:Column, Timestamp) → Value.
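A minimal sketch of how these components appear in code, assuming the HBase 1.0-style Java client API; the table, family, qualifier, and row key names are invented for illustration.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DataModelExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) { // invented table name
            // Row keys, families, qualifiers, and values are all byte[].
            Put put = new Put(Bytes.toBytes("user123"));          // row key
            put.addColumn(Bytes.toBytes("info"),                  // column family
                          Bytes.toBytes("email"),                 // column qualifier
                          Bytes.toBytes("u123@example.com"));     // cell value (version = current time)
            table.put(put);

            // Read the cell back, asking for up to three versions.
            Get get = new Get(Bytes.toBytes("user123"));
            get.setMaxVersions(3);
            Result result = table.get(get);
            byte[] latest = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
            System.out.println(Bytes.toString(latest));
        }
    }
}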
Storage Model
·         Column-oriented database (column families)
·         A table consists of rows, each of which has a primary key (row key)
·         Each row may have any number of columns
·         The table schema only defines column families (a column family can have any number of columns)
·         Each cell value has a timestamp
Access to data
·         Access data through the combination of table, row key, family, column, and timestamp
·         The API is very simple: Put, Get, Delete, Scan
·         The Scan API allows you to efficiently iterate over ranges of rows, limit which columns are returned, and cap the number of versions of each cell. You can match columns using filters and select versions using time ranges, specifying start and end times (a sketch follows this list).
·         Internally, a row is stored as a set of key/value pairs, where each key is the combination (row key, column family, column qualifier, timestamp) and the value is the cell data.
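Here is a minimal sketch of the Scan API described above, again assuming the HBase 1.0-style Java client; the table, family, and row-key range are invented.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("events"))) { // invented table name
            Scan scan = new Scan();
            scan.setStartRow(Bytes.toBytes("2015-05-01"));  // start of row-key range (inclusive)
            scan.setStopRow(Bytes.toBytes("2015-05-18"));   // end of row-key range (exclusive)
            scan.addColumn(Bytes.toBytes("e"), Bytes.toBytes("type")); // limit returned columns
            scan.setMaxVersions(1);                         // newest version of each cell only
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}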
HBase and ZooKeeper
·         HBase uses ZooKeeper extensively for region assignment. "ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services."
·         HBase can manage the ZooKeeper daemons for you, or you can install and manage them separately.