Big Data
An Introduction to Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, query, and analysis. HiveQL, the SQL-like language provided by Hive, offers a mechanism to project structure onto the data and to query that data. The language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express the logic in HiveQL.
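As a quick illustration, the sketch below shows a basic HiveQL query and a custom-script transform. The table page_views and the script my_mapper.py are hypothetical and only meant to show the shape of the language.

-- A basic aggregation, compiled by Hive into map/reduce jobs:
SELECT dt, COUNT(*) AS views
FROM page_views
GROUP BY dt;

-- Plugging in a custom mapper written as an external script:
ADD FILE my_mapper.py;
SELECT TRANSFORM (userid, url)
  USING 'python my_mapper.py'
  AS (userid, domain)
FROM page_views;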
Hive features include:
· Tools to enable easy data extract/transform/load (ETL).
· A mechanism to impose structure on a variety of data formats.
· Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase.
· Query execution via MapReduce.
· Best used for batch jobs over large sets of append-only data (like web logs).
· Abstracts the complexity of Hadoop.
· Developed by Facebook and contributed to the community.
· QL can also be extended with custom scalar functions (UDFs), aggregations (UDAFs), and table functions (UDTFs).
· There is no such thing as a “Hive format” that has to be used to read or write data in Hive. Hive works equally well on Thrift, control-delimited, or any of your specialized data formats.
· By default, Hive stores metadata in an embedded Apache Derby database; other databases such as MySQL can optionally be used.
· Currently, there are four file formats supported in Hive: TEXTFILE, SEQUENCEFILE, ORC, and RCFILE (see the sketch after this list).
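The sketch below illustrates the file-format and UDF extension points mentioned above; the table names, the JAR, and the function class are assumptions made for the example, not part of any particular deployment.

-- The storage format is chosen per table; a delimited text table and an ORC table:
CREATE TABLE page_views_txt (userid BIGINT, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

CREATE TABLE page_views_orc (userid BIGINT, url STRING)
STORED AS ORC;

-- Registering a custom scalar function (UDF) packaged in a user-supplied JAR:
ADD JAR my_udfs.jar;
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.MyLower';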
Hive applications include:
· Data mining
· Document indexing
· Predictive modeling and hypothesis testing
· Customer-facing business intelligence (e.g., Google Analytics)
· Log processing
Note that Hive is not designed for OLTP workloads and does not offer real-time queries or row-level updates.
Hive Components:
The figure below shows the major components of Hive and its interactions with Hadoop.
The main components of Hive are:
· Shell: Allows interactive queries via a command-line interface (CLI), much like a MySQL shell connected to a database. Web and JDBC clients are also supported.
· Thrift Server: Exposes a very simple client API to execute HiveQL statements.
· Metastore: The system catalog. All other components of Hive interact with the metastore.
· Driver: Manages the life cycle of a HiveQL statement during compilation, optimization, and execution. On receiving a HiveQL statement from the Thrift server or another interface, it creates a session handle which is later used to keep track of statistics such as execution time and the number of output rows.
· Query Compiler: Invoked by the driver upon receiving a HiveQL statement. The compiler translates the statement into a plan consisting of a DAG (directed acyclic graph) of map-reduce jobs.
· Execution Engine: The driver submits the individual map-reduce jobs from the DAG to the execution engine in topological order. The execution engine currently used by Hive is Hadoop.
The UI calls the execute interface of the Driver, as shown by step 1 in the figure. The Driver then creates a session handle for the query and sends the query to the compiler to generate an execution plan (step 2). The compiler gets the necessary metadata from the metastore (steps 3 and 4). This metadata is used to check the expressions in the query tree as well as to prune partitions based on query predicates. The plan generated by the compiler (step 5) is a DAG (directed acyclic graph) of stages, with each stage being either a map/reduce job, a metadata operation, or an operation on HDFS. For map/reduce stages, the plan contains map operator trees and a reduce operator tree. The execution engine submits these stages to the appropriate components, as shown in steps 6, 6.1, 6.2, and 6.3. In each task (mapper/reducer), the deserializer associated with the table or intermediate output is used to read the rows from HDFS files, and these rows are passed through the associated operator tree. Once the output is generated, it is written to a temporary HDFS file through the serializer (this happens in the mapper when the operation does not need a reducer). The temporary files are used to provide data to subsequent map/reduce stages of the plan. For DML operations, the final temporary file is moved to the table's location. This scheme ensures that dirty data is not read. For queries, the contents of the temporary file are read by the execution engine directly from HDFS as part of the fetch call from the Driver (steps 7, 8, and 9).
The figure above shows a typical flow of a query through the system.
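The DAG of stages produced by the compiler can be inspected with Hive's EXPLAIN statement. The query below is a hypothetical example; the output lists the stage dependencies and the operator trees for each stage.

EXPLAIN
SELECT dt, COUNT(*)
FROM page_views
WHERE dt = '2008-09-01'
GROUP BY dt;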
· Metastore: The system catalog, which contains metadata about all the tables stored in Hive. This metadata is specified during table creation and reused every time the table is referenced or queried in HiveQL. The metastore distinguishes Hive as a traditional warehousing solution when compared with similar data processing systems built on top of map-reduce-like architectures, such as Pig. The metastore provides two important features of a data warehouse: data abstraction and data discovery. It contains the following objects:
· Database: A namespace for tables. 'default' is the database used for tables with no user-supplied database name.
· Table: The metadata for a table contains the list of columns with their types, the table's owner, and storage and SerDe information. If a user supplies key/value data, that can also be stored in the metadata. Storage information includes the location of the table's data in the file system, data formats, and bucketing information. Metadata about the SerDe includes the implementation class of the serializer and deserializer methods and any supporting information. All of this information can be provided during the creation of the table itself.
· Partition: Each partition can have its own columns, SerDe, and associated storage information. This facilitates schema changes without affecting older partitions (see the commands after this list).
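The metadata held by the metastore can be examined from HiveQL itself; the commands below are a small sketch using a hypothetical table name.

-- Show the columns, owner, location, storage format, and SerDe of a table:
DESCRIBE FORMATTED page_views;

-- List the partitions that the metastore knows about for the table:
SHOW PARTITIONS page_views;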
Hive Data Model: Data in Hive is organized into the following:
· Tables: Similar to tables in an RDBMS. Tables can be filtered, projected, joined, and unioned. All the data of a table is stored in a directory in HDFS. Hive also supports the notion of external tables, wherein a table can be created on pre-existing files or directories in HDFS by providing the appropriate location to the table-creation DDL. The rows in a table are organized into typed columns (int, float, string, date, boolean), similar to relational databases.
· Partitions: Each table can have one or more partition keys which determine how the data is stored; e.g., a table A with a date partition column dt has files with data for a particular date stored in the <table location>/dt=<date> directory in HDFS. Partitions allow the system to prune the data to be inspected based on query predicates; e.g., a query interested in rows from A that satisfy the condition A.dt = '2008-09-01' would only have to look at files in the <table location>/dt=2008-09-01/ directory in HDFS.
· Buckets: Data in each partition may in turn be divided into buckets based on the hash of a column in the table. Each bucket is stored as a file in the partition directory. Bucketing allows the system to efficiently evaluate queries that depend on a sample of data (queries that use the SAMPLE clause on the table). Buckets are useful for sampling and join optimization. A DDL sketch of a partitioned, bucketed table follows this list.
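As a sketch of the concepts above, the DDL below creates a partitioned, bucketed table named A with the date partition column dt from the example; the other column names, the bucket count, and the file format are illustrative assumptions.

-- Table partitioned by date and bucketed by a hash of userid:
CREATE TABLE A (userid BIGINT, url STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS
STORED AS SEQUENCEFILE;

-- Partition pruning: only <table location>/dt=2008-09-01/ needs to be scanned.
SELECT * FROM A WHERE A.dt = '2008-09-01';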
Hive Physical Layout
· The default warehouse directory is in HDFS, e.g. /user/hive/warehouse (see the layout sketch after this list).
· Tables are stored in subdirectories of this warehouse directory. Partitions form subdirectories of tables.
· The actual data is stored in flat files. Flat files can be character-delimited text or sequence files. Files in arbitrary formats can also be used with a custom SerDe.
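Under these defaults, the on-disk layout for a hypothetical partitioned table named page_views would look roughly like this (paths and file names are illustrative):

/user/hive/warehouse/page_views/                      (table directory)
/user/hive/warehouse/page_views/dt=2008-09-01/        (partition subdirectory)
/user/hive/warehouse/page_views/dt=2008-09-01/000000_0   (flat data file)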
Abilities of Hive Query Language
· Hive Query Language (HiveQL) provides the basic SQL-like operations (see the example queries after this list).
· Ability to filter rows from a table using a “where” clause.
· Ability to do equijoins between two tables.
· Ability to store the results of a query in a Hadoop DFS directory.
· Ability to manage tables and partitions (create, drop, and alter).
· Ability to store the results of a query into another table.
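The queries below sketch these abilities; the tables page_views, users, and daily_counts and the output directory are hypothetical names chosen for the example.

-- Filter rows with a WHERE clause:
SELECT * FROM page_views WHERE dt = '2008-09-01';

-- Equijoin between two tables:
SELECT pv.url, u.gender
FROM page_views pv JOIN users u ON (pv.userid = u.userid);

-- Store query results in an HDFS directory:
INSERT OVERWRITE DIRECTORY '/tmp/daily_output'
SELECT dt, COUNT(*) FROM page_views GROUP BY dt;

-- Store query results into another table:
INSERT OVERWRITE TABLE daily_counts
SELECT dt, COUNT(*) FROM page_views GROUP BY dt;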
Difference between Hive and RDBMS
Schema on Read vs. Schema on Write:
· Rather than verifying the data when it is loaded into Hive (schema on write), Hive verifies the data when a query is issued (schema on read).
· Since the data does not have to be read, parsed, and serialized to disk in the database's internal format at load time, schema on read makes the initial load very fast. The load operation is just a file copy or move (see the LOAD DATA sketch below).
· Update, transaction, and index features are not available in Hive.
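A minimal sketch of such a load, assuming a hypothetical delimited file and the partitioned table used earlier; LOAD DATA moves (or, with LOCAL, copies) the file into the table's directory without parsing or validating its contents.

-- The file lands in the table's dt=2008-09-01 partition directory as-is:
LOAD DATA LOCAL INPATH '/tmp/page_views_2008-09-01.txt'
OVERWRITE INTO TABLE page_views PARTITION (dt='2008-09-01');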