Wednesday, May 18, 2016

An Introduction to Hive

Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. HiveQL, an SQL-like language provided by Hive, offers a mechanism to project structure onto the data and query it. The language also allows traditional map/reduce programmers to plug in custom mappers and reducers when it is inconvenient or inefficient to express that logic in HiveQL.
Hive features include:
·         Tools to enable easy data extract/transform/load (ETL).
·         A mechanism to impose structure on a variety of data formats.
·         Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase.
·         Query execution via MapReduce.
·         Best used for batch jobs over large sets of append-only data (like web logs).
·         Targeted towards users comfortable with SQL.
·         Abstracts away the complexity of Hadoop.
·         Developed by Facebook and contributed to the community.
·         HiveQL can also be extended with custom scalar functions (UDFs), aggregations (UDAFs), and table functions (UDTFs).
·         There is no such thing as a “Hive format” in which data must be stored. Hive works equally well on Thrift, control-delimited, or any of your specialized data formats.
·         By default, Hive stores metadata in an embedded Apache Derby database; other databases such as MySQL can optionally be used.
·         Currently, there are four file formats supported in Hive: TEXTFILE, SEQUENCEFILE, ORC, and RCFILE (see the sketch after this list).
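To make the SQL-like feel and the pluggable file formats concrete, here is a minimal HiveQL sketch; the table and column names (page_views and so on) are invented for illustration:

-- Project a schema onto tab-delimited text files (hypothetical names).
CREATE TABLE page_views (
  user_id INT,
  url STRING,
  view_time STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- The same rows can be rewritten into another supported format.
CREATE TABLE page_views_orc STORED AS ORC
AS SELECT * FROM page_views;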
Hive applications include:
·         Data Mining
·         Document Indexing
·         Predictive modeling and hypothesis testing
·         Customer-facing Business Intelligence (e.g., Google Analytics)
·         Log processing
Note, however, that Hive is not designed for OLTP workloads and does not offer real-time queries or row-level updates.
Hive Components:
The figure below shows the major components of Hive and their interactions with Hadoop.
The main components of Hive are:
·         Shell: It allows interactive queries through a command-line interface (CLI), much like the MySQL shell connected to a database. Web and JDBC clients are also supported.
·         Thrift Server: This server exposes a very simple client API to execute HiveQL statements.
·         Metastore: The metastore is the system catalog. All other components of Hive interact with the metastore.
·         Driver: It manages the life cycle of a HiveQL statement during compilation, optimization, and execution. On receiving the HiveQL statement from the Thrift server or some other interface, it creates a session handle which is later used to keep track of statistics like execution time, number of output rows, etc.
·         Query Compiler: The compiler is invoked by the driver upon receiving a HiveQL statement. It translates the statement into a plan consisting of a DAG (directed acyclic graph) of map-reduce jobs; the EXPLAIN sketch after this list shows one such plan.
·         Execution Engine: The driver submits the individual map-reduce jobs from the DAG to the execution engine in topological order. The execution engine used by Hive is currently Hadoop.
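As a small illustration of the compiler's output, HiveQL's EXPLAIN statement prints the plan for a query; this sketch reuses the hypothetical page_views table from above:

-- Print the DAG of stages (map-reduce jobs, metadata and HDFS
-- operations) that the compiler generates for this query.
EXPLAIN
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url;

The output begins with the stage dependencies, followed by the operator trees of each stage.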
[Figure: the major components of Hive and the numbered flow of a query through them.]
The UI calls the execute interface of the Driver, as shown by step 1 in the figure. The Driver then creates a session handle for the query and sends the query to the compiler to generate an execution plan (step 2). The compiler gets the necessary metadata from the metastore (steps 3 and 4). This metadata is used to type-check the expressions in the query tree as well as to prune partitions based on query predicates. The plan generated by the compiler (step 5) is a DAG (directed acyclic graph) of stages, with each stage being either a map/reduce job, a metadata operation, or an operation on HDFS. For map/reduce stages, the plan contains map operator trees and a reduce operator tree.

The execution engine submits these stages to the appropriate components (steps 6, 6.1, 6.2, and 6.3). In each task (mapper or reducer), the deserializer associated with the table or with intermediate outputs is used to read rows from HDFS files, and these rows are passed through the associated operator tree. Once the output is generated, it is written to a temporary HDFS file through the serializer (this happens in the mapper if the operation does not need a reducer). The temporary files are used to provide data to subsequent map/reduce stages of the plan.

For DML operations, the final temporary file is moved to the table's location; this scheme ensures that dirty data is never read. For queries, the contents of the temporary file are read by the execution engine directly from HDFS as part of the fetch call from the Driver (steps 7, 8, and 9).


The figure above shows the typical flow of a query through the system.
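To tie the steps together, a DML statement like the following follows exactly this path (the table names are hypothetical, carried over from the earlier sketches):

-- Hypothetical destination table for the aggregated results.
CREATE TABLE daily_url_counts (url STRING, views BIGINT);

-- The job writes to a temporary HDFS file; that file is moved into
-- the table's location only after the job completes, so readers
-- never see partial ("dirty") data.
INSERT OVERWRITE TABLE daily_url_counts
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url;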
·         Metastore: It is the system catalog containing metadata about all the tables stored in Hive. This metadata is specified during table creation and reused every time the table is referenced or queried in HiveQL. The metastore is what distinguishes Hive as a traditional warehousing solution when compared with similar data processing systems built on top of map-reduce-like architectures, such as Pig. It provides two important features of a data warehouse: data abstraction and data discovery. The metastore contains the following objects:
·         Database: It is a namespace for tables. ‘default’ is the database used for tables with no user-supplied database name.
·         Table: The metadata for a table contains the list of columns with their types, the table’s owner, and storage and SerDe information. If a user supplies arbitrary key-value data, that can also be stored in the metadata. Storage information includes the location of the table’s data in the file system, data formats, and bucketing information. Metadata about the SerDe includes the implementation classes of the serializer and deserializer methods and any supporting information those implementations need. All of this information can be provided during the creation of the table itself (see the sketch after this list).
·         Partition: Each partition can have its own columns, SerDe, and storage information. This facilitates schema changes without affecting older partitions.
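The stored metadata can be inspected from HiveQL itself; a brief sketch using the hypothetical table from earlier:

-- Create a namespace for tables; 'default' is used otherwise.
CREATE DATABASE IF NOT EXISTS analytics;

-- Print what the metastore holds for a table: columns and types,
-- owner, location in HDFS, storage format, and SerDe details.
DESCRIBE FORMATTED page_views;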
Hive Data Model: Data in Hive is organized into the following:
·         Tables: These are similar to tables in an RDBMS. Tables can be filtered, projected, joined, and unioned. All the data of a table is stored in a directory in HDFS. Hive also supports the notion of external tables, wherein a table can be created on pre-existing files or directories in HDFS by providing the appropriate location in the table-creation DDL. As in relational databases, the rows in a table are organized into typed columns (int, float, string, date, boolean).
·         Partitions: Each table can have one or more partition keys which determine how the data is stored; e.g., a table A with a date partition column dt has the files with data for a particular date stored in the <table location>/dt=<date> directory in HDFS. Partitions allow the system to prune the data to be inspected based on query predicates; e.g., a query interested in rows from A that satisfy the condition A.dt = '2008-09-01' only has to look at files in the <table location>/dt=2008-09-01/ directory in HDFS.
·         Buckets: Data in each partition may in turn be divided into buckets based on the hash of a column in the table. Each bucket is stored as a file in the partition directory. Bucketing allows the system to efficiently evaluate queries that depend on a sample of the data (queries that use the TABLESAMPLE clause on the table). Buckets are useful for sampling and join optimization; a sketch follows this list.
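A hedged sketch of a table that uses all three levels (table, partition, bucket); the names are again illustrative:

-- Partitioned by date; each partition's rows are hashed on
-- user_id into 32 bucket files.
CREATE TABLE page_views_pb (
  user_id INT,
  url STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS TEXTFILE;

-- Partition pruning: only <table location>/dt=2008-09-01/ is read.
SELECT * FROM page_views_pb WHERE dt = '2008-09-01';

-- Sampling reads a subset of the bucket files instead of all rows.
SELECT * FROM page_views_pb TABLESAMPLE(BUCKET 1 OUT OF 32 ON user_id)
WHERE dt = '2008-09-01';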
Hive Physical Layout
·         Its default warehouse directory is in HDFS, e.g. /user/hive/warehouse.
·         Tables are stored in the subdirectories of this warehouse directory. Partitions form subdirectories of tables.
·         The actual data is stored in flat files, which can be character-delimited text files or sequence files. Files in arbitrary formats can also be used by supplying a custom SerDe; the external-table sketch after this list shows where such files live.
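An external table makes this physical layout explicit; a minimal sketch, with an assumed HDFS path:

-- The data stays at the given path rather than under
-- /user/hive/warehouse, and dropping the table leaves the files.
CREATE EXTERNAL TABLE raw_logs (
  line STRING
)
STORED AS TEXTFILE
LOCATION '/data/raw_logs';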
Abilities of Hive Query Language
·         Hive Query Language (HiveQL) provides the basic SQL-like operations.
·         Ability to filter rows from a table using a “where” clause.
·         Ability to do equijoins between two tables.
·         Ability to store the results of a query in a Hadoop DFS directory.
·         Ability to manage tables and partitions (create, drop, and alter).
·         Ability to store the results of a query into another table. Each of these abilities is illustrated in the sketch after this list.
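One statement per ability, sketched over the hypothetical tables introduced earlier (the users table here is likewise invented):

-- Filter rows with a WHERE clause.
SELECT url FROM page_views WHERE user_id = 42;

-- Equijoin two tables.
SELECT p.url, u.name
FROM page_views p JOIN users u ON (p.user_id = u.user_id);

-- Store query results in an HDFS directory.
INSERT OVERWRITE DIRECTORY '/tmp/url_counts'
SELECT url, COUNT(*) FROM page_views GROUP BY url;

-- Manage tables and partitions.
ALTER TABLE page_views_pb ADD PARTITION (dt = '2008-09-02');
ALTER TABLE page_views_pb DROP PARTITION (dt = '2008-09-01');

-- Store query results into another table.
INSERT OVERWRITE TABLE daily_url_counts
SELECT url, COUNT(*) FROM page_views GROUP BY url;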
Difference between Hive and RDBMS
Schema on Read vs. Schema on Write:
·         Rather than verifying the data when it is loaded into Hive, Hive verifies the data when a query is issued.
·         Since the data does not have to be read, parsed, and serialized to disk in the database’s internal format at load time, schema on read makes the initial load very fast: the load operation is just a file copy or move (see the sketch after this list).
·         Hive offers no update, transaction, or index features.
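Schema on read is visible in the load path itself; a minimal sketch, with an assumed input path:

-- LOAD DATA only moves the file into the table's HDFS directory;
-- nothing is parsed or verified at load time.
LOAD DATA INPATH '/data/incoming/views.tsv'
INTO TABLE page_views;

-- Parsing against the declared schema happens here, at query time.
SELECT COUNT(*) FROM page_views;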