An Introduction to Pig
Pig is an open-source, high-level data flow system. Apache Pig allows us to write complex MapReduce transformations using a simple scripting language (roughly 10 lines of Pig script can replace 200 lines of Java). This language is called Pig Latin. Pig Latin defines a set of transformations on a data set such as aggregate, join and sort. The statements written in a Pig script are then translated by Pig into MapReduce jobs so that they can be executed within Hadoop. Pig Latin may be extended using User Defined Functions (UDFs), which the user can write in Java, Python or another scripting language and then call directly from Pig Latin.
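For example, a Java UDF might be registered and called as in the following sketch; the jar name, package and function are hypothetical placeholders for whatever the user has written.

    -- Hypothetical: myudfs.jar and myudfs.UPPER stand in for a user-written Java UDF
    REGISTER myudfs.jar;
    names = LOAD 'names.txt' AS (name:chararray);
    upper = FOREACH names GENERATE myudfs.UPPER(name);
    DUMP upper;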
What Does Pig Do?
Pig was designed mainly for performing long series of data operations, making it ideal for three categories of Big Data operations:
· Standard extract-transform-load (ETL) data pipelines,
· Research on raw data, and
· Iterative processing of data.
Pig Latin has the following key properties:
· Easy to program: Complex tasks involving interrelated data transformations can be simplified and encoded as data flow sequences. Pig programs can accomplish huge tasks, yet they are easy to write and maintain.
· Optimization opportunities: The way tasks are encoded permits the system to optimize their execution automatically, which allows the user to concentrate on semantics rather than efficiency.
· Extensible: Pig users can create their own functions to do special-purpose processing.
Why use Pig when MapReduce already exists?
From the figure below, it is clear that Pig takes about 1/20th the lines of code of an equivalent MapReduce program. Likewise, the development time for Pig is about 1/16th that of a MapReduce program.
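To give a feel for this conciseness, below is a sketch of the classic word count in Pig Latin (the input path and aliases are illustrative only). The same job written directly against the MapReduce API would typically need a mapper class, a reducer class and driver code.

    -- Word count in a few lines of Pig Latin
    lines  = LOAD 'input.txt' AS (line:chararray);
    words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grpd   = GROUP words BY word;
    counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
    STORE counts INTO 'wordcount_out';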
Advantages of Pig include:
· Pig is a much higher-level, declarative language than MapReduce, which increases programmer productivity, decreases duplication of effort and also opens the MapReduce programming system to more users.
· Pig insulates the user from Hadoop's complexity.
· Pig is similar to an SQL query in that the user specifies the “what” and leaves the “how” to the underlying processing engine.
Pig is useful for:
· time-sensitive data loads,
· processing many data sources, and
· gaining analytic insight through sampling.
Pig is not useful:
· when processing really nasty data formats of completely unstructured data (e.g. video, audio, raw human-readable text),
· when speed matters most, since Pig jobs are slower than equivalent hand-written MapReduce jobs, and
· when more power is required to optimize the code.
Modes of User Interaction with Pig
Pig has two execution types:
· Local Mode: To run Pig in local mode, users need access to a single machine; all files are installed and run using the local host and file system. Local mode is specified using the -x flag (pig -x local). Note that local mode does not support parallel mapper execution with Hadoop 0.20.x and 1.0.0.
· MapReduce Mode: To run Pig in mapreduce mode, users need access to a Hadoop cluster and an HDFS installation. Mapreduce mode is the default mode; users can, but don't need to, specify it using the -x flag (pig OR pig -x mapreduce).
Pig allows three modes of user interaction:
· Interactive mode: The user is presented with an interactive shell called Grunt, which accepts Pig commands. Compilation and execution are triggered only when the user asks for output through the STORE command.
· Batch mode: In this mode, a user submits a pre-written Pig script containing a series of Pig commands, typically ending with STORE (see the sketch after this list). The semantics are identical to interactive mode.
· Embedded mode: Pig Latin commands can be submitted via method invocations from a Java program. This option permits dynamic construction of Pig Latin programs, as well as dynamic control flow.
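As an illustrative sketch, a small batch-mode script might look like the following (the file names and fields are hypothetical); saved as visits.pig, it could be submitted with pig visits.pig, or its statements could be typed one at a time into Grunt in interactive mode.

    -- visits.pig: hypothetical batch script
    visits  = LOAD 'visits.csv' USING PigStorage(',') AS (user:chararray, url:chararray);
    by_user = GROUP visits BY user;
    counts  = FOREACH by_user GENERATE group AS user, COUNT(visits) AS n;
    STORE counts INTO 'visit_counts';   -- STORE triggers compilation and execution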
The Programming Language
There are three basic parts to a Pig program (a complete example follows this list):
· The first step in a Pig program is to LOAD the data the user wants to manipulate from HDFS.
LOAD: The data on which processing is done is stored in HDFS. In order for a Pig program to access this data and transform it, the program must first tell Pig what file (or files) it will require, and that’s done through the LOAD 'data_file' command (where 'data_file' specifies either an HDFS file or directory). If a directory is specified, all the files in that directory are loaded into the program. If the data is stored in a file format that is not natively accessible to Pig, you can optionally add a USING clause to the LOAD statement to specify a user-defined function that can read in and interpret the data.
· Then the user runs the data through a set of transformations (which, internally, are translated into a set of map and reduce tasks).
TRANSFORM: All the data manipulation happens in the transformation logic. Here one can FILTER out rows that are not of interest, JOIN two sets of data files, GROUP data to build aggregations, ORDER results, and much more.
· Finally, the user DUMPs the data to the screen or STOREs the results in a file somewhere.
DUMP and STORE: The DUMP or STORE command is required to generate the results of a Pig program. The user can typically use the DUMP command to send the output to the screen while debugging Pig programs. If the user needs to do further processing or analysis on the generated output, the DUMP call can be changed to a STORE call so that any results from running the program are stored in a file. Note that the DUMP command can be used anywhere in the program to dump intermediate result sets to the screen, which is very useful for debugging purposes.
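Putting the three parts together, here is a sketch of a complete script; the file names, schema and the 100.0 threshold are illustrative, not taken from any particular dataset.

    -- LOAD: read the input; PigStorage(',') parses comma-separated fields
    orders  = LOAD 'orders.csv' USING PigStorage(',')
              AS (cust:chararray, item:chararray, amount:double);
    -- TRANSFORM: filter, group and aggregate
    big     = FILTER orders BY amount > 100.0;
    by_cust = GROUP big BY cust;
    totals  = FOREACH by_cust GENERATE group AS cust, SUM(big.amount) AS total;
    -- DUMP for debugging, or STORE to write the results to a file
    DUMP totals;
    STORE totals INTO 'big_order_totals';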
Pig Compilation and Execution Stages
A Pig program goes through a series of transformation steps before being executed, as shown in the figure above.
· Parsing: This is the first step. The parser's duty is to verify that the program is syntactically correct and that all referenced variables are defined. Type checking and schema inference can also be done by the parser. Other checks, such as verifying the ability to instantiate classes corresponding to user-defined functions, also occur in this phase. The output of the parser is a canonical logical plan with a one-to-one correspondence between Pig Latin statements and logical operators, arranged in a directed acyclic graph (DAG).
· Logical Optimizer: The logical plan generated by the parser is then passed through a logical optimizer. In this stage, logical optimizations such as projection pushdown are carried out.
· Map-Reduce Compiler and Map-Reduce Optimizer: The optimized logical plan is then compiled into a series of Map-Reduce jobs, which then pass through another optimization phase. An example of Map-Reduce-level optimization is using the Map-Reduce combiner stage to perform early partial aggregation, in the case of distributive or algebraic aggregation functions.
· Hadoop Job Manager: The DAG of optimized Map-Reduce jobs is topologically sorted, and the jobs are then submitted to Hadoop for execution in that order. Pig monitors the Hadoop execution status, and the user periodically gets reports on the progress of the overall Pig program. Any warnings or errors that arise during execution are logged and reported to the user.
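To see these stages for a concrete script, Pig's EXPLAIN command prints the logical, physical and Map-Reduce plans it builds for a relation without running the job. The pipeline below is an illustrative sketch; the file and field names are hypothetical.

    -- COUNT is an algebraic function, so Pig can apply the combiner optimization here
    logs  = LOAD 'access_log.txt' AS (ip:chararray, url:chararray);
    by_ip = GROUP logs BY ip;
    hits  = FOREACH by_ip GENERATE group AS ip, COUNT(logs) AS hits;
    EXPLAIN hits;   -- prints the logical, physical and Map-Reduce plans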
Pig Data Model
The data model can be defined as follows:
· A relation is a bag of tuples.
· A bag is a collection of tuples (duplicate tuples are allowed). It is similar to a "table" in an RDBMS, except that Pig does not require that the tuple field types match, or even that the tuples have the same number of fields. A bag is denoted by curly braces {} (e.g. {(Alice, 1.0), (Alice, 2.0)}).
· A tuple is an ordered set of fields. Each field is a piece of data of any type (data atom, tuple or data bag) (e.g. (Alice, 1.0) or (Alice, 2.0)).
· A field is a piece of data or a simple atomic value (e.g. ‘Alice’ or ‘1.0’).
· A Data Map is a map from keys that are string literals to values that can be of any data type (the key is an atom while the value can be of any type). A map is denoted by square brackets [].
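As an illustrative sketch, a LOAD schema can combine these types; the file name and field names below are hypothetical.

    -- Hypothetical schema mixing an atom, a tuple, a bag and a map
    people = LOAD 'people.dat' AS (
        name:chararray,                                    -- atomic field
        address:tuple(city:chararray, zip:chararray),      -- tuple
        scores:bag{t:tuple(score:double)},                 -- bag of tuples
        props:map[]                                        -- map with chararray keys
    );
    DESCRIBE people;   -- prints the schema of the relation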
Pig Latin Relational Operators
Loading and Storing:
· LOAD: Loads data from the file system or other storage into a relation.
· STORE: Saves a relation to the file system or other storage.
· DUMP: Prints a relation to the console.
Filtering:
· FILTER: Removes unwanted rows from a relation.
· DISTINCT: Removes duplicate rows from a relation.
· FOREACH…GENERATE: Generates data transformations based on columns of data.
· STREAM: Transforms a relation using an external program.
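A brief sketch of the filtering operators on a hypothetical event log (the file, fields and the external normalize.py script are placeholders):

    events  = LOAD 'events.csv' USING PigStorage(',') AS (user:chararray, action:chararray, ts:long);
    clicks  = FILTER events BY action == 'click';   -- keep only click events
    users   = FOREACH clicks GENERATE user;         -- project a single column
    uniq    = DISTINCT users;                       -- drop duplicate rows
    cleaned = STREAM uniq THROUGH `./normalize.py`; -- pipe tuples through an external program
    DUMP cleaned;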
Grouping and Joining:
· JOIN: Joins two or more relations.
· COGROUP: Groups the data in two or more relations.
· GROUP: Groups the data in a single relation.
· CROSS: Creates the cross product of two or more relations.
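An illustrative sketch of GROUP, JOIN and COGROUP on two hypothetical relations:

    users  = LOAD 'users.csv'  USING PigStorage(',') AS (uid:int, name:chararray);
    orders = LOAD 'orders.csv' USING PigStorage(',') AS (uid:int, amount:double);
    by_uid   = GROUP orders BY uid;                  -- one relation, grouped by key
    joined   = JOIN users BY uid, orders BY uid;     -- flat join on uid
    together = COGROUP users BY uid, orders BY uid;  -- both relations grouped side by side
    DUMP together;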
Sorting:
· ORDER: Sorts a relation by one or more fields.
· LIMIT: Limits the size of a relation to a maximum number of tuples.
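For example, a small sketch with hypothetical data:

    sales  = LOAD 'sales.csv' USING PigStorage(',') AS (item:chararray, amount:double);
    by_amt = ORDER sales BY amount DESC;   -- sort by amount, descending
    top5   = LIMIT by_amt 5;               -- keep only the first five tuples
    DUMP top5;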
Combining and Splitting:
· UNION: Combines two or more relations into one.
· SPLIT: Splits a relation into two or more relations.
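A sketch of UNION and SPLIT on hypothetical relations with matching schemas:

    jan    = LOAD 'sales_jan.csv' USING PigStorage(',') AS (item:chararray, amount:double);
    feb    = LOAD 'sales_feb.csv' USING PigStorage(',') AS (item:chararray, amount:double);
    all_q1 = UNION jan, feb;                 -- combine the two relations into one
    SPLIT all_q1 INTO small IF amount < 100.0, large IF amount >= 100.0;
    STORE small INTO 'small_sales';
    STORE large INTO 'large_sales';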
Difference between Pig and Hive

Features            | Pig             | Hive
Schemas/Types       | Yes (implicit)  | Yes (explicit)
Language            | Pig Latin       | SQL-like
JDBC/ODBC           | No              | Yes (Limited)
Server              | No              | Optional (Thrift)
DFS Direct Access   | Yes (explicit)  | Yes (implicit)
Partitions          | No              | Yes
Web Interface       | No              | Yes