An Introduction to Pig
Pig is an open-source, high-level data flow system. Apache Pig allows us to write complex MapReduce transformations using a simple scripting language (roughly 10 lines of Pig script can replace 200 lines of Java). This language is called Pig Latin. Pig Latin defines a set of transformations on a data set such as aggregate, join and sort. The statements written in a Pig script are then translated by Pig into MapReduce jobs so that they can be executed within Hadoop. Pig Latin may be extended using User Defined Functions (UDFs), which the user can write in Java, Python or another scripting language and then call directly from Pig Latin.
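For example, a Java UDF might be registered and called as in the following sketch; the jar name, package and function are hypothetical placeholders for whatever the user has written.

    -- Hypothetical: myudfs.jar and myudfs.UPPER stand in for a user-written Java UDF
    REGISTER myudfs.jar;
    names = LOAD 'names.txt' AS (name:chararray);
    upper = FOREACH names GENERATE myudfs.UPPER(name);
    DUMP upper;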
What Does Pig Do?
Pig was designed mainly for performing long series of data operations, making it ideal for three categories of Big Data operations:
· Standard extract-transform-load (ETL) data pipelines,
· Research on raw data, and
· Iterative processing of data.
Pig Latin has the following key properties:
· Easy to program: Complex tasks involving interrelated data transformations can be simplified and encoded as data flow sequences. Pig programs can accomplish huge tasks, yet they are easy to write and maintain.
· Optimization opportunities: The way tasks are encoded permits the system to optimize their execution automatically, which allows the user to concentrate on semantics rather than efficiency.
· Extensible: Pig users can create their own functions to do special-purpose processing.
Why use Pig when MapReduce already exists?
From the figure below, it is clear that Pig takes about 1/20th the lines of code of an equivalent MapReduce program. Likewise, the development time for Pig is about 1/16th that of a MapReduce program.
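To give a feel for this conciseness, below is a sketch of the classic word count in Pig Latin (the input path and aliases are illustrative only). The same job written directly against the MapReduce API would typically need a mapper class, a reducer class and driver code.

    -- Word count in a few lines of Pig Latin
    lines  = LOAD 'input.txt' AS (line:chararray);
    words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grpd   = GROUP words BY word;
    counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
    STORE counts INTO 'wordcount_out';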
Advantages of Pig include:
· Pig is a much higher-level, declarative language than MapReduce, which increases programmer productivity, decreases duplication of effort and also opens the MapReduce programming system to more users.
· Pig insulates the user from Hadoop's complexity.
· Pig is similar to an SQL query in that the user specifies the “what” and leaves the “how” to the underlying processing engine.
Pig is useful for:
· time-sensitive data loads,
· processing many data sources, and
· gaining analytic insight through sampling.
Pig is not useful:
· when processing really nasty data formats of completely unstructured data (e.g. video, audio, raw human-readable text),
· when speed matters most, since Pig jobs are slower than equivalent hand-written MapReduce jobs, and
· when more power is required to optimize the code.
Modes of User Interaction with Pig
Pig has two execution types:
· Local Mode: To run Pig in local mode, users need access to a single machine; all files are installed and run using the local host and file system. Local mode is specified using the -x flag (pig -x local). Note that local mode does not support parallel mapper execution with Hadoop 0.20.x and 1.0.0.
· MapReduce Mode: To run Pig in mapreduce mode, users need access to a Hadoop cluster and an HDFS installation. Mapreduce mode is the default mode; users can, but don't need to, specify it using the -x flag (pig OR pig -x mapreduce).
Pig allows three modes of user interaction:
· Interactive mode: The user is presented with an interactive shell called Grunt, which accepts Pig commands. Compilation and execution are triggered only when the user asks for output through the STORE command.
· Batch mode: In this mode, a user submits a pre-written Pig script containing a series of Pig commands, typically ending with STORE (see the sketch after this list). The semantics are identical to interactive mode.
· Embedded mode: Pig Latin commands can be submitted via method invocations from a Java program. This option permits dynamic construction of Pig Latin programs, as well as dynamic control flow.
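As an illustrative sketch, a small batch-mode script might look like the following (the file names and fields are hypothetical); saved as visits.pig, it could be submitted with pig visits.pig, or its statements could be typed one at a time into Grunt in interactive mode.

    -- visits.pig: hypothetical batch script
    visits  = LOAD 'visits.csv' USING PigStorage(',') AS (user:chararray, url:chararray);
    by_user = GROUP visits BY user;
    counts  = FOREACH by_user GENERATE group AS user, COUNT(visits) AS n;
    STORE counts INTO 'visit_counts';   -- STORE triggers compilation and execution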
The Programming Language
There are three basic parts to a Pig program (a complete example follows this list):
· The first step in a Pig program is to LOAD the data the user wants to manipulate from HDFS.
LOAD: The data on which processing is done is stored in HDFS. In order for a Pig program to access this data and transform it, the program must first tell Pig what file (or files) it will require, and that’s done through the LOAD 'data_file' command (where 'data_file' specifies either an HDFS file or directory). If a directory is specified, all the files in that directory are loaded into the program. If the data is stored in a file format that is not natively accessible to Pig, you can optionally add a USING clause to the LOAD statement to specify a user-defined function that can read in and interpret the data.
· Then the user runs the data through a set of transformations (which, internally, are translated into a set of map and reduce tasks).
TRANSFORM: All the data manipulation happens in the transformation logic. Here one can FILTER out rows that are not of interest, JOIN two sets of data files, GROUP data to build aggregations, ORDER results, and much more.
· Finally, the user DUMPs the data to the screen or STOREs the results in a file somewhere.
DUMP and STORE: The DUMP or STORE command is required to generate the results of a Pig program. The user can typically use the DUMP command to send the output to the screen while debugging Pig programs. If the user needs to do further processing or analysis on the generated output, the DUMP call can be changed to a STORE call so that any results from running the program are stored in a file. Note that the DUMP command can be used anywhere in the program to dump intermediate result sets to the screen, which is very useful for debugging purposes.
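Putting the three parts together, here is a sketch of a complete script; the file names, schema and the 100.0 threshold are illustrative, not taken from any particular dataset.

    -- LOAD: read the input; PigStorage(',') parses comma-separated fields
    orders  = LOAD 'orders.csv' USING PigStorage(',')
              AS (cust:chararray, item:chararray, amount:double);
    -- TRANSFORM: filter, group and aggregate
    big     = FILTER orders BY amount > 100.0;
    by_cust = GROUP big BY cust;
    totals  = FOREACH by_cust GENERATE group AS cust, SUM(big.amount) AS total;
    -- DUMP for debugging, or STORE to write the results to a file
    DUMP totals;
    STORE totals INTO 'big_order_totals';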
Pig Compilation and Execution Stages
A Pig program goes through a series of transformation steps before being executed, as shown in the figure above.
· Parsing: This is the first step. The parser's duty is to verify that the program is syntactically correct and that all referenced variables are defined. Type checking and schema inference can also be done by the parser. Other checks, such as verifying the ability to instantiate classes corresponding to user-defined functions, also occur in this phase. The output of the parser is a canonical logical plan with a one-to-one correspondence between Pig Latin statements and logical operators, arranged in a directed acyclic graph (DAG).
· Logical Optimizer: The logical plan generated by the parser is then passed through a logical optimizer. In this stage, logical optimizations such as projection pushdown are carried out.
· Map-Reduce Compiler and Map-Reduce Optimizer: The optimized logical plan is then compiled into a series of Map-Reduce jobs, which then pass through another optimization phase. An example of Map-Reduce-level optimization is using the Map-Reduce combiner stage to perform early partial aggregation, in the case of distributive or algebraic aggregation functions.
· Hadoop Job Manager: The DAG of optimized Map-Reduce jobs is topologically sorted, and the jobs are then submitted to Hadoop for execution in that order. Pig monitors the Hadoop execution status, and the user periodically gets reports on the progress of the overall Pig program. Any warnings or errors that arise during execution are logged and reported to the user.
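To see these stages for a concrete script, Pig's EXPLAIN command prints the logical, physical and Map-Reduce plans it builds for a relation without running the job. The pipeline below is an illustrative sketch; the file and field names are hypothetical.

    -- COUNT is an algebraic function, so Pig can apply the combiner optimization here
    logs  = LOAD 'access_log.txt' AS (ip:chararray, url:chararray);
    by_ip = GROUP logs BY ip;
    hits  = FOREACH by_ip GENERATE group AS ip, COUNT(logs) AS hits;
    EXPLAIN hits;   -- prints the logical, physical and Map-Reduce plans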
Pig Data Model
The data model can be defined as follows:
· A relation is a bag of tuples.
· A bag is a collection of tuples (duplicate tuples are allowed). It is similar to a "table" in an RDBMS, except that Pig does not require that the tuple field types match, or even that the tuples have the same number of fields. A bag is denoted by curly braces {} (e.g. {(Alice, 1.0), (Alice, 2.0)}).
· A tuple is an ordered set of fields. Each field is a piece of data of any type (data atom, tuple or data bag) (e.g. (Alice, 1.0) or (Alice, 2.0)).
· A field is a piece of data or a simple atomic value (e.g. ‘Alice’ or ‘1.0’).
· A Data Map is a map from keys that are string literals to values that can be of any data type (the key is an atom while the value can be of any type). A map is denoted by square brackets [].
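As an illustrative sketch, a LOAD schema can combine these types; the file name and field names below are hypothetical.

    -- Hypothetical schema mixing an atom, a tuple, a bag and a map
    people = LOAD 'people.dat' AS (
        name:chararray,                                    -- atomic field
        address:tuple(city:chararray, zip:chararray),      -- tuple
        scores:bag{t:tuple(score:double)},                 -- bag of tuples
        props:map[]                                        -- map with chararray keys
    );
    DESCRIBE people;   -- prints the schema of the relation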
Pig Latin Relational Operators
Loading and Storing:
· LOAD: Loads data from the file system or other storage into a relation.
· STORE: Saves a relation to the file system or other storage.
· DUMP: Prints a relation to the console.
Filtering:
· FILTER: Removes unwanted rows from a relation.
· DISTINCT: Removes duplicate rows from a relation.
· FOREACH…GENERATE: Generates data transformations based on columns of data.
· STREAM: Transforms a relation using an external program.
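A brief sketch of the filtering operators on a hypothetical event log (the file, fields and the external normalize.py script are placeholders):

    events  = LOAD 'events.csv' USING PigStorage(',') AS (user:chararray, action:chararray, ts:long);
    clicks  = FILTER events BY action == 'click';   -- keep only click events
    users   = FOREACH clicks GENERATE user;         -- project a single column
    uniq    = DISTINCT users;                       -- drop duplicate rows
    cleaned = STREAM uniq THROUGH `./normalize.py`; -- pipe tuples through an external program
    DUMP cleaned;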
Grouping and Joining:
· JOIN: Joins two or more relations.
· COGROUP: Groups the data in two or more relations.
· GROUP: Groups the data in a single relation.
· CROSS: Creates the cross product of two or more relations.
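An illustrative sketch of GROUP, JOIN and COGROUP on two hypothetical relations:

    users  = LOAD 'users.csv'  USING PigStorage(',') AS (uid:int, name:chararray);
    orders = LOAD 'orders.csv' USING PigStorage(',') AS (uid:int, amount:double);
    by_uid   = GROUP orders BY uid;                  -- one relation, grouped by key
    joined   = JOIN users BY uid, orders BY uid;     -- flat join on uid
    together = COGROUP users BY uid, orders BY uid;  -- both relations grouped side by side
    DUMP together;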
Sorting:
· ORDER: Sorts a relation by one or more fields.
· LIMIT: Limits the size of a relation to a maximum number of tuples.
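For example, a small sketch with hypothetical data:

    sales  = LOAD 'sales.csv' USING PigStorage(',') AS (item:chararray, amount:double);
    by_amt = ORDER sales BY amount DESC;   -- sort by amount, descending
    top5   = LIMIT by_amt 5;               -- keep only the first five tuples
    DUMP top5;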
Combining and Splitting:
· UNION: Combines two or more relations into one.
· SPLIT: Splits a relation into two or more relations.
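A sketch of UNION and SPLIT on hypothetical relations with matching schemas:

    jan    = LOAD 'sales_jan.csv' USING PigStorage(',') AS (item:chararray, amount:double);
    feb    = LOAD 'sales_feb.csv' USING PigStorage(',') AS (item:chararray, amount:double);
    all_q1 = UNION jan, feb;                 -- combine the two relations into one
    SPLIT all_q1 INTO small IF amount < 100.0, large IF amount >= 100.0;
    STORE small INTO 'small_sales';
    STORE large INTO 'large_sales';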
Difference between Pig and Hive

Features            | Pig             | Hive
Schemas/Types       | Yes (implicit)  | Yes (explicit)
Language            | Pig Latin       | SQL-like
JDBC/ODBC           | No              | Yes (Limited)
Server              | No              | Optional (Thrift)
DFS Direct Access   | Yes (explicit)  | Yes (implicit)
Partitions          | No              | Yes
Web Interface       | No              | Yes