Friday, April 15, 2016

Pig Installation Guide

Environment Setup

  • 1.1 Install the latest stable build of Apache Hadoop.
  • 1.2 Download Pig version 0.11.1.

Pig Installation

Download the stable version of Pig; in our case, Pig 0.11.1. This version works with Hadoop 0.20.x, 0.23.x, 1.x, and 2.x.
  • Download the Pig 0.11.1 release archive, pig-0.11.1.tar.gz.
  • Copy the Pig archive into the folder /usr/local/pig, creating the folder first if it does not exist, by executing the following commands.
    sudo mkdir -p /usr/local/pig
    sudo cp pig-0.11.1.tar.gz /usr/local/pig/
  • Change the directory to /usr/local/pig by executing the following command.
    cd /usr/local/pig
  • Extract the compressed Pig archive by executing the following command.
    sudo tar -xvzf pig-0.11.1.tar.gz

  • Update the .bashrc file for hduser so that the Pig environment variables are set every time hduser logs in. Edit the .bashrc file and add the entries shown below.
    $ vi ~/.bashrc
    export PIG_HOME='/usr/local/pig/pig-0.11.1'
    export PATH=$HADOOP_HOME/bin:$PIG_HOME/bin:$JAVA_HOME/bin:$PATH
  • Set the environment variable JAVA_HOME to point to the Java installation directory, which Pig uses internally. Place this line above the PATH entry so that $JAVA_HOME is already defined when PATH is expanded.
    export JAVA_HOME=<<Java_installation_directory>>
  • Reload the .bashrc file in the current shell by executing the following command.
    . ~/.bashrc
  • Pig is now set up and configured for further use.
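
The setup above can be sanity-checked with a short shell sketch. The function name check_pig_env is a hypothetical helper, not part of Pig; it only assumes the PIG_HOME, JAVA_HOME, and PATH entries described in the steps above.

```shell
# check_pig_env is a hypothetical helper (not shipped with Pig).
# It reports whether the variables from .bashrc are usable.
check_pig_env() {
  status=0
  # PIG_HOME and JAVA_HOME must be set and point at existing directories.
  for var in PIG_HOME JAVA_HOME; do
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      echo "MISSING: $var is not set"
      status=1
    elif [ ! -d "$value" ]; then
      echo "BAD: $var=$value is not a directory"
      status=1
    else
      echo "OK: $var=$value"
    fi
  done
  # The pig launcher should be reachable through PATH.
  if command -v pig >/dev/null 2>&1; then
    echo "OK: pig found at $(command -v pig)"
  else
    echo "MISSING: pig is not on PATH"
    status=1
  fi
  return "$status"
}

check_pig_env || echo "Pig environment is incomplete; review ~/.bashrc"
```

Run it in a fresh login shell for hduser, so that the values come from .bashrc rather than from the current session.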

Execution Modes in Pig

Pig has 2 modes of execution: local mode and MapReduce mode. Both are described in detail below.

Local Mode

Local mode is used to verify and debug Pig scripts or queries. It is efficient for handling small datasets on a single machine. It runs in a single JVM and accesses the local filesystem.
  • To run in local mode, execute the following command.
    $ pig -x local
    As soon as the above command runs, the Grunt shell opens, where the user can run Pig commands against the local filesystem.
    grunt>
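
As a rough illustration of local mode, the sketch below writes a tiny data file and a Pig Latin script into a scratch directory and runs the script with pig -x local. The file names students.txt and passed.pig, the sample rows, and the relation names are all made up for this example. The run step is skipped if pig is not on the PATH.

```shell
# Work in a scratch directory so no existing files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data: one "name,score" record per line.
cat > students.txt <<'EOF'
asha,72
ben,41
chen,88
EOF

# A small Pig Latin script: load the file from the local filesystem,
# keep rows with a score of at least 50, and print them.
cat > passed.pig <<'EOF'
records = LOAD 'students.txt' USING PigStorage(',') AS (name:chararray, score:int);
passed = FILTER records BY score >= 50;
DUMP passed;
EOF

# Run the script in local mode only if Pig is installed.
if command -v pig >/dev/null 2>&1; then
  pig -x local passed.pig
else
  echo "pig not found on PATH; skipping run"
fi
```

The same three Pig Latin statements can also be typed one at a time at the grunt> prompt.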

MapReduce Mode

This is the default mode. Pig translates the queries into MapReduce jobs, which requires access to a Hadoop cluster and its filesystem.
  • To run in MapReduce mode, execute the following command.
    $ pig
    As soon as the above command runs, the Grunt shell opens, where
    the user can run Pig commands against the Hadoop filesystem.
    2013-10-28 11:39:44,767 [main] INFO org.apache.pig.Main -
    Apache Pig version 0.11.1 (r1459641) compiled Mar 22 2013, 02:13:53
    2013-10-28 11:39:44,767 [main] INFO org.apache.pig.Main -
    Logging error messages to: /home/hduser/pig_1382985584762.log
    2013-10-28 11:39:44,797 [main] INFO org.apache.pig.impl.util.Utils -
    Default bootup file /home/hduser/.pigbootup not found
    2013-10-28 11:39:45,094 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
    Connecting to hadoop file system at:
    hdfs://Hadoopmaster:54310
    2013-10-28 11:39:45,592 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
    Connecting to map-reduce job tracker at:
    Hadoopmaster:54311
    grunt>
  • The log output shows the filesystem and job tracker that Pig connects to.
  • Grunt is an interactive shell for running Pig queries and commands.
  • There are 3 ways to run Pig programs: run them as Pig scripts, use the Grunt shell to run interactive queries, or embed a script in Java code.
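
The first two ways differ only in whether a script file is passed to the pig command. The sketch below wraps that choice in a hypothetical helper, run_pig, which is not part of Pig. With DRY_RUN=1 it only prints the command line it would run, so it can be tried without a cluster. The script name passed.pig is a placeholder.

```shell
# run_pig MODE [SCRIPT] is a hypothetical wrapper around the pig command.
# With a SCRIPT it runs that script in batch; without one it opens Grunt.
# When DRY_RUN=1, the command line is printed instead of executed.
run_pig() {
  mode=$1
  script=${2:-}
  case "$mode" in
    local|mapreduce) ;;
    *) echo "unknown mode: $mode" >&2; return 1 ;;
  esac
  # Build the argument list: -x selects the execution mode.
  set -- -x "$mode" ${script:+"$script"}
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "pig $*"
  else
    pig "$@"
  fi
}

DRY_RUN=1
run_pig local passed.pig   # batch script against the local filesystem
run_pig mapreduce          # interactive Grunt shell against the cluster
```

Setting DRY_RUN=0 makes the same calls invoke pig for real. The third way, embedding a script in Java code, goes through Pig's Java API and is not covered by this shell sketch.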