BigData: 04/15/16

Pig Installation Guide

Environment Setup

1.1 Install the latest stable build of Apache Hadoop.
1.2 Download Pig version 0.11.1.

Pig Installation

Download a stable version of Pig. This guide uses Pig 0.11.1, which works with Hadoop 0.20.x, 0.23.x, 1.x, and 2.x.

Execute the following command to download Pig 0.11.1:

wget http://apache.mirrors.hoobly.com/pig/pig-0.11.1/pig-0.11.1.tar.gz

Copy the Pig tarball into the folder /usr/local/pig by executing the following commands. The first command creates the folder if it does not already exist.
```
sudo mkdir -p /usr/local/pig
sudo cp pig-0.11.1.tar.gz /usr/local/pig/
```
Change the directory to /usr/local/pig by executing the following command:
```
cd /usr/local/pig
```
Extract the compressed Pig archive by executing the following command:
```
sudo tar -xvzf pig-0.11.1.tar.gz
```
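A quick sanity check before and after extraction can catch a truncated download early. The sketch below is only illustrative: `check_pig_tarball` and `check_pig_layout` are hypothetical helper names, and it assumes the archive unpacks to a directory containing `bin/pig`. A typical call would be `check_pig_tarball pig-0.11.1.tar.gz && check_pig_layout /usr/local/pig/pig-0.11.1`.

```shell
#!/usr/bin/env bash
# Sketch: sanity-check a Pig tarball before and after extraction.
# check_pig_tarball and check_pig_layout are hypothetical helper names.

# Return 0 if the archive exists and is a readable gzip-compressed tar file.
check_pig_tarball() {
    local archive="$1"
    [ -f "$archive" ] || { echo "missing: $archive" >&2; return 1; }
    tar -tzf "$archive" > /dev/null 2>&1 || { echo "corrupt archive: $archive" >&2; return 1; }
}

# Return 0 if the extracted tree contains an executable bin/pig launcher.
check_pig_layout() {
    local pig_home="$1"
    [ -x "$pig_home/bin/pig" ] || { echo "no bin/pig under $pig_home" >&2; return 1; }
}
```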
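Update the .bashrc file for hduser so that the Pig environment variables are set every time hduser logs in. Edit the .bashrc file and add the entries shown below. The PATH entry assumes that HADOOP_HOME is already set by the Hadoop installation.
```
# Open hduser's .bashrc in an editor, then append the two export lines
vi ~/.bashrc
export PIG_HOME='/usr/local/pig/pig-0.11.1'
export PATH=$HADOOP_HOME/bin:$PIG_HOME/bin:$JAVA_HOME/bin:$PATH
```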
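Set the environment variable JAVA_HOME to point to the Java installation directory, which Pig uses internally. Replace the placeholder with the actual path, and place this line above the PATH entry in .bashrc, because PATH references $JAVA_HOME.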
```
export JAVA_HOME=<<Java_installation_directory>>
```
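If the Java installation directory is not known, one way to find a candidate is to resolve the `java` launcher's real location and strip the trailing `bin/java`. This is a sketch under assumptions: `java_home_from_bin` is a hypothetical helper, `readlink -f` assumes GNU coreutils (typical on Linux), and on some JDK layouts the result may be a `jre` subdirectory rather than the JDK root. It could be called as `java_home_from_bin "$(command -v java)"`.

```shell
#!/usr/bin/env bash
# Sketch: derive a candidate JAVA_HOME from the java launcher's real location.
# java_home_from_bin is a hypothetical helper; readlink -f assumes GNU coreutils.
java_home_from_bin() {
    local java_bin="$1" real
    # Follow any symlinks (for example /usr/bin/java) to the real binary.
    real=$(readlink -f "$java_bin") || return 1
    # Strip the trailing /bin/java to get the installation directory.
    dirname "$(dirname "$real")"
}
```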
Reload the .bashrc file so that the changes take effect in the current shell, by executing the following command.
```
. ~/.bashrc
```
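To confirm that the settings took effect, the variables can be checked from the current shell. The following is a minimal sketch, not part of Pig itself: `check_pig_env` is a hypothetical helper name, and it only tests that the expected launchers exist and that PATH contains `$PIG_HOME/bin`. If it succeeds, running `pig -version` should print the installed version.

```shell
#!/usr/bin/env bash
# Sketch: confirm that the variables from .bashrc are visible in this shell.
# check_pig_env is a hypothetical helper name.
check_pig_env() {
    local status=0
    # JAVA_HOME and PIG_HOME must each point at a tree with the expected launcher.
    [ -x "${JAVA_HOME:-}/bin/java" ] || { echo "JAVA_HOME is unset or has no bin/java" >&2; status=1; }
    [ -x "${PIG_HOME:-}/bin/pig" ] || { echo "PIG_HOME is unset or has no bin/pig" >&2; status=1; }
    # PATH must include $PIG_HOME/bin so that a bare `pig` resolves.
    case ":$PATH:" in
        *":${PIG_HOME:-}/bin:"*) ;;
        *) echo "PATH does not contain \$PIG_HOME/bin" >&2; status=1 ;;
    esac
    return "$status"
}
```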
Pig is now set up and configured for further use.

Execution Modes in Pig

Pig has two modes of execution: local mode and MapReduce mode. Both are described below.

Local Mode

Local mode is used to verify and debug Pig scripts or queries. It is efficient for handling small datasets on a single machine. It runs in a single JVM and accesses the local filesystem.

To run in local mode, execute the following command:
```
$ pig -x local
```
As soon as the above command runs, the Grunt shell opens, where the user can run Pig commands against the local filesystem.
```
grunt>
```
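Local mode can also run a script file instead of opening the shell. The sketch below builds a tiny dataset and a three-line Pig Latin script, then runs it with `pig -x local` if `pig` is on PATH. The file names (`people.txt`, `adults.pig`) and the sample rows are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: build a tiny dataset and a Pig Latin script, then run it in local mode.
# File names (people.txt, adults.pig) and the data are made up for illustration.
workdir=$(mktemp -d)

# Sample input: comma-separated name and age.
cat > "$workdir/people.txt" <<'EOF'
alice,34
bob,12
carol,19
EOF

# Pig Latin: load the file, keep rows with age >= 18, print them.
cat > "$workdir/adults.pig" <<'EOF'
people = LOAD 'people.txt' USING PigStorage(',') AS (name:chararray, age:int);
adults = FILTER people BY age >= 18;
DUMP adults;
EOF

# Run only if pig is installed; the relative path in LOAD is resolved
# against the current directory, so run from inside the work directory.
if command -v pig > /dev/null 2>&1; then
    (cd "$workdir" && pig -x local adults.pig)
else
    echo "pig not found on PATH; files are in $workdir"
fi
```

With Pig installed, the DUMP statement should print only the rows whose age is 18 or over.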

MapReduce Mode

This is the default mode. Pig translates the queries into MapReduce jobs, which requires access to a Hadoop cluster and its filesystem.

To run in MapReduce mode, execute the following command:

$ pig

As soon as the above command runs, the Grunt shell opens, where the user can run Pig commands against the Hadoop filesystem. The startup output looks like this:

2013-10-28 11:39:44,767 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.1 (r1459641) compiled Mar 22 2013, 02:13:53
2013-10-28 11:39:44,767 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hduser/pig_1382985584762.log
2013-10-28 11:39:44,797 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/hduser/.pigbootup not found
2013-10-28 11:39:45,094 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://Hadoopmaster:54310
2013-10-28 11:39:45,592 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: Hadoopmaster:54311
grunt>

The log output shows the filesystem and the job tracker that Pig connects to.
Grunt is an interactive shell for running Pig queries and commands.
There are three ways to run Pig programs: run them as Pig scripts, run interactive queries in the Grunt shell, or embed a script in Java code.
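
The first of these, running a script file, can be wrapped in a small helper that selects the execution mode. This is a sketch rather than part of Pig: `run_pig_script` and the `PIG_BIN` override are hypothetical names, and `PIG_BIN` defaults to `pig`.

```shell
#!/usr/bin/env bash
# Sketch: run a Pig script non-interactively in a chosen execution mode.
# run_pig_script and the PIG_BIN override are hypothetical names.
run_pig_script() {
    local mode="$1" script="$2"
    # Accept only the two execution modes described in this guide.
    case "$mode" in
        local|mapreduce) ;;
        *) echo "unknown mode: $mode (expected local or mapreduce)" >&2; return 2 ;;
    esac
    [ -f "$script" ] || { echo "script not found: $script" >&2; return 1; }
    # Equivalent to typing: pig -x <mode> <script>
    "${PIG_BIN:-pig}" -x "$mode" "$script"
}
```

For example, `run_pig_script local adults.pig` runs a script against the local filesystem, and `run_pig_script mapreduce adults.pig` submits the same script to the Hadoop cluster.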

Friday, April 15, 2016