Tuesday, May 17, 2016

Hadoop Single Node Installation Guide

Environment Setup

  • Download Ubuntu Linux 12.04.3 LTS.
  • Download Hadoop 1.2.1, released August 2013.
  • Once the Linux server is set-up, install openssh, by executing the following command. 
    sudo apt-get install openssh-server
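
If you want to confirm that the SSH daemon actually came up after the install, a quick optional check on Ubuntu is:
    sudo service ssh status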

Pre-requisites

There are certain required steps to be done before starting the Hadoop installation. They are listed below, and each step is described with the commands needed. Before proceeding, make sure the Linux box is updated with the latest set of packages from all repositories and PPAs. It can be updated by executing the following command.
sudo apt-get update

Install Java 1.6+

Hadoop requires an installation of Java 1.6 or higher. It is always better to go with the latest Java version; in our case we have installed Java 7u25.
  • Download the latest Oracle Java for Linux (in our case jdk-7u25-linux-x64.tar.gz) from the Oracle website.
  • In the event the download fails, try executing the command below, which sends a cookie header to bypass the Oracle login prompt.
    wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com" \
    "https://edelivery.oracle.com/otn-pub/java/jdk/7u25-b15/jdk-7u25-linux-x64.tar.gz"
  • Copy the downloaded Java binaries to the folder /usr/local/java, by executing the following command.
    sudo cp -r jdk-7u25-linux-x64.tar.gz /usr/local/java
  • Go to the directory /usr/local/java by executing the following command
    cd /usr/local/java
  • Unzip the downloaded Java archive by executing the following command. This extracts the Java binaries into the folder /usr/local/java.
    sudo tar xvzf jdk-7u25-linux-x64.tar.gz 
  • Add the following system variables to the PATH by editing the system profile file located at /etc/profile.
    sudo nano /etc/profile OR sudo gedit /etc/profile
  • Add the following lines below to the end of your /etc/profile file:
    JAVA_HOME=/usr/local/java/jdk1.7.0_25
    PATH=$PATH:$HOME/bin:$JAVA_HOME/bin
    export JAVA_HOME
    export PATH
    
  • Execute the following command to indicate where the Oracle Java JDK/JRE is located. This tells the system that the new Oracle Java version is available for use.
    sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/local/java/jdk1.7.0_25/bin/javac" 1
  • Reload your system-wide /etc/profile by executing the following command:
    . /etc/profile
  • Check the Java version to confirm that Java was installed successfully.
    java -version
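
The update-alternatives step above registers only javac. If you also want the java launcher itself to point at the new JDK, a similar optional set of commands (assuming the same install path as above) is:
    sudo update-alternatives --install "/usr/bin/java" "java" "/usr/local/java/jdk1.7.0_25/bin/java" 1
    sudo update-alternatives --set java /usr/local/java/jdk1.7.0_25/bin/java
    sudo update-alternatives --set javac /usr/local/java/jdk1.7.0_25/bin/javac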

Add a dedicated Hadoop user account

While it is not required, it is generally recommended to create a separate user account to run the Hadoop installation.
  • Start by adding a group by executing the following command.
    sudo addgroup hadoop
  • Create a user (hduser) and add it to the group (hadoop) created above.
    sudo adduser --ingroup hadoop hduser
    You will be asked to provide the password and other information.
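
To double-check that the new user landed in the right group, you can run the following; the output should list hadoop among the user's groups.
    id hduser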

Configuring SSH Access

Hadoop requires SSH access to manage its nodes, i.e. remote machines and your local machine if you want to use Hadoop on it. This access allows the master node to log in to its slave nodes and to start and stop the services on them. For a single-node setup of Hadoop (as in this tutorial), we need to configure SSH access to localhost for the hduser user we created in the step above.
Make sure that SSH is up and running on the Linux box, and configured to allow SSH public key authentication.
  • Login to the Linux box as ‘hduser’.
  • Generate SSH key for user hduser, by executing the following command.
    ssh-keygen -t rsa -P ""
  • You will be asked for the location where you would like to save the key file. Just press <Enter> to accept the default. The RSA key will be generated at /home/hduser/.ssh. This key is created with an empty passphrase. Generally, it is not good practice to use an empty passphrase, but in this case we need Hadoop to interact with its nodes without any manual intervention.
  • Enable SSH Access to your local machine with this newly created key, by executing the following command.
    cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
  • The last step is to test the SSH setup by connecting to your local machine as the hduser user. This will add your machine's host key fingerprint to the hduser user's known_hosts file.
    ssh hduser@localhost
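
If the ssh hduser@localhost step still prompts for a password, the usual culprit is file permissions on the key files. A commonly used fix (run as hduser) is:
    chmod 700 $HOME/.ssh
    chmod 600 $HOME/.ssh/authorized_keys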

Disable IPv6

One problem with IPv6 on Ubuntu is that using 0.0.0.0 for the various networking-related Hadoop configuration options will result in Hadoop binding to the IPv6 addresses of the Ubuntu box.
  • Log in as root. Open the file /etc/sysctl.conf by executing the following command.
    sudo gedit /etc/sysctl.conf
  • Add the following lines to the end of the file and reboot the machine, to update the configurations correctly. 
    #disable ipv6
    net.ipv6.conf.all.disable_ipv6 = 1
    net.ipv6.conf.default.disable_ipv6 = 1
    net.ipv6.conf.lo.disable_ipv6 = 1
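
After the reboot you can verify whether IPv6 is actually disabled; the following command should print 1 on success.
    cat /proc/sys/net/ipv6/conf/all/disable_ipv6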

Hadoop Installation

Download the latest Hadoop version. In our case it is Hadoop 1.2.1.
  • Download Hadoop version 1.2.1 (an example download command is shown after this list).
  • Unzip the compressed hadoop file by executing the following command:
    tar -xvzf hadoop-1.2.1.tar.gz
  • Rename the hadoop-1.2.1 directory to hadoop by executing the following command.
    mv hadoop-1.2.1 hadoop
  • Move hadoop directory and its contents to the location of your choice. In our case it is /usr/local.
    sudo mv hadoop /usr/local/
  • Change the ownership of the files to user hduser and group hadoop.
    sudo chown -R hduser:hadoop /usr/local/hadoop
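
The download command itself is not shown above; one way to fetch the 1.2.1 release is from the Apache archive (the mirror URL below is an assumption, any Apache mirror will do):
    wget https://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz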

Hadoop Configuration

There are certain files that need to be updated and certain configuration steps to be completed before Hadoop is up and running. The configurations to be done are listed below, each described with the commands needed.

Update the $HOME/.bashrc file

  • Execute the following command to edit the .bashrc file. 
    vi .bashrc
  • Add the following lines to the end of the file.
    export JAVA_HOME=/usr/local/java/jdk1.7.0_25
    export HADOOP_HOME=/usr/local/hadoop
    export PATH=$HADOOP_HOME/bin:$JAVA_HOME/bin:$PATH
  • Reload the .bashrc file by executing the following command.
    source ~/.bashrc
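
A quick way to confirm that the new variables are in effect in the current shell is shown below; the last command should report version 1.2.1 if the PATH was picked up correctly.
    echo $JAVA_HOME
    echo $HADOOP_HOME
    hadoop version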

Configure hadoop-env.sh

The only required environment variable we have to configure for Hadoop in this guide is JAVA_HOME. Open conf/hadoop-env.sh in the editor of your choice (if you used the installation path in this tutorial, the full path is /usr/local/hadoop/conf/hadoop-env.sh) and set the JAVA_HOME environment variable to the Sun JDK/JRE 7 directory.
export JAVA_HOME=/usr/local/java/jdk1.7.0_25


Configure conf/*-site.xml files

In this section, we will configure the directory where Hadoop stores its data files, the network ports it listens to etc.
  • Log in as a user with sudo privileges (not hduser). Create a directory called /data by executing the following command.
    sudo mkdir /data
  • Change the ownership of this folder to user hduser.
    sudo chown hduser:hadoop /data
  • Create a directory tmp under the data directory.
    mkdir /data/tmp
  • Log in as user hduser. Edit core-site.xml in the /usr/local/hadoop/conf directory.
    su - hduser
    cd /usr/local/hadoop/conf
    vi core-site.xml
    
  • Add the following entry to the file. Save and quit the file.
    <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/tmp</value>
    <description>A base for other temporary directories.</description>
    </property>
    <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
    <description>The name of the default file system.
    A URI whose scheme and authority determine the FileSystem implementation. 
    The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem
    implementation class. The uri's authority is used to determine the host, port,
    etc. for a filesystem.</description> 
    </property>
  • Edit the mapred-site.xml in /usr/local/hadoop/conf directory
    vi mapred-site.xml
  • Add the following entry to the file and save and quit the file. 
    <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
    <description>The host and port that the MapReduce job tracker runs at. If "local",
    then jobs are run in-process as a single map and reduce task.
    </description>
    </property>
  • Edit the hdfs-site.xml in /usr/local/hadoop/conf directory
    vi hdfs-site.xml
  • Add the following entry to the file and save and quit the file.
    <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication. The actual number of replications can be specified
    when the file is created. The default is used if replication is not specified in create time.
    </description>
    </property>
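
Note that the snippets above are <property> fragments only: each of the three files ships with an (initially empty) <configuration> element, and the properties go inside it. As an illustration, a minimal core-site.xml after the edit would look roughly like this:
    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/data/tmp</value>
      </property>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:54310</value>
      </property>
    </configuration>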

Format the HDFS Filesystem and Start the Hadoop Server

The first step to starting your Hadoop installation is formatting the Hadoop Distributed File System (HDFS), which is implemented on top of the local filesystems of your cluster. This step is required the first time you set up a Hadoop cluster. Do not format a running Hadoop filesystem, as you will lose all the data currently in the cluster (in HDFS)!
  • To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), execute the following command in the $HADOOP_HOME/bin directory
    hadoop namenode -format

  • Start the Hadoop server by executing the following command 
    start-all.sh
  • Run the jps command to verify that all the services are up and running (see the example output after this list).
  • Run netstat -plten | grep java to see the list of ports Hadoop is listening on.
  • Stop Hadoop by running the following command
    stop-all.sh
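
For reference, on a healthy single-node setup the jps output typically lists processes similar to the following (process IDs will differ):
    NameNode
    SecondaryNameNode
    DataNode
    JobTracker
    TaskTracker
    Jps
You can also point a browser at http://localhost:50070 (NameNode web UI) and http://localhost:50030 (JobTracker web UI) to confirm the daemons are serving requests.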