Monday, March 2, 2015

Hadoop HAWQ Installation using Ambari

1.     Introduction
This document describes how to install HAWQ (Hadoop with Query) using Apache Ambari.

2.     Environment
This installation guide applies to the following environment:
o    Apache Ambari 2.X
o    Hortonworks HDP 2.X
o    VMware vSphere 5.5 or later
o    RHEL/CentOS 6.5 or later
o    Internet Explorer 10 or later
o    Pivotal HDB (HAWQ) 2.0.0.0

3.     Installation
1.     Overview
Below is an overview of the installation process that this document describes.
·    Confirm prerequisites
·    Configure Isilon OneFS (Required if HDFS is deployed on Isilon)
·    Install Apache HAWQ using Ambari
·    Validate HAWQ deployment
·    Install HAWQ benchmark tool pivotalguru
·    Run HAWQ benchmark tool pivotalguru
2.     Confirm Prerequisites
1.     Prepare VMware virtualized environment
Before you start the installation process, a VMware virtualized environment must be ready to provision the virtual machines required for this deployment. ESXi 5.5 or a later release is recommended for the virtualized environment.
2.     Prepare the host machines that will host HAWQ segments
·         Each host must meet the system requirements for the version of HAWQ you are installing
For details, refer to the HAWQ release notes: http://hdb.docs.pivotal.io/hdb20/releasenotes/HAWQ20ReleaseNotes.html
·         Hosts that run other PostgreSQL instances cannot be used to run a default HAWQ master or standby service configuration. The default PostgreSQL port (5432) conflicts with the default port assignment of those services. You must either change the default port configuration of the running PostgreSQL instance or change the HAWQ master port setting during the HAWQ service installation to avoid port conflicts (see the port-check sketch at the end of this list).

Note:
The Ambari server node uses PostgreSQL as the default metadata database.
The Hive Metastore uses MySQL as the default metadata database.
·         Each HAWQ segment must be co-located on a host that runs an HDFS data node.
·         The HAWQ master segment and standby master segment must be hosted on separate machines.
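Regarding the PostgreSQL port conflict noted above, a quick way to check whether anything is already listening on the default port 5432 on a candidate master host (a minimal sketch using standard Linux tools; run as root on each candidate host):
netstat -lntp | grep 5432   # any output means port 5432 is already in use on this host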
3.     Prepare the host machines that will run PXF (Pivotal Extension Framework)
·         PXF must be installed on the HDFS NameNode and on all HDFS data nodes.
·         If you have configured Hadoop with high availability, PXF must also be installed on every host that runs a NameNode service.
·         If you want to use PXF with HBase or Hive, you must first install the HBase client (hbase-client) and/or Hive client (hive-client) on each machine where you intend to install PXF. See the HDP installation documentation for more information.

3.     Configure Isilon OneFS (Required if HDFS is deployed on Isilon)
1.      Log in to the Isilon OneFS cluster node using the root account
2.      Change the Isilon OneFS "Access Pattern" from "Concurrency" to "Streaming"
isi get /ifs                                  # show the current settings
isi set -R -a streaming -l streaming /ifs     # recursively apply the streaming access pattern and layout
isi get /ifs                                  # confirm the change
3. Change Isilon HDFS Global Settings
isi hdfs settings view                               # show the current HDFS settings
isi hdfs settings modify --server-threads=auto       # not required for OneFS 8.0
isi hdfs settings modify --default-block-size=128M   # match the HDFS 128 MB block size
isi hdfs settings view                               # confirm the changes
4. Create HAWQ user and group
isi auth groups create gpadmin --gid 503
isi auth users create gpadmin --enabled true --password <password> --uid 503 --primary-group gpadmin   # <password> is a placeholder; supply your own
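To confirm the account was created as intended, you can view it with the OneFS CLI:
isi auth users view gpadmin   # verify the UID, primary group, and enabled status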

4.     Install Apache HAWQ using Ambari (New Installation)
1.      Log in to the Ambari server host machine as the root user
2.      Install HAWQ Ambari plug-in
yum install hdb-ambari-plugin
3. Restart Ambari Server
service ambari-server restart
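Optionally, confirm that the server restarted cleanly before continuing:
service ambari-server status   # should report the Ambari Server as running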
4. Add HDB repository to the Ambari server
·         This step is required if you have already installed an HDP cluster and are adding HDB to the existing cluster
·         This step is not required if you are installing a new HDP cluster and HDB together at the same time

python /var/lib/hawq/add_hdb_repo.py -u admin -p admin --repourl http://yumrepo.bigdata.emc.local/hdp/HDB/hdb-2.0.0.0
5.      Access the Ambari web console at http://hdp-ambari.bigdata.emc.local:8080 and log in with the account admin/admin
6.      Verify that the HDB component is now available
7.      Customize the HDFS configuration based on the table below
| Location | Property (Display Name) | Property (Actual Name) | Value |
|----------|-------------------------|------------------------|-------|
| HDFS->Configs->Settings->DataNode | DataNode max data transfer threads | dfs.datanode.max.transfer.threads | 40960 |
| HDFS->Configs->Advanced->DataNode | DataNode directories permission | dfs.datanode.data.dir.perm | 750 |
| HDFS->Configs->Advanced->General | Access time precision | dfs.namenode.accesstime.precision | 0 |
| HDFS->Configs->Advanced->Advanced hdfs-site | | dfs.block.access.token.enable | false for an unsecured HDFS cluster; true for a secure cluster |
| HDFS->Configs->Advanced->Advanced hdfs-site | HDFS Short-circuit read | dfs.client.read.shortcircuit | true |
| HDFS->Configs->Advanced->Advanced hdfs-site | NameNode Server threads | dfs.namenode.handler.count | 600 |
| HDFS->Configs->Advanced->Advanced hdfs-site | | dfs.support.append | true |
| HDFS->Configs->Advanced->Advanced hdfs-site | | dfs.allow.truncate | true |
| HDFS->Configs->Advanced->Custom hdfs-site | | dfs.block.local-path-access.user | gpadmin |
| HDFS->Configs->Advanced->Custom hdfs-site | | dfs.client.socket-timeout | 300000000 |
| HDFS->Configs->Advanced->Custom hdfs-site | | dfs.client.use.legacy.blockreader.local | false |
| HDFS->Configs->Advanced->Custom hdfs-site | | dfs.datanode.handler.count | 60 |
| HDFS->Configs->Advanced->Custom hdfs-site | | dfs.datanode.socket.write.timeout | 7200000 |
| HDFS->Configs->Advanced->Advanced core-site | | ipc.client.connection.maxidletime | 3600000 |
| HDFS->Configs->Advanced->Custom core-site | | ipc.client.connect.timeout | 300000 |
| HDFS->Configs->Advanced->Custom core-site | | ipc.server.listen.queue.size | 3300 |
Note: HAWQ requires that you enable dfs.allow.truncate. The HAWQ service will fail to start if dfs.allow.truncate is not set to "true".
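After saving the configuration, you can spot-check this critical value from any cluster host with the standard HDFS client (a quick check that reads the client-side configuration):
hdfs getconf -confKey dfs.allow.truncate   # must print "true", otherwise the HAWQ service will fail to start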

8.      If Ambari indicates that a service must be restarted, click "Restart" and allow the service to restart before you continue
9.      Select Actions > Add Service on the home page
10.     Choose Services
·         HAWQ
·         PXF
11.     Assign Masters
·         HDP Master Compute Node: HAWQ Master
·         HDP Worker Compute Node 1: HAWQ Standby Master
Notes:
·         Service "HAWQ Master" and "HAWQ Standby Master" must reside on separate hosts.
·         Service "HAWQ Master" component must not reside on the same host that is used for Hive Metastore if the Hive Metastore uses the new PostgreSQL database.
12.     Assign Slaves and Clients
·         Hosts: Isilon OneFS cluster node, HDP Master Compute node, HDP Worker Compute nodes
·         Components: HAWQ Segment, PXF
Notes:
·         PXF must be installed on the HDFS NameNode, the Standby NameNode (if configured), and on each HDFS DataNode.
·         A HAWQ segment must be installed on each HDFS DataNode.

13.     Customize Services
·         Assign HAWQ System User Password
·         Select "YARN" as Resource Manager
·         Customize the HAWQ configuration based on the table below
| Location | Property (Actual Name) | Value |
|----------|------------------------|-------|
| HAWQ->Configs->Advanced->Advanced hawq-site | hawq_rm_yarn_address | The address and port number of the YARN resource manager server (the value of yarn.resourcemanager.address in yarn-site.xml). For example: rm1.example.com:8050 |
| HAWQ->Configs->Advanced->Advanced hawq-site | hawq_rm_yarn_queue_name | default |
| HAWQ->Configs->Advanced->Advanced hawq-site | hawq_rm_yarn_scheduler_address | The address and port number of the YARN scheduler server (the value of yarn.resourcemanager.scheduler.address in yarn-site.xml). For example: rm1.example.com:8030 |
| HAWQ->Configs->Advanced->Custom yarn-client | yarn.resourcemanager.ha | Required only if YARN high availability is enabled. Comma-delimited list of the fully qualified hostnames of the resource managers; when high availability is enabled, HAWQ ignores the value of hawq_rm_yarn_address and uses this property's value instead. For example: rm1.example.com:8050,rm2.example.com:8050 |
| HAWQ->Configs->Advanced->Custom yarn-client | yarn.resourcemanager.scheduler.ha | Required only if YARN high availability is enabled. Comma-delimited list of the fully qualified hostnames of the resource manager schedulers; when high availability is enabled, HAWQ ignores the value of hawq_rm_yarn_scheduler_address and uses this property's value instead. For example: rm1.example.com:8030,rm2.example.com:8030 |

Notes:
For additional configuration details for this step, refer to step 17 of the HAWQ installation guide: http://hdb.docs.pivotal.io/20/install/install-ambari.html
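To find the resource manager values referenced in the table above, you can read them directly from the YARN client configuration (a minimal sketch; /etc/hadoop/conf is the standard HDP client configuration path):
grep -A 1 'yarn.resourcemanager.address' /etc/hadoop/conf/yarn-site.xml             # value for hawq_rm_yarn_address
grep -A 1 'yarn.resourcemanager.scheduler.address' /etc/hadoop/conf/yarn-site.xml   # value for hawq_rm_yarn_scheduler_address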
14. Review
Carefully review your configuration and then click "Deploy"
15. Complete Installation

5.     Validate HAWQ deployment
1.     Log in to the HAWQ master node using the root account
2.     Run the commands below to connect to the HAWQ instance
su - gpadmin                               # switch to the HAWQ administrative user
source /usr/local/hawq/greenplum_path.sh   # load the HAWQ environment
psql -d postgres                           # connect to the postgres database
3.     Run the SQL commands below to validate HAWQ
create database mytest;
\c mytest
create table t (i int);
insert into t select generate_series(1,100);
\timing
select count(*) from t;
drop table t;
\c template1
drop database mytest;
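The final SELECT should return 100, since the table was populated with generate_series(1,100). As an additional check, HAWQ's own status command reports the state of the cluster (run as gpadmin with greenplum_path.sh sourced, as in step 2):
hawq state   # reports the status of the HAWQ master, standby, and segments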

6.     Install HAWQ benchmark tool pivotalguru
1.      Upload pivotalguru.zip to the directory /home/gpadmin on the HAWQ master node
Notes:
If you use the pivotalguru.zip from GitHub, change the parameters below in the file tpcds_variables.sh:
REPO_URL="/home/gpadmin/pivotalguru/TPC-DS"
GEN_DATA_SCALE="1"
2. Add the lines below to the file /home/gpadmin/.bashrc
source /usr/local/hawq/greenplum_path.sh
export MASTER_DATA_DIRECTORY=/data/hawq/master
3. Run the commands below to install pivotalguru
clear &&
source /usr/local/hawq/greenplum_path.sh &&
export MASTER_DATA_DIRECTORY=/data/hawq/master &&
cd /home/gpadmin &&
unzip -o pivotalguru.zip &&
chown -R gpadmin:gpadmin /home/gpadmin/pivotalguru &&
chmod +x /home/gpadmin/pivotalguru/*.sh &&
cd /home/gpadmin/pivotalguru &&
/home/gpadmin/pivotalguru/tpcds.sh &&
echo "Done"

7.     Run HAWQ benchmark tool pivotalguru
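The tpcds.sh script invoked at the end of the previous section starts the benchmark run (data generation, load, and the TPC-DS queries). To re-run the benchmark later, a minimal sketch assuming the same paths as in the previous section; run as the gpadmin user:
source /usr/local/hawq/greenplum_path.sh &&
export MASTER_DATA_DIRECTORY=/data/hawq/master &&
cd /home/gpadmin/pivotalguru &&
./tpcds.sh &&
echo "Done"
The volume of data generated is controlled by GEN_DATA_SCALE in tpcds_variables.sh, as noted in the previous section.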


4.     References:
1.      Install Apache HAWQ using Ambari
2.      Integrating Pivotal HD and HAWQ with Isilon
3.      HDFS Isilon Integration
4.      How to enable pxf_isilon in HAWQ 1.2.1.1