Hadoop HAWQ Installation using Ambari
1.     Introduction
This document describes how to install HAWQ
(Hadoop with Query) using Apache Ambari.
2.     Environment
This installation guide applies to the following environment.
o    Apache Ambari 2.X
o    Hortonworks HDP 2.X
o    VMware vSphere 5.5 or later
o    RHEL/CentOS 6.5 or later
o    Internet Explorer 10 or later
o    Pivotal HDB (HAWQ) 2.0.0.0
3.     Installation
1.     Overview
Below is an overview of the installation process described in this document:
·         Confirm prerequisites
·         Configure Isilon OneFS (required if HDFS is deployed on Isilon)
·         Install Apache HAWQ using Ambari
·         Validate the HAWQ deployment
·         Install the HAWQ benchmark tool pivotalguru
·         Run the HAWQ benchmark tool pivotalguru
2.     Confirm Prerequisites
1.     Prepare VMware virtualized environment
Before you start the installation process, a VMware virtualized environment must be ready to provision the virtual machines required for this deployment. ESXi 5.5 or a later release is recommended for the virtualized environment.
o  Prepare the host machines that will host HAWQ segments
·         Each host must meet the system requirements for the version of HAWQ you are installing. For details, refer to the HAWQ release notes: http://hdb.docs.pivotal.io/hdb20/releasenotes/HAWQ20ReleaseNotes.html
·         Hosts that run other PostgreSQL instances cannot be used to run a default HAWQ master or standby service configuration. The default PostgreSQL port (5432) conflicts with the default port assignment of those services. You must either change the port configuration of the running PostgreSQL instance or change the HAWQ master port setting during the HAWQ service installation to avoid port conflicts. A quick port check is sketched after this list.
Note: The Ambari server node uses PostgreSQL as its default metadata database, and the Hive Metastore uses MySQL as its default metadata database.
·         Each HAWQ segment must be co-located on a host that runs an HDFS DataNode.
·         The HAWQ master segment and standby master segment must be hosted on separate machines.
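To check whether a prospective HAWQ master or standby host already has a PostgreSQL instance listening on the default port, a minimal check using standard RHEL/CentOS tooling (run as root; any output from the first command means port 5432 is already in use):
ss -lntp | grep ':5432'        # list any process listening on the default PostgreSQL port
ps -ef | grep '[p]ostgres'     # show any running PostgreSQL processes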
                  
o  Prepare the host machines that will run PXF (Pivotal Extension Framework)
·         PXF must be installed on the HDFS NameNode and on all HDFS DataNodes.
·         If you have configured Hadoop with high availability, PXF must also be installed on every node that runs a NameNode service.
·         If you want to use PXF with HBase or Hive, you must first install the HBase client (hbase-client) and/or the Hive client (hive-client) on each machine where you intend to install PXF. See the HDP installation documentation for more information.
3.     Configure Isilon OneFS (Required if HDFS is deployed on Isilon)
1.      Log in to an Isilon OneFS cluster node as the root user
2.      Change the Isilon OneFS "Access Pattern" from "Concurrency" to "Streaming"
isi get /ifs
isi set -R -a streaming -l streaming /ifs
isi get /ifs
3. Change Isilon HDFS Global Settings
isi hdfs settings view
isi hdfs settings modify --server-threads=auto   # not required for OneFS 8.0
isi hdfs settings modify --default-block-size=128M
isi hdfs settings view
4. Create HAWQ user and group
isi auth groups create
gpadmin --gid 503
isi auth users create gpadmin
--enabled true --password  --uid 503
--primary-group gpadmin
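To confirm the account was created with the expected UID and primary group, a minimal check, assuming the standard OneFS "isi auth" CLI syntax:
isi auth users view gpadmin    # verify UID 503 and primary group gpadmin
isi auth groups view gpadmin   # verify GID 503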
4.     Install Apache HAWQ using Ambari (New Installation)
1.      Log in to the Ambari server host machine as the root user
2.      Install HAWQ Ambari plug-in
yum install hdb-ambari-plugin
3. Restart Ambari Server
service ambari-server restart
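Optionally, confirm that the Ambari server came back up after the restart:
ambari-server status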
4. Add the HDB repository to the Ambari server
·         This step is required if you have already installed an HDP cluster and are adding HDB to the existing cluster.
·         This step is not required if you are installing a new HDP cluster and HDB together at the same time.
python /var/lib/hawq/add_hdb_repo.py -u admin -p admin --repourl http://yumrepo.bigdata.emc.local/hdp/HDB/hdb-2.0.0.0
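Before (or after) running the script, you can confirm that the repository URL used above is reachable from the Ambari host; a minimal check using the example URL from this guide:
curl -sI http://yumrepo.bigdata.emc.local/hdp/HDB/hdb-2.0.0.0 | head -n 1    # expect an HTTP 200 (or 301/302) response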
                  
5. Access the Ambari web console at http://hdp-ambari.bigdata.emc.local:8080 and log in with the account admin/admin
6. Verify that the HDB component is now available
7. Customize the HDFS configuration based on the table below
| Location | Property (Display Name) | Property (Actual Name) | Value |
|---|---|---|---|
| HDFS->Configs->Settings->DataNode | DataNode max data transfer threads | dfs.datanode.max.transfer.threads | 40960 |
| HDFS->Configs->Advanced->DataNode | DataNode directories permission | dfs.datanode.data.dir.perm | 750 |
| HDFS->Configs->Advanced->General | Access time precision | dfs.namenode.accesstime.precision | 0 |
| HDFS->Configs->Advanced->Advanced hdfs-site |  | dfs.block.access.token.enable | false for an unsecured HDFS cluster; true for a secure cluster |
| HDFS->Configs->Advanced->Advanced hdfs-site | HDFS Short-circuit read | dfs.client.read.shortcircuit | true |
| HDFS->Configs->Advanced->Advanced hdfs-site | NameNode Server threads | dfs.namenode.handler.count | 600 |
| HDFS->Configs->Advanced->Advanced hdfs-site |  | dfs.support.append | true |
| HDFS->Configs->Advanced->Advanced hdfs-site |  | dfs.allow.truncate | true |
| HDFS->Configs->Advanced->Custom hdfs-site |  | dfs.block.local-path-access.user | gpadmin |
| HDFS->Configs->Advanced->Custom hdfs-site |  | dfs.client.socket-timeout | 300000000 |
| HDFS->Configs->Advanced->Custom hdfs-site |  | dfs.client.use.legacy.blockreader.local | false |
| HDFS->Configs->Advanced->Custom hdfs-site |  | dfs.datanode.handler.count | 60 |
| HDFS->Configs->Advanced->Custom hdfs-site |  | dfs.datanode.socket.write.timeout | 7200000 |
| HDFS->Configs->Advanced->Advanced core-site |  | ipc.client.connection.maxidletime | 3600000 |
| HDFS->Configs->Advanced->Custom core-site |  | ipc.client.connect.timeout | 300000 |
| HDFS->Configs->Advanced->Custom core-site |  | ipc.server.listen.queue.size | 3300 |
Note: HAWQ requires that you enable dfs.allow.truncate. The HAWQ service will fail to start if dfs.allow.truncate is not set to "true".
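After saving the configuration and restarting the affected services, you can optionally confirm on a cluster node that a setting was picked up, using the standard hdfs getconf utility with the property names from the table above:
hdfs getconf -confKey dfs.allow.truncate                   # expect: true
hdfs getconf -confKey dfs.datanode.max.transfer.threads    # expect: 40960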
                  
8. If Ambari indicates that a service must be restarted, click "Restart" and allow the service to restart before you continue.
9. Select Actions > Add Service on the Ambari home page.
10. Choose Services
·         HAWQ
·         PXF
11. Assign Masters
·         HDP Master Compute Node: HAWQ Master
·         HDP Worker Compute Node 1: HAWQ Standby Master
Notes:
·         The "HAWQ Master" and "HAWQ Standby Master" components must reside on separate hosts.
·         The "HAWQ Master" component must not reside on the same host that is used for the Hive Metastore if the Hive Metastore uses the new PostgreSQL database.
12. Assign Slaves and Clients
Assign the HAWQ Segment and PXF components to the following hosts:
·         Isilon OneFS cluster node
·         HDP Master Compute node
·         HDP Worker Compute nodes
Notes:
·         PXF must be installed on the HDFS NameNode, the Standby NameNode (if configured), and on each HDFS DataNode.
·         A HAWQ segment must be installed on each HDFS DataNode.
13. Customize Services
·         Assign the HAWQ System User password
·         Select "YARN" as the Resource Manager
·         Customize the HAWQ configuration based on the table below
| Location | Property (Actual Name) | Value |
|---|---|---|
| HAWQ->Configs->Advanced->Advanced hawq-site | hawq_rm_yarn_address | The address and port of the YARN resource manager server (the value of yarn.resourcemanager.address in yarn-site.xml). For example: rm1.example.com:8050 |
| HAWQ->Configs->Advanced->Advanced hawq-site | hawq_rm_yarn_queue_name | default |
| HAWQ->Configs->Advanced->Advanced hawq-site | hawq_rm_yarn_scheduler_address | The address and port of the YARN scheduler server (the value of yarn.resourcemanager.scheduler.address in yarn-site.xml). For example: rm1.example.com:8030 |
| HAWQ->Configs->Advanced->Custom yarn-client | yarn.resourcemanager.ha | Not required unless YARN high availability is enabled. Comma-delimited list of the fully qualified hostnames of the resource managers. When high availability is enabled, HAWQ ignores the value in hawq_rm_yarn_address and uses this property's value instead. For example: rm1.example.com:8050,rm2.example.com:8050 |
| HAWQ->Configs->Advanced->Custom yarn-client | yarn.resourcemanager.scheduler.ha | Not required unless YARN high availability is enabled. Comma-delimited list of the fully qualified hostnames of the resource manager schedulers. When high availability is enabled, HAWQ ignores the value in hawq_rm_yarn_scheduler_address and uses this property's value instead. For example: rm1.example.com:8030,rm2.example.com:8030 |
Notes:
For additional configuration options for this step, refer to step 17 at this link: http://hdb.docs.pivotal.io/20/install/install-ambari.html
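To look up the values needed for hawq_rm_yarn_address and hawq_rm_yarn_scheduler_address, you can read them from yarn-site.xml on any cluster node; a minimal sketch, assuming the usual HDP client configuration path /etc/hadoop/conf:
grep -A1 'yarn.resourcemanager.address' /etc/hadoop/conf/yarn-site.xml            # use this value for hawq_rm_yarn_address
grep -A1 'yarn.resourcemanager.scheduler.address' /etc/hadoop/conf/yarn-site.xml  # use this value for hawq_rm_yarn_scheduler_address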
14. Review
Carefully review your configuration and then click "Deploy".
15. Complete Installation
5.     Validate HAWQ deployment
1.     Log in to the HAWQ master node as the root user
2.     Run the commands below to connect to the HAWQ instance
su - gpadmin
source /usr/local/hawq/greenplum_path.sh
psql -d postgres
3.     Run the SQL commands below to validate HAWQ
create database mytest;
\c mytest
create table t (i int);
insert into t select generate_series(1,100);
\timing
select count(*) from t;
drop table t;
\c template1
drop database mytest;
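As an additional check, you can ask the HAWQ master for the overall cluster status with the hawq command-line utility (run as gpadmin with greenplum_path.sh sourced, as above); a minimal sketch:
hawq state    # reports the status of the master, standby, and segments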
6.     Install HAWQ benchmark tool pivotalguru
1.      Upload pivotalguru.zip to the directory /home/gpadmin on the HAWQ master node
Notes:
If you use the pivotalguru.zip from GitHub, change the following parameters in the file tpcds_variables.sh:
REPO_URL="/home/gpadmin/pivotalguru/TPC-DS"
GEN_DATA_SCALE="1"
2. Add the lines below to the file /home/gpadmin/.bashrc
source /usr/local/hawq/greenplum_path.sh
export MASTER_DATA_DIRECTORY=/data/hawq/master
3. Run the command below to install pivotalguru
clear &&
source /usr/local/hawq/greenplum_path.sh &&
export MASTER_DATA_DIRECTORY=/data/hawq/master &&
cd /home/gpadmin &&
unzip -o pivotalguru.zip &&
chown -R gpadmin:gpadmin /home/gpadmin/pivotalguru &&
chmod +x /home/gpadmin/pivotalguru/*.sh &&
cd /home/gpadmin/pivotalguru &&
/home/gpadmin/pivotalguru/tpcds.sh &&
echo "Done"
7.     Run HAWQ benchmark tool pivotalguru
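The tpcds.sh invocation at the end of the installation command above already runs the benchmark. To re-run it later, a minimal sketch using the same paths as above:
su - gpadmin
cd /home/gpadmin/pivotalguru
./tpcds.sh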
4.     References:
1.      Install Apache HAWQ using Ambari
2.      Integrating Pivotal HD and HAWQ with Isilon
3.      HDFS Isilon Integration
4.      How to enable pxf_isilon in HAWQ 1.2.1.1