Hadoop HAWQ Installation using Ambari
1. Introduction
This document describes how to install HAWQ (Hadoop With Query) using Apache Ambari.
2. Environment
This installation guide applies to the following environment.
o Apache Ambari 2.X
o Hortonworks HDP 2.X
o VMware vSphere 5.5 or later
o RHEL/CentOS 6.5 or later
o Internet Explorer 10 or later
o Pivotal HDB (HAWQ) 2.0.0.0
3. Installation
1. Overview
Below is an overview of the installation process that this document describes.
· Confirm prerequisites
· Configure Isilon OneFS (required if HDFS is deployed on Isilon)
· Install Apache HAWQ using Ambari
· Validate the HAWQ deployment
· Install the HAWQ benchmark tool pivotalguru
· Run the HAWQ benchmark tool pivotalguru
2. Confirm Prerequisites
1. Prepare VMware virtualized environment
Before you start the installation process, a VMware virtualized environment must be ready to provision the virtual machines required for this deployment. ESXi 5.5 or later is recommended for the virtualized environment.
2. Prepare the host machines that will host HAWQ segments
· Each host must meet the system requirements for the version of HAWQ you are installing. For details, refer to the HAWQ release notes: http://hdb.docs.pivotal.io/hdb20/releasenotes/HAWQ20ReleaseNotes.html
· Hosts that run other PostgreSQL instances cannot be used to run a default HAWQ master or standby service configuration, because the default PostgreSQL port (5432) conflicts with the default port assignment of those services. You must either change the port configuration of the running PostgreSQL instance or change the HAWQ master port setting during the HAWQ service installation to avoid port conflicts (see the port check sketched after this list).
Note:
The Ambari server node uses PostgreSQL as the default metadata database.
The Hive Metastore uses MySQL as the default metadata database.
· Each HAWQ segment must be co-located on a host that runs an HDFS DataNode.
· The HAWQ master segment and standby master segment must be hosted on separate machines.
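The following is a minimal, hypothetical pre-flight check (not part of the official procedure; assumes standard RHEL/CentOS tooling) that you can run on the intended HAWQ master and standby hosts to confirm that nothing is already listening on the default port 5432:
# Hedged sketch: report whether the default HAWQ/PostgreSQL port is already in use
netstat -lntp | grep ':5432' || echo "Port 5432 is free"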
3. Prepare the host machines that will run PXF (Pivotal Extension Framework)
· PXF must be installed on the HDFS NameNode and on all HDFS DataNodes.
· If you have configured Hadoop with high availability, PXF must also be installed on all HDFS nodes, including every NameNode host.
· If you want to use PXF with HBase or Hive, you must first install the HBase client (hbase-client) and/or Hive client (hive-client) on each machine where you intend to install PXF. See the HDP installation documentation for more information. (A hedged install sketch follows this list.)
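As a rough sketch only, the client packages can be installed from the HDP yum repository before deploying PXF; the package names below are assumptions and may differ in your repository:
# Hedged sketch: install HBase and Hive client packages on a future PXF node
yum install -y hbase hive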
3. Configure Isilon OneFS (Required if HDFS is deployed on Isilon)
1. Log in to an Isilon OneFS cluster node using the root account
2. Change the Isilon OneFS "Access Pattern" from "Concurrency" to "Streaming"
isi get /ifs
isi set -R -a streaming -l streaming /ifs
isi get /ifs
3. Change the Isilon HDFS global settings
isi hdfs settings view
isi hdfs settings modify --server-threads=auto   # not required for OneFS 8.0
isi hdfs settings modify --default-block-size=128M
isi hdfs settings view
4. Create the HAWQ user and group
isi auth groups create gpadmin --gid 503
isi auth users create gpadmin --enabled true --password --uid 503 --primary-group gpadmin
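Optionally, you can confirm the new account and group on OneFS. This is a hedged verification sketch, not part of the original procedure:
# Hedged sketch: verify the gpadmin group and user exist with the expected IDs
isi auth groups view gpadmin
isi auth users view gpadmin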
4. Install Apache HAWQ using Ambari (New Installation)
1. Log in to the Ambari server host machine as the root user
2. Install HAWQ Ambari plug-in
yum install hdb-ambari-plugin
3. Restart Ambari Server
service ambari-server restart
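Optionally, confirm that the Ambari server came back up before continuing (a hedged check, not in the original procedure):
service ambari-server status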
4. Add the HDB repository to the Ambari server
· This step is required if you have already installed an HDP cluster and are adding HDB to the existing cluster.
· This step is not required if you are installing a new HDP cluster and HDB together at the same time.
python /var/lib/hawq/add_hdb_repo.py -u admin -p admin --repourl http://yumrepo.bigdata.emc.local/hdp/HDB/hdb-2.0.0.0
5. Access the Ambari web console at http://hdp-ambari.bigdata.emc.local:8080 and log in with the account admin/admin
6. Verify that the HDB component is now available
7. Customize the HDFS configuration based on the table below
Location | Property (Display Name) | Property (Actual Name) | Value
HDFS->Configs->Settings->DataNode | DataNode max data transfer threads | dfs.datanode.max.transfer.threads | 40960
HDFS->Configs->Advanced->DataNode | DataNode directories permission | dfs.datanode.data.dir.perm | 750
HDFS->Configs->Advanced->General | Access time precision | dfs.namenode.accesstime.precision | 0
HDFS->Configs->Advanced->Advanced hdfs-site | - | dfs.block.access.token.enable | false for an unsecured HDFS cluster; true for a secure cluster
HDFS->Configs->Advanced->Advanced hdfs-site | HDFS Short-circuit read | dfs.client.read.shortcircuit | true
HDFS->Configs->Advanced->Advanced hdfs-site | NameNode Server threads | dfs.namenode.handler.count | 600
HDFS->Configs->Advanced->Advanced hdfs-site | - | dfs.support.append | true
HDFS->Configs->Advanced->Advanced hdfs-site | - | dfs.allow.truncate | true
HDFS->Configs->Advanced->Custom hdfs-site | - | dfs.block.local-path-access.user | gpadmin
HDFS->Configs->Advanced->Custom hdfs-site | - | dfs.client.socket-timeout | 300000000
HDFS->Configs->Advanced->Custom hdfs-site | - | dfs.client.use.legacy.blockreader.local | false
HDFS->Configs->Advanced->Custom hdfs-site | - | dfs.datanode.handler.count | 60
HDFS->Configs->Advanced->Custom hdfs-site | - | dfs.datanode.socket.write.timeout | 7200000
HDFS->Configs->Advanced->Advanced core-site | - | ipc.client.connection.maxidletime | 3600000
HDFS->Configs->Advanced->Custom core-site | - | ipc.client.connect.timeout | 300000
HDFS->Configs->Advanced->Custom core-site | - | ipc.server.listen.queue.size | 3300
Note: HAWQ requires that you enable dfs.allow.truncate. The HAWQ service will fail to start if dfs.allow.truncate is not set to "true".
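These properties are normally edited in the Ambari web UI. As an alternative, Ambari ships a command-line helper that can set individual properties; the lines below are an illustrative sketch only (the cluster name MyCluster and the admin/admin credentials are placeholders) and show a single property:
# Hedged sketch: set dfs.allow.truncate in hdfs-site through Ambari's configs.sh helper
cd /var/lib/ambari-server/resources/scripts
./configs.sh -u admin -p admin set hdp-ambari.bigdata.emc.local MyCluster hdfs-site dfs.allow.truncate true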
8. If Ambari indicates that a service must be restarted, click "Restart" and allow the service to restart before you continue
9. Select Actions > Add Service on the home page
10. Choose Services
· HAWQ
· PXF
11. Assign Masters
· HDP Master Compute Node: HAWQ Master
· HDP Worker Compute Node 1: HAWQ Standby Master
Notes:
· The "HAWQ Master" and "HAWQ Standby Master" services must reside on separate hosts.
· The "HAWQ Master" component must not reside on the same host that is used for the Hive Metastore if the Hive Metastore uses the new PostgreSQL database.
12. Assign Slaves and Clients
· Isilon OneFS cluster node
· HDP Master Compute node
· HDP Worker Compute nodes
· HAWQ Segment
· PXF
Notes:
· PXF must be installed on the HDFS NameNode, the Standby NameNode (if configured), and on each HDFS DataNode.
· A HAWQ segment must be installed on each HDFS DataNode.
13. Customized Services
· Assign the HAWQ system user password
· Select "YARN" as the resource manager
· Customize the HAWQ configuration based on the table below
Location | Property (Actual Name) | Value
HAWQ->Configs->Advanced->Advanced hawq-site | hawq_rm_yarn_address | The address and port number of the YARN resource manager server (the value of yarn.resourcemanager.address in yarn-site.xml), for example rm1.example.com:8050
HAWQ->Configs->Advanced->Advanced hawq-site | hawq_rm_yarn_queue_name | default
HAWQ->Configs->Advanced->Advanced hawq-site | hawq_rm_yarn_scheduler_address | The address and port number of the YARN scheduler server (the value of yarn.resourcemanager.scheduler.address in yarn-site.xml), for example rm1.example.com:8030
HAWQ->Configs->Advanced->Custom yarn-client | yarn.resourcemanager.ha | Not required if YARN HA is not enabled. Comma-delimited list of the fully qualified hostnames of the resource managers. When high availability is enabled, HAWQ ignores the value in hawq_rm_yarn_address and uses this property's value instead. For example: rm1.example.com:8050,rm2.example.com:8050
HAWQ->Configs->Advanced->Custom yarn-client | yarn.resourcemanager.scheduler.ha | Not required if YARN HA is not enabled. Comma-delimited list of the fully qualified hostnames of the resource manager schedulers. When high availability is enabled, HAWQ ignores the value in hawq_rm_yarn_scheduler_address and uses this property's value instead. For example: rm1.example.com:8030,rm2.example.com:8030
Notes:
For more configuration options for this step, refer to step 17 of http://hdb.docs.pivotal.io/20/install/install-ambari.html
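If you need to look up the values for hawq_rm_yarn_address and hawq_rm_yarn_scheduler_address, you can read them from the active yarn-site.xml on a cluster node. A minimal sketch follows; the configuration path below is the usual HDP default and may differ on your cluster:
# Hedged sketch: print the resource manager and scheduler addresses from yarn-site.xml
grep -A 1 -e 'yarn.resourcemanager.address' -e 'yarn.resourcemanager.scheduler.address' /etc/hadoop/conf/yarn-site.xml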
14. Review
Carefully review your configuration and then click "Deploy"
15. Complete Installation
5. Validate HAWQ deployment
1. Log in to the HAWQ master node using the root account
2. Run the commands below to connect to the HAWQ instance
su - gpadmin
source /usr/local/hawq/greenplum_path.sh
psql -d postgres
3. Run the SQL commands below to validate HAWQ
create database mytest;
\c mytest
create table t (i int);
insert into t select generate_series(1,100);
\timing
select count(*) from t;
drop table t;
\c template1
drop database mytest;
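In addition to the SQL check above, you can ask HAWQ itself for cluster status. The lines below are a hedged sketch run as the gpadmin user on the master node:
# Hedged sketch: report master, standby, and segment status
source /usr/local/hawq/greenplum_path.sh
hawq state
# Hedged sketch: list the registered segments from the catalog
psql -d postgres -c "select * from gp_segment_configuration;"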
6. Install HAWQ benchmark tool pivotalguru
1. Upload pivotalguru.zip to the /home/gpadmin directory on the HAWQ master node
Notes:
If you use the pivotalguru.zip from GitHub, change the parameters below in the file tpcds_variables.sh
REPO_URL="/home/gpadmin/pivotalguru/TPC-DS"
GEN_DATA_SCALE="1"
2. Add the lines below to the file /home/gpadmin/.bashrc
source /usr/local/hawq/greenplum_path.sh
export MASTER_DATA_DIRECTORY=/data/hawq/master
3. Run the command below to install pivotalguru
clear &&
source /usr/local/hawq/greenplum_path.sh &&
export MASTER_DATA_DIRECTORY=/data/hawq/master &&
cd /home/gpadmin &&
unzip -o pivotalguru.zip &&
chown -R gpadmin:gpadmin /home/gpadmin/pivotalguru &&
chmod +x /home/gpadmin/pivotalguru/*.sh &&
cd /home/gpadmin/pivotalguru &&
/home/gpadmin/pivotalguru/tpcds.sh &&
echo "Done"
7. Run HAWQ benchmark tool pivotalguru
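The installation command in the previous section already launches tpcds.sh once. To run the benchmark again (for example, after changing GEN_DATA_SCALE in tpcds_variables.sh), the following hedged sketch re-runs it as the gpadmin user:
# Hedged sketch: re-run the pivotalguru TPC-DS benchmark
su - gpadmin
cd /home/gpadmin/pivotalguru
./tpcds.sh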
4. References:
1. Install Apache HAWQ using Ambari
2. Integrating Pivotal HD and HAWQ with Isilon
3. HDFS Isilon Integration
4. How to enable pxf_isilon in HAWQ 1.2.1.1