Hadoop HAWQ Installation using Ambari
1.     Introduction
This document describes how to install HAWQ
(Hadoop with Query) using Apache Ambari.
2.     Environment
This installation guide applies to the following environment.
o    Apache Ambari 2.X
o    Hortonworks HDP 2.X
o    VMware vSphere 5.5 or later
o    RHEL/CentOS 6.5 or later
o    Internet Explorer 10 or later
o    Pivotal HDB (HAWQ) 2.0.0.0
3.     Installation
1.     Overview
Below is an overview of the installation process described in this document:
·         Confirm prerequisites
·         Configure Isilon OneFS (required if HDFS is deployed on Isilon)
·         Install Apache HAWQ using Ambari
·         Validate the HAWQ deployment
·         Install the HAWQ benchmark tool pivotalguru
·         Run the HAWQ benchmark tool pivotalguru
2.     Confirm Prerequisites
1.     Prepare VMware virtualized environment
Before you start the installation process, a VMware virtualized environment must be ready to provision the virtual machines required for this deployment. ESXi 5.5 or a later release is recommended for the virtualized environment.
o  Prepare the host machines that will host HAWQ segments
·         Each host must meet the system requirements for the version of HAWQ you are installing. For details, refer to the HAWQ release notes: http://hdb.docs.pivotal.io/hdb20/releasenotes/HAWQ20ReleaseNotes.html
·         Hosts that run other PostgreSQL instances cannot be used to run a default HAWQ master or standby service configuration. The default PostgreSQL port (5432) conflicts with the default port assignment of those services. You must either change the port configuration of the running PostgreSQL instance or change the HAWQ master port setting during the HAWQ service installation to avoid port conflicts. A quick port check is sketched after this list.
Note: The Ambari server node uses PostgreSQL as its default metadata database, and the Hive Metastore uses MySQL as its default metadata database.
·         Each HAWQ segment must be co-located on a host that runs an HDFS DataNode.
·         The HAWQ master segment and standby master segment must be hosted on separate machines.
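To check whether a prospective HAWQ master or standby host already has a PostgreSQL instance listening on the default port, a minimal check using standard RHEL/CentOS tooling (run as root; any output from the first command means port 5432 is already in use):
ss -lntp | grep ':5432'        # list any process listening on the default PostgreSQL port
ps -ef | grep '[p]ostgres'     # show any running PostgreSQL processes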
                  
o  Prepare the host machines that will run PXF (Pivotal Extension Framework)
·         PXF must be installed on the HDFS NameNode and on all HDFS DataNodes.
·         If you have configured Hadoop with high availability, PXF must also be installed on every node that runs a NameNode service.
·         If you want to use PXF with HBase or Hive, you must first install the HBase client (hbase-client) and/or the Hive client (hive-client) on each machine where you intend to install PXF. See the HDP installation documentation for more information.
3.     Configure Isilon OneFS (Required if HDFS is deployed on Isilon)
1.      Log in to an Isilon OneFS cluster node as the root user
2.      Change the Isilon OneFS "Access Pattern" from "Concurrency" to "Streaming"
isi get /ifs
isi set -R -a streaming -l streaming /ifs
isi get /ifs
3. Change Isilon HDFS Global Settings
isi hdfs settings view
isi hdfs settings modify --server-threads=auto   # not required for OneFS 8.0
isi hdfs settings modify --default-block-size=128M
isi hdfs settings view
4. Create HAWQ user and group
isi auth groups create
gpadmin --gid 503
isi auth users create gpadmin
--enabled true --password  --uid 503
--primary-group gpadmin
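To confirm the account was created with the expected UID and primary group, a minimal check, assuming the standard OneFS "isi auth" CLI syntax:
isi auth users view gpadmin    # verify UID 503 and primary group gpadmin
isi auth groups view gpadmin   # verify GID 503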
4.     Install Apache HAWQ using Ambari (New Installation)
1.      Log in to the Ambari server host machine as the root user
2.      Install HAWQ Ambari plug-in
yum install hdb-ambari-plugin
3. Restart Ambari Server
service ambari-server restart
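Optionally, confirm that the Ambari server came back up after the restart:
ambari-server status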
4. Add the HDB repository to the Ambari server
·         This step is required if you have already installed an HDP cluster and are adding HDB to the existing cluster.
·         This step is not required if you are installing a new HDP cluster and HDB together at the same time.
python /var/lib/hawq/add_hdb_repo.py -u admin -p admin --repourl http://yumrepo.bigdata.emc.local/hdp/HDB/hdb-2.0.0.0
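Before (or after) running the script, you can confirm that the repository URL used above is reachable from the Ambari host; a minimal check using the example URL from this guide:
curl -sI http://yumrepo.bigdata.emc.local/hdp/HDB/hdb-2.0.0.0 | head -n 1    # expect an HTTP 200 (or 301/302) response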
                  
5. Access the Ambari web console at http://hdp-ambari.bigdata.emc.local:8080 and log in with the account admin/admin
6. Verify that the HDB component is now available
7. Customize the HDFS configuration based on the table below
| Location | Property (Display Name) | Property (Actual Name) | Value |
|---|---|---|---|
| HDFS->Configs->Settings->DataNode | DataNode max data transfer threads | dfs.datanode.max.transfer.threads | 40960 |
| HDFS->Configs->Advanced->DataNode | DataNode directories permission | dfs.datanode.data.dir.perm | 750 |
| HDFS->Configs->Advanced->General | Access time precision | dfs.namenode.accesstime.precision | 0 |
| HDFS->Configs->Advanced->Advanced hdfs-site |  | dfs.block.access.token.enable | false for an unsecured HDFS cluster; true for a secure cluster |
| HDFS->Configs->Advanced->Advanced hdfs-site | HDFS Short-circuit read | dfs.client.read.shortcircuit | true |
| HDFS->Configs->Advanced->Advanced hdfs-site | NameNode Server threads | dfs.namenode.handler.count | 600 |
| HDFS->Configs->Advanced->Advanced hdfs-site |  | dfs.support.append | true |
| HDFS->Configs->Advanced->Advanced hdfs-site |  | dfs.allow.truncate | true |
| HDFS->Configs->Advanced->Custom hdfs-site |  | dfs.block.local-path-access.user | gpadmin |
| HDFS->Configs->Advanced->Custom hdfs-site |  | dfs.client.socket-timeout | 300000000 |
| HDFS->Configs->Advanced->Custom hdfs-site |  | dfs.client.use.legacy.blockreader.local | false |
| HDFS->Configs->Advanced->Custom hdfs-site |  | dfs.datanode.handler.count | 60 |
| HDFS->Configs->Advanced->Custom hdfs-site |  | dfs.datanode.socket.write.timeout | 7200000 |
| HDFS->Configs->Advanced->Advanced core-site |  | ipc.client.connection.maxidletime | 3600000 |
| HDFS->Configs->Advanced->Custom core-site |  | ipc.client.connect.timeout | 300000 |
| HDFS->Configs->Advanced->Custom core-site |  | ipc.server.listen.queue.size | 3300 |
Note: HAWQ requires that you enable dfs.allow.truncate. The HAWQ service will fail to start if dfs.allow.truncate is not set to "true".
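After saving the configuration and restarting the affected services, you can optionally confirm on a cluster node that a setting was picked up, using the standard hdfs getconf utility with the property names from the table above:
hdfs getconf -confKey dfs.allow.truncate                   # expect: true
hdfs getconf -confKey dfs.datanode.max.transfer.threads    # expect: 40960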
                  
8. If Ambari indicates that a service must be restarted, click "Restart" and allow the service to restart before you continue.
9. Select Actions > Add Service on the Ambari home page.
10. Choose Services
·         HAWQ
·         PXF
11. Assign Masters
·         HDP Master Compute Node: HAWQ Master
·         HDP Worker Compute Node 1: HAWQ Standby Master
Notes:
·         The "HAWQ Master" and "HAWQ Standby Master" components must reside on separate hosts.
·         The "HAWQ Master" component must not reside on the same host that is used for the Hive Metastore if the Hive Metastore uses the new PostgreSQL database.
12. Assign Slaves and Clients
Assign the HAWQ Segment and PXF components to the following hosts:
·         Isilon OneFS cluster node
·         HDP Master Compute node
·         HDP Worker Compute nodes
Notes:
·         PXF must be installed on the HDFS NameNode, the Standby NameNode (if configured), and on each HDFS DataNode.
·         A HAWQ segment must be installed on each HDFS DataNode.
13. Customize Services
·         Assign the HAWQ System User password
·         Select "YARN" as the Resource Manager
·         Customize the HAWQ configuration based on the table below
| Location | Property (Actual Name) | Value |
|---|---|---|
| HAWQ->Configs->Advanced->Advanced hawq-site | hawq_rm_yarn_address | The address and port of the YARN resource manager server (the value of yarn.resourcemanager.address in yarn-site.xml). For example: rm1.example.com:8050 |
| HAWQ->Configs->Advanced->Advanced hawq-site | hawq_rm_yarn_queue_name | default |
| HAWQ->Configs->Advanced->Advanced hawq-site | hawq_rm_yarn_scheduler_address | The address and port of the YARN scheduler server (the value of yarn.resourcemanager.scheduler.address in yarn-site.xml). For example: rm1.example.com:8030 |
| HAWQ->Configs->Advanced->Custom yarn-client | yarn.resourcemanager.ha | Not required unless YARN high availability is enabled. Comma-delimited list of the fully qualified hostnames of the resource managers. When high availability is enabled, HAWQ ignores the value in hawq_rm_yarn_address and uses this property's value instead. For example: rm1.example.com:8050,rm2.example.com:8050 |
| HAWQ->Configs->Advanced->Custom yarn-client | yarn.resourcemanager.scheduler.ha | Not required unless YARN high availability is enabled. Comma-delimited list of the fully qualified hostnames of the resource manager schedulers. When high availability is enabled, HAWQ ignores the value in hawq_rm_yarn_scheduler_address and uses this property's value instead. For example: rm1.example.com:8030,rm2.example.com:8030 |
Notes:
For additional configuration options for this step, refer to step 17 at this link: http://hdb.docs.pivotal.io/20/install/install-ambari.html
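To look up the values needed for hawq_rm_yarn_address and hawq_rm_yarn_scheduler_address, you can read them from yarn-site.xml on any cluster node; a minimal sketch, assuming the usual HDP client configuration path /etc/hadoop/conf:
grep -A1 'yarn.resourcemanager.address' /etc/hadoop/conf/yarn-site.xml            # use this value for hawq_rm_yarn_address
grep -A1 'yarn.resourcemanager.scheduler.address' /etc/hadoop/conf/yarn-site.xml  # use this value for hawq_rm_yarn_scheduler_address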
14. Review
Carefully review your configuration and then click "Deploy".
15. Complete Installation
5.     Validate HAWQ deployment
1.     Log in to the HAWQ master node as the root user
2.     Run the commands below to connect to the HAWQ instance
su - gpadmin
source /usr/local/hawq/greenplum_path.sh
psql -d postgres
3.     Run the SQL commands below to validate HAWQ
create database mytest;
\c mytest
create table t (i int);
insert into t select generate_series(1,100);
\timing
select count(*) from t;
drop table t;
\c template1
drop database mytest;
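As an additional check, you can ask the HAWQ master for the overall cluster status with the hawq command-line utility (run as gpadmin with greenplum_path.sh sourced, as above); a minimal sketch:
hawq state    # reports the status of the master, standby, and segments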
6.     Install HAWQ benchmark tool pivotalguru
1.      Upload pivotalguru.zip to the directory /home/gpadmin on the HAWQ master node
Notes:
If you use the pivotalguru.zip from GitHub, change the following parameters in the file tpcds_variables.sh:
REPO_URL="/home/gpadmin/pivotalguru/TPC-DS"
GEN_DATA_SCALE="1"
2. Add the lines below to the file /home/gpadmin/.bashrc
source /usr/local/hawq/greenplum_path.sh
export MASTER_DATA_DIRECTORY=/data/hawq/master
3. Run the command below to install pivotalguru
clear &&
source /usr/local/hawq/greenplum_path.sh &&
export MASTER_DATA_DIRECTORY=/data/hawq/master &&
cd /home/gpadmin &&
unzip -o pivotalguru.zip &&
chown -R gpadmin:gpadmin /home/gpadmin/pivotalguru &&
chmod +x /home/gpadmin/pivotalguru/*.sh &&
cd /home/gpadmin/pivotalguru &&
/home/gpadmin/pivotalguru/tpcds.sh &&
echo "Done"
7.     Run HAWQ benchmark tool pivotalguru
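The tpcds.sh invocation at the end of the installation command above already runs the benchmark. To re-run it later, a minimal sketch using the same paths as above:
su - gpadmin
cd /home/gpadmin/pivotalguru
./tpcds.sh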
4.     References:
1.      Install Apache HAWQ using Ambari
2.      Integrating Pivotal HD and HAWQ with Isilon
3.      HDFS Isilon Integration
4.      How to enable pxf_isilon in HAWQ 1.2.1.1