Thursday, March 31, 2016

Big Data Open Source Stack vs Traditional Stack

Introduction
IT and business executives ask us this question all the time. No, not "what is big data and how can it be useful for me?" (that too), but rather: can you tell me, in simple terms, how big data compares to my traditional BI architecture stack?
In the sections below, we compare the most prominent open source big data initiatives with the traditional stack and give you a glimpse of the value the open source initiatives bring at each layer. We also briefly explain each initiative and address some of the most commonly asked questions.
In Part II, we will map each layer of the open source stack to commercial vendors and explain what they bring to the table. Our aim is to help you connect the dots between open source and the marketing emails you are being flooded with!
Obviously there are further technicalities to consider when it comes to implementation, but those are beyond the scope of this post.
Open Source Big Data Stack vs. Traditional BI/Analytics Stack
The diagram below compares the different layers of an analytics stack, shows the open source initiatives that fit into each layer, and highlights the value you gain at each layer with the open source stack.


OK, have you spent at least two minutes looking at the picture above? If not, please go back and look at it again! Let us take this from the bottom up…
·         Infrastructure: When a node stops working, the system redirects the work to another node that holds a copy of the same data, and processing continues without missing a beat.
·         Data Storage: All of the open source stack initiatives use file-based storage. Come to think of it, traditional RDBMSs are also file based underneath, but you interact with a traditional RDBMS using SQL, whereas in Hadoop you interact with the files themselves. What does this mean? Simplistically, you move files into Hadoop (more precisely into HDFS, the Hadoop Distributed File System) and write code that reads those files and writes to another set of files (a minimal sketch follows this bullet). Yes, there are also SQL-like languages that manage some of this interaction for you.
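To make this concrete, here is a minimal sketch of what "interacting with files" looks like from the command line; the directory and file names are hypothetical:

# Copy a local extract into HDFS (paths and file names are examples only)
hdfs dfs -mkdir -p /data/sales
hdfs dfs -put sales_2016.csv /data/sales/
# List and peek at what landed in HDFS
hdfs dfs -ls /data/sales
hdfs dfs -cat /data/sales/sales_2016.csv | head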
·         Data Ingestion and Processing: Just as in traditional BI, where you extract data from a source, load it, and apply logic to it, the same process applies to the Hadoop ecosystem. The primary mechanism for doing this is MapReduce, a programming framework in which you code your input (ingestion), your logic, and your output (storage). Since source data may arrive in many formats (just as in traditional BI, where you may extract data from SFDC, Oracle, etc.), there are mechanisms available to simplify such tasks: Sqoop moves data between an RDBMS and Hadoop (a sample Sqoop command follows this bullet), while Flume reads streaming data (e.g. Twitter feeds) and passes it to Hadoop. And just as your ETL process may consist of multiple steps orchestrated by a native or third-party workflow engine, an Oozie workflow lets you group multiple MapReduce jobs into a functional workflow and manage dependencies and exceptions. Another alternative to hand-coding MapReduce is Pig Latin, which provides a framework to LOAD, TRANSFORM, and DUMP/STORE data (internally this is compiled into a set of MapReduce jobs).
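To illustrate the ingestion tooling, here is a hedged sketch of a Sqoop import; the JDBC URL, credentials, table, and target directory are placeholders, not a working configuration:

# Pull a relational table into HDFS with Sqoop (connection details are placeholders)
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 4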

Hold on! There is now a top-level initiative called Apache Spark. Spark is an in-memory data processing framework that enables logic processing up to 100x faster than MapReduce (as if MapReduce was not fast enough!). Spark currently offers three well-defined sets of functionality: Spark SQL (which grew out of Shark and processes SQL), Spark Streaming, and MLlib for machine learning.
·         Data Access and Data Management: Once you have the data in Hadoop, you need a mechanism to access it, right? That is where Hive and Spark come in. Hive lets people who know SQL interact with Hadoop using a very SQL-like syntax (see the example after this bullet). Under the covers, Hive queries are converted into MapReduce jobs (no surprise there!).
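For instance, a SQL-savvy analyst could run something like the following straight from the shell; the table and columns are hypothetical:

# Run a HiveQL query from the command line; Hive compiles it into MapReduce jobs
hive -e "SELECT region, SUM(amount) AS total_sales
         FROM sales
         GROUP BY region
         ORDER BY total_sales DESC
         LIMIT 10;"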
·         Reporting, Dashboards and Visualization: Pretty much every vendor tool can be used to create and run reports and dashboards on top of Hadoop. Wait, there is a caveat here. All of these tools (except those purpose built for Hadoop) use Hive and its drivers to manage the queries. Since Hive converts each query into MapReduce and runs it as a job, the time it takes to return results may be longer than running the same query on a traditional RDBMS. Yes, if you have Spark, the queries will be much faster, for sure. Additionally, Hive does not yet support all ANSI SQL features. So you may need to consider workarounds (e.g. preprocessing some of the logic) to make sure you get a reasonable response time.

But here is the good news. We believe the current limitations of Hive will go away very soon (maybe even within months) as the fantastic open source community continues to work on Spark to enable richer, faster SQL interaction.
·         Advanced Analytics: This is the area where open source beats everyone hands down. Advanced analytics is the combination of predictive analysis, algorithms, and correlations you need to understand patterns and drive decisions based on the suggested output. Open source initiatives like Mahout (an Apache project) and R can run natively on Hadoop. What does this mean? It means predictive algorithms that used to take days to run can now run in minutes, and can be re-run as many times as you want (these types of analysis require multiple iterations to reach an acceptable level of accuracy and probability).
Now let us address some of the most common questions we get:
·         Is the open source big data stack ready for prime time? The answer is yes. From both a technology and a commercial vendor support standpoint, it is mature enough to be deployed in the enterprise. This is based on our own experience implementing the stack at multiple clients and observing the stability of the platform over time.
·         Is this solution applicable to me? This is a tough question to answer, but we will make it simple: if you have less than 10 TB of data for analysis/BI and/or see no major value in analyzing data from social feeds like Facebook or Twitter, you should wait. But also ask yourself: is your data size under 10 TB only because you cannot keep all the data the business wants due to cost? Could the business drive more value if historical data were made available? And if you made that data available, would it exceed 10 TB?
·         Can I basically move my data warehouse to Hadoop? We wish we could give you a black and white answer. Unfortunately, this answer comes in fifty shades of grey, so to speak! First of all: not yet, not completely (refer to the section on Reporting, Dashboards and Visualization above). Secondly, it depends on the pain points and the future vision you are addressing. There are certainly components of this solution that can address technology pains you may be experiencing with a traditional BI environment.

However, the full benefit comes when you use all layers of the stack and transform both the technology and the business to make use of the capabilities brought forth by this architecture.
·         Are there enough people who know this stuff? Unfortunately, no. Many companies talk about it, but only a few have actually gotten their hands dirty. That said, based on our experience and assessment, the pool of people who know the open source stack is growing. Additionally, the opex required to implement and manage a Hadoop environment becomes significantly lower once you add in the capex savings realized with this solution.
·         My existing platform, storage and BI vendors claim they work on Hadoop. So can't I just go with that? The grey becomes even greyer here. The big data hype has led many vendors to claim compatibility. Some claims are accurate and some are not, because each vendor has its own definition of compatibility. One simple question you can ask is: "Does your offering run natively on Hadoop, i.e. does it make use of the distributed, parallel processing nodes enabled by Hadoop, or do you basically offer a way to read from/write to Hadoop?"


Wednesday, March 30, 2016

MongoDB Replicaset installation on VMs


1.       Create 3 VMs and install RHEL
[root@node2 ~]# uname -a
Linux node2.rs05.mongodb 2.6.32-504.el6.x86_64 #1 SMP Tue Sep 16 01:56:35 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux
[root@node2 ~]#

2.       Add a static IP (a private IP in our case) to all 3 VMs
[root@node2 ~]# more /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
TYPE=Ethernet
ONBOOT=yes
NM_CONTROLLED=yes
BOOTPROTO=static
IPADDR=192.168.0.45
NETMASK=255.255.255.0
GATEWAY=192.168.0.1
DNS1=192.168.0.1
[root@node2 ~]#

3.       Assign a hostname to all 3 VMs in the /etc/sysconfig/network file.
[root@node2 ~]# hostname
node2.rs05.mongodb
[root@node2 ~]#
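A minimal sketch of how that might be done on node2 (RHEL 6 keeps the persistent hostname in /etc/sysconfig/network; repeat on each node with its own name):

# Set the hostname for the current session
hostname node2.rs05.mongodb
# Persist it across reboots; /etc/sysconfig/network should end up with a line like HOSTNAME=node2.rs05.mongodb
sed -i 's/^HOSTNAME=.*/HOSTNAME=node2.rs05.mongodb/' /etc/sysconfig/network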

4.       Map the static IPs and hostnames of all 3 VMs in the /etc/hosts file
[root@node1 ~]# more /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.0.44    node1.rs05.mongodb
192.168.0.45    node2.rs05.mongodb
192.168.0.46    node3.rs05.mongodb
[root@node1 ~]#

5.       Disable the firewall (iptables) on all 3 VMs; the rules live in /etc/sysconfig/iptables
[root@node1 ~]# service iptables stop
iptables: Setting chains to policy ACCEPT: filter          [  OK  ]
iptables: Flushing firewall rules:                         [  OK  ]
iptables: Unloading modules:                               [  OK  ]
[root@node1 ~]#
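Stopping the service only lasts until the next reboot; assuming you also want iptables disabled permanently (acceptable here since the VMs sit on a private network), something like this keeps it off across reboots:

# Prevent iptables from starting at boot (RHEL 6 / SysV init)
chkconfig iptables off
# Confirm the runlevel settings
chkconfig --list iptables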

6.       Download the MongoDB tarball, extract it, and rename the extracted directory to mongo (a sketch of the download commands follows the listing below)
[root@node1 ~]# ll
total 44
-rw-------. 1 root root  1385 Mar 13 23:40 anaconda-ks.cfg
-rw-r--r--. 1 root root 28054 Mar 13 23:40 install.log
-rw-r--r--. 1 root root  7572 Mar 13 23:40 install.log.syslog
drwxr-xr-x. 3 root root  4096 Mar 14 03:20 mongo
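The download and extraction itself is not shown above; a rough sketch is below. The version and URL (3.2.4 here) are assumptions, so pick the build that matches your OS from mongodb.org:

# Download and unpack the MongoDB Linux tarball (version/URL are examples)
curl -O https://fastdl.mongodb.org/linux/mongodb-linux-x86_64-3.2.4.tgz
tar -zxvf mongodb-linux-x86_64-3.2.4.tgz
# Rename the extracted directory to something shorter
mv mongodb-linux-x86_64-3.2.4 mongo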


7.       Create a new mongodb folder at /usr/local/mongodb and copy the mongo contents into it.
[root@node1 local]# ll mongodb/
total 100
drwxr-xr-x. 2 root root  4096 Mar 14 03:28 bin
-rw-r--r--. 1 root root 34520 Mar 14 03:28 GNU-AGPL-3.0
-rw-r--r--. 1 root root 16726 Mar 14 03:28 MPL-2
-rw-r--r--. 1 root root  1359 Mar 14 03:28 README
-rw-r--r--. 1 root root 35910 Mar 14 03:28 THIRD-PARTY-NOTICES
[root@node1 local]#
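Assuming the tarball was extracted under /root/mongo as in step 6, the copy could look like this:

# Create the target directory and copy the extracted files into it
mkdir -p /usr/local/mongodb
cp -R /root/mongo/* /usr/local/mongodb/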

8.       Create the configuration file /etc/mongodb.conf with the db path, log path, replica set, and mongod.pid details.
[root@node1 local]# more /etc/mongodb.conf
logpath=/var/log/mongodb/mongod.log
logappend=true
fork=true
port=27017
dbpath=/var/lib/mongodb
pidfilepath=/var/run/mongodb/mongod.pid
replSet=RS05
oplogSize=1024
[root@node1 local]#
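The paths referenced in this file must exist before mongod is started; assuming they do not yet, create them on each node first:

# Create the data, log, and pid directories referenced in /etc/mongodb.conf
mkdir -p /var/lib/mongodb /var/log/mongodb /var/run/mongodb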

9.       Add the mongodb binaries to the PATH variable in the bash profile file
[root@node1 local]# more ~/.bash_profile
# .bash_profile

# Get the aliases and functions
if [ -f ~/.bashrc ]; then
        . ~/.bashrc
fi

# User specific environment and startup programs

PATH=$PATH:$HOME/bin:/usr/local/mongodb/bin

export PATH
[root@node1 local]#
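A quick check that the PATH change took effect in the current shell:

source ~/.bash_profile
which mongod mongo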

10.   Start the mongod instance on all the VMs (nodes)
[root@node1 local]# mongod -f /etc/mongodb.conf
about to fork child process, waiting until server is ready for connections.
forked process: 4631
child process started successfully, parent exiting
[root@node1 local]#

11.   Start the mongo shell on node 1 and add the remaining 2 nodes to the replica set.
$mongo
> config={_id:"RS05", members:[{_id:0, host:"node1.rs05.mongodb:27017", priority:3}, {_id:1, host:"node2.rs05.mongodb:27017", priority:2}, {_id:2, host:"node3.rs05.mongodb:27017", priority:1}]}
> rs.initiate(config)
> rs.status()
RS05:PRIMARY> rs.status()
{
        "set" : "RS05",
        "date" : ISODate("2016-03-15T09:32:19.938Z"),
        "myState" : 1,
        "term" : NumberLong(4),
        "heartbeatIntervalMillis" : NumberLong(2000),
        "members" : [
                {
                        "_id" : 0,
                        "name" : "node1.rs05.mongodb:27017",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                        "uptime" : 180,
                        "optime" : {
                                "ts" : Timestamp(1458034175, 2),
                                "t" : NumberLong(4)
                        },
                        "optimeDate" : ISODate("2016-03-15T09:29:35Z"),
                        "electionTime" : Timestamp(1458034175, 1),
                        "electionDate" : ISODate("2016-03-15T09:29:35Z"),
                        "configVersion" : 1,
                        "self" : true
                },
                {
                        "_id" : 1,
                        "name" : "node2.rs05.mongodb:27017",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 179,
                        "optime" : {
                                "ts" : Timestamp(1458034175, 2),
                                "t" : NumberLong(4)
                        },
                        "optimeDate" : ISODate("2016-03-15T09:29:35Z"),
                        "lastHeartbeat" : ISODate("2016-03-15T09:32:19.432Z"),
                        "lastHeartbeatRecv" : ISODate("2016-03-15T09:32:19.544Z"),
                        "pingMs" : NumberLong(0),
                        "syncingTo" : "node1.rs05.mongodb:27017",
                        "configVersion" : 1
                },
                {
                        "_id" : 2,
                        "name" : "node3.rs05.mongodb:27017",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 179,
                        "optime" : {
                                "ts" : Timestamp(1458034175, 2),
                                "t" : NumberLong(4)
                        },
                        "optimeDate" : ISODate("2016-03-15T09:29:35Z"),
                        "lastHeartbeat" : ISODate("2016-03-15T09:32:19.432Z"),
                        "lastHeartbeatRecv" : ISODate("2016-03-15T09:32:19.545Z"),
                        "pingMs" : NumberLong(0),
                        "syncingTo" : "node1.rs05.mongodb:27017",
                        "configVersion" : 1
                }
        ],
        "ok" : 1
}
RS05:PRIMARY>


12.   Now the replica set is ready to use; a quick sanity check is shown below.
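As a sketch (assuming the default port 27017), you can ask a node from the OS shell whether it currently sees itself as primary:

# Query the replica set state without opening an interactive shell
mongo --host node1.rs05.mongodb --port 27017 --eval "printjson(rs.isMaster())"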

Wednesday, March 23, 2016

MongoDB Standalone Installation

MongoDB Standalone Installation best practice and reference guide.

Install MongoDB Community Edition on Red Hat Enterprise Linux 6.6 using .rpm packages

MongoDB provides officially supported packages in its own repository.

Ø  mongodb-org             A metapackage that will automatically install the four component packages listed below.
Ø  mongodb-org-server Contains the mongod daemon and associated configuration and init scripts.
Ø  mongodb-org-mongos   Contains the mongos daemon.
Ø  mongodb-org-shell    Contains the mongo shell.
Ø  mongodb-org-tools   Contains the following MongoDB tools: mongoimport, bsondump, mongodump, mongoexport, mongofiles, mongooplog, mongoperf, mongorestore, mongostat, and mongotop.


The default /etc/mongod.conf configuration file supplied by the packages has bind_ip set to 127.0.0.1 by default. Modify this setting as needed for your environment before initializing a replica set.
The mongodb-org package includes various init scripts, including the init script /etc/rc.d/init.d/mongod. You can use these scripts to stop, start, and restart daemon processes.
            Ex: sudo service mongod start/stop/restart


Installing MongoDB
1.      Configure yum (the package management system); here I'm using mongodb-org-3.2.
Create a repo file /etc/yum.repos.d/mongodb-org-3.2.repo
and add the below details in the repo file:
[mongodb-org-3.2]
name=MongoDB Repository
baseurl=https://repo.mongodb.org/yum/redhat/$releasever/mongodb-org/3.2/x86_64/
gpgcheck=0
enabled=1
2.      Install MongoDB by issuing the repo install command
sudo yum install -y mongodb-org

* It is recommended to install the complete package; if you want to install each package individually, specify each component separately and append the version number to the package name.
* You can configure SELinux to allow MongoDB to start on Red Hat Linux-based systems by updating the SELINUX variable in /etc/selinux/config, but this is not recommended.


Default Data Directories in MongoDB
Ø  Data Files are stored in /var/lib/mongo
Ø  Log files are stored in /var/log/mongodb
If you change the user that runs the MongoDB process, you must modify the access control rights to the /var/lib/mongo and /var/log/mongodb directories to give this user access to these directories.
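For example, if mongod is run as a different service account (shown here as a hypothetical user appuser), the ownership change would look roughly like:

# Give the new service account ownership of the data and log directories
sudo chown -R appuser:appuser /var/lib/mongo /var/log/mongodb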


Simple Start Stop and Restart CMDs
$sudo service mongod start                Check the log to see that the service started and the default port is in listening mode
$sudo service mongod stop                 Check the log to see that the service shut down and the port was released.
You can follow the state of the process for errors or important messages by watching the output in the /var/log/mongodb/mongod.log file.
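A couple of convenience commands for that, assuming the default port and log location:

# Follow the mongod log while the service starts or stops
sudo tail -f /var/log/mongodb/mongod.log
# Confirm the server answers on the default port
mongo --port 27017 --eval "db.serverStatus().ok"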







Thursday, March 17, 2016

Cloudera CDH installation using remote repo

Hadoop.CDH.Installation.RemoteRepository
1.     Overview
This topic describes how to create a remote RPM packages/parcels repository and direct hosts in your Cloudera Manager deployment to use that repository.

Once you have created a parcels repository, go to Configuring the Cloudera Manager Server to Use the Parcel URL. After completing these steps, you have established the environment required to install a previous version of Cloudera Manager or install Cloudera Manager to hosts that are not connected to the Internet. Proceed with the installation process, being sure to target the newly created repository.

2.     Creating a Permanent Remote Repository
The repository is typically hosted using HTTP on a host inside your network. If you already have a web server in your organization, you can reuse it and place the parcel and package files on it.

Below are the detailed steps to set up a permanent remote repository:

1.      Log on to the server where you want to set up the web server and run the below commands to install the Apache httpd web server
yum install httpd
systemctl start httpd
systemctl enable httpd
·    RPM Packages
a.      Download the RPM packages for your OS distribution from:
b.     Move the RPM package files to the web server directory, and modify file permissions
mkdir -p /var/www/html/cdh5/packages
tar -xvf cm5.*-centos7.tar.gz -C /var/www/html/cdh5/packages
chmod -R ugo+rX /var/www/html/cdh5  (might not be required)
c.      After moving the files and changing permissions, visit http://hostname:80/cdh5/packages to verify that you can access the RPM packages. Apache may have been configured to not show indexes, which is also acceptable.
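Hosts that should install Cloudera Manager packages from this repository then need a yum repo definition pointing at it. The sketch below is an assumption: the baseurl must match wherever the repodata directory actually sits inside the extracted tarball, so browse the web server and adjust the path accordingly.

# Create /etc/yum.repos.d/cloudera-manager.repo on each target host (baseurl path is illustrative)
cat > /etc/yum.repos.d/cloudera-manager.repo <<'EOF'
[cloudera-manager]
name=Cloudera Manager (local mirror)
baseurl=http://hostname:80/cdh5/packages/
enabled=1
gpgcheck=0
EOF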

·    Parcels
a.      Download the parcel and manifest.json files for your OS distribution from:
·         CDH 5 - Impala, Spark, and Search are included in the CDH parcel
·         Accumulo - https://archive.cloudera.com/accumulo-c5/parcels/
·         GPL Extras - https://archive.cloudera.com/gplextras5/parcels/
b.     Move the .parcel and manifest.json files to the web server directory, and modify file permissions
mkdir -p /var/www/html/cdh5/parcels
mv CDH-5.*-el7.parcel /var/www/html/cdh5/parcels
mv manifest.json /var/www/html/cdh5/parcels
chmod -R ugo+rX /var/www/html/cdh5  (might not be required)
c.      After moving the files and changing permissions, visit http://hostname:80/cdh5/parcels to verify that you can access the parcel. Apache may have been configured to not show indexes, which is also acceptable.
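A quick way to verify the parcel repository from any host without a browser (using the same hostname placeholder as above):

# The manifest should be reachable over HTTP and list the parcels you copied in
curl -sI http://hostname:80/cdh5/parcels/manifest.json
curl -s  http://hostname:80/cdh5/parcels/manifest.json | head -n 20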

3.     Configuring the Cloudera Manager Server to Use the Parcel URL
1.      Use one of the following methods to open the parcel settings page:
·         Navigation bar
·         Click the Configuration button.
·         Menu
·         Select Administration > Settings
·         Select Category > Parcels
2.      In the Remote Parcel Repository URLs list, click the + (plus) icon to open an additional row.
3.      Enter the path to the parcel. For example, http://hostname:port/cdh5/parcels/.
4.      Click Save Changes to commit the changes.



4.     Reference
Creating and Using a Remote Parcel Repository for Cloudera Manager