Saturday, June 25, 2016

Splunk Indexer Cluster Deployment


Environment Setup
·         Download Ubuntu Linux 12.04.3 LTS or CentOS 6.7 (any Linux flavor)
·         Download the latest Splunk Enterprise tarball (for simplicity)
·         Download the latest Splunk Universal Forwarder tarball.

Pre-requisites

·         Configure the necessary network settings and assign a static IP if preferred.
·         Assign the hostname in the network and hosts files.

Overview

1.      Identify the requirement
An indexer cluster with replication factor (RF) 2 and search factor (SF) 2.
For this, the number of nodes required will be RF + 2:
2 indexers, 1 master node and 1 search head.

2.      Install the Splunk Enterprise cluster instances on your network
Install Splunk Enterprise on 4 nodes: 2 search peer nodes (indexers, matching the RF of 2), 1 master node and 1 search head.
You need at least the replication factor number of peer nodes, but you might want to add more peers to boost indexing capacity.
You also need two more instances, one for the master node and the other for the search head.

3.      Enable clustering on the instances:
a. Enable the master node. See "Enable the master node".
Important: When the master starts up for the first time, it will block indexing on the peers until you have enabled and restarted the replication factor number of peers.
b. Enable the peer nodes. See "Enable the peer nodes".
c. Enable the cluster search head. It's easier to set up a search head for a cluster than for a non-clustered group of indexers. See "Enable the search head".

Step 1> Identify the nodes with their roles and assign hostnames for simplicity and easy management. In the scenario below I’m setting up 1 search head, 2 indexers, 1 master node and 1 forwarder node.
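For reference, the /etc/hosts entries on each machine could look like the following (the IP addresses are placeholders for this example; substitute your own):
10.10.10.1    indexer_01
10.10.10.2    indexer_02
10.10.10.3    c_master_node
10.10.10.4    search_head
10.10.10.5    u_fwd_01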






Step 2> Install the Splunk Enterprise binaries on the machines assigned as the Search Head, Indexer_01, Indexer_02 and Master Node. (Extract the Splunk tarball to the /opt folder as explained in the single-node installation.)
[root@splunk_standalone sbk_files]# tar -zxvf splunk-6.3.1-f3e41e4b37b2-Linux-x86_64.tgz -C /opt

Search Head Machine
[root@search_head splunk]# pwd
/opt/splunk
[root@search_head splunk]# ll
total 1800
drwxr-xr-x.  4  506  506    4096 Oct 30  2015 bin
-r--r--r--.  1  506  506      57 Oct 30  2015 copyright.txt
drwxr-xr-x. 16  506  506    4096 May  6 04:36 etc
drwxr-xr-x.  3  506  506    4096 Oct 30  2015 include
drwxr-xr-x.  6  506  506    4096 Oct 30  2015 lib
-r--r--r--.  1  506  506   62027 Oct 30  2015 license-eula.txt
drwxr-xr-x.  3  506  506    4096 Oct 30  2015 openssl
-r--r--r--.  1  506  506     509 Oct 30  2015 README-splunk.txt
drwxr-xr-x.  3  506  506    4096 Oct 30  2015 share
-r--r--r--.  1  506  506 1737206 Oct 30  2015 splunk-6.3.1-f3e41e4b37b2-linux-2.6-x86_64-manifest
drwx--x--x.  6 root root    4096 May  6 04:06 var
[root@search_head splunk]#


Indexer_01 Machine
[root@indexer_01 splunk]# pwd
/opt/splunk
[root@indexer_01 splunk]# ll
total 1800
drwxr-xr-x.  4  506  506    4096 Oct 30  2015 bin
-r--r--r--.  1  506  506      57 Oct 30  2015 copyright.txt
drwxr-xr-x. 16  506  506    4096 May  6 04:35 etc
drwxr-xr-x.  3  506  506    4096 Oct 30  2015 include
drwxr-xr-x.  6  506  506    4096 Oct 30  2015 lib
-r--r--r--.  1  506  506   62027 Oct 30  2015 license-eula.txt
drwxr-xr-x.  3  506  506    4096 Oct 30  2015 openssl
-r--r--r--.  1  506  506     509 Oct 30  2015 README-splunk.txt
drwxr-xr-x.  3  506  506    4096 Oct 30  2015 share
-r--r--r--.  1  506  506 1737206 Oct 30  2015 splunk-6.3.1-f3e41e4b37b2-linux-2.6-x86_64-manifest
drwx--x--x.  6 root root    4096 May  6 04:07 var
[root@indexer_01 splunk]#


Indexer_02 Machine
[root@indexer_02 splunk]# pwd
/opt/splunk
[root@indexer_02 splunk]# ll
total 1800
drwxr-xr-x.  4  506  506    4096 Oct 30  2015 bin
-r--r--r--.  1  506  506      57 Oct 30  2015 copyright.txt
drwxr-xr-x. 16  506  506    4096 May  8 22:15 etc
drwxr-xr-x.  3  506  506    4096 Oct 30  2015 include
drwxr-xr-x.  6  506  506    4096 Oct 30  2015 lib
-r--r--r--.  1  506  506   62027 Oct 30  2015 license-eula.txt
drwxr-xr-x.  3  506  506    4096 Oct 30  2015 openssl
-r--r--r--.  1  506  506     509 Oct 30  2015 README-splunk.txt
drwxr-xr-x.  3  506  506    4096 Oct 30  2015 share
-r--r--r--.  1  506  506 1737206 Oct 30  2015 splunk-6.3.1-f3e41e4b37b2-linux-2.6-x86_64-manifest
drwx--x--x.  6 root root    4096 May  8 22:08 var
[root@indexer_02 splunk]#

Master Node Machine
[root@c_master_node splunk]# pwd
/opt/splunk
[root@c_master_node splunk]# ll
total 1800
drwxr-xr-x.  4  506  506    4096 Oct 30  2015 bin
-r--r--r--.  1  506  506      57 Oct 30  2015 copyright.txt
drwxr-xr-x. 16  506  506    4096 May  9 01:53 etc
drwxr-xr-x.  3  506  506    4096 Oct 30  2015 include
drwxr-xr-x.  6  506  506    4096 Oct 30  2015 lib
-r--r--r--.  1  506  506   62027 Oct 30  2015 license-eula.txt
drwxr-xr-x.  3  506  506    4096 Oct 30  2015 openssl
-r--r--r--.  1  506  506     509 Oct 30  2015 README-splunk.txt
drwxr-xr-x.  3  506  506    4096 Oct 30  2015 share
-r--r--r--.  1  506  506 1737206 Oct 30  2015 splunk-6.3.1-f3e41e4b37b2-linux-2.6-x86_64-manifest
drwx--x--x.  6 root root    4096 May  9 01:49 var
[root@c_master_node splunk]#


Step 3> Install the Universal Forwarder binaries on the machine assigned the name c_u_fwd_01. (Extract the Universal Forwarder tarball to the /opt folder as explained in the single-node installation.)
[root@splunk_standalone sbk_files]# tar -zxvf splunkforwarder-6.3.1-f3e41e4b37b2-Linux-x86_64.tgz -C /opt

C_U_Fwd_01 Machine
[root@u_fwd_01 splunkforwarder]# pwd
/opt/splunkforwarder
[root@u_fwd_01 splunkforwarder]# ll
total 132
drwxr-xr-x.  3  506  506  4096 Oct 30  2015 bin
-r--r--r--.  1  506  506    57 Oct 30  2015 copyright.txt
drwxr-xr-x. 13  506  506  4096 May  6 04:08 etc
drwxr-xr-x.  2  506  506  4096 Oct 30  2015 include
drwxr-xr-x.  4  506  506  4096 Oct 30  2015 lib
-r--r--r--.  1  506  506 62027 Oct 30  2015 license-eula.txt
drwxr-xr-x.  3  506  506  4096 Oct 30  2015 openssl
-r--r--r--.  1  506  506   509 Oct 30  2015 README-splunk.txt
drwxr-xr-x.  3  506  506  4096 Oct 30  2015 share
-r--r--r--.  1  506  506 31876 Oct 30  2015 splunkforwarder-6.3.1-f3e41e4b37b2-linux-2.6-x86_64-manifest
drwx--x--x.  6 root root  4096 May  6 04:08 var
[root@u_fwd_01 splunkforwarder]#


Configuring the instances into a distributed deployment
Enable the master
To enable an indexer as the master node:
1. Click Settings in the upper right corner of Splunk Web.
2. In the Distributed environment group, click Clustering.
3. Select Enable clustering.
4. Select Master node and click Next.
5. There are a few fields to fill out:
  •  Replication Factor. The replication factor determines how many copies of data the cluster maintains. The default is 3. For more information on the replication factor, see "Replication factor". Choose the right replication factor now. It is inadvisable to increase the replication factor later, once the cluster has significant amounts of data.
  •  Search Factor. The search factor determines how many immediately searchable copies of data the cluster maintains. The default is 2. For more information on the search factor, see "Search factor". Choose the right search factor now. It is highly inadvisable to increase the search factor later, once the cluster has significant amounts of data.
  •  Security Key. This is the key that authenticates communication between the master and the peers and search heads. The key must be the same across all cluster instances. If you leave the field empty here, leave it empty on the peers and search heads as well.
6. Click Enable master node.
7. The message appears, "You must restart Splunk for the master node to become active. You can restart Splunk from Server Controls." Click Go to Server Controls to go to the Settings page where you can initiate the restart.
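Alternatively, the master can be enabled from the CLI instead of Splunk Web (a sketch, assuming the RF 2 / SF 2 values from the requirement above and a security key of your choice):
[root@c_master_node splunk]# ./bin/splunk edit cluster-config -mode master -replication_factor 2 -search_factor 2 -secret <security_key> -auth admin:<password>
[root@c_master_node splunk]# ./bin/splunk restart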




Enable the peer
To enable an indexer as a peer node:
1. Click Settings in the upper right corner of Splunk Web.
2. In the Distributed environment group, click Clustering.
3. Select Enable clustering.
4. Select Peer node and click Next.
5. There are a few fields to fill out:
  •  Master IP address or Hostname. Enter the master's IP address or hostname. For example: https://10.152.31.202.
  •  Master port. Enter the master's management port. For example: 8089.
  •  Peer replication port. This is the port on which the peer receives replicated data streamed from the other peers. You can specify any available, unused port for this purpose. This port must be different from the management port and receiving port.
  •  Security key. This is the key that authenticates communication between the master and the peers and search heads. The key must be the same across all cluster instances. If the master has a security key, you must enter it here.
6. Click Enable peer node.
7. The message appears, "You must restart Splunk for the peer node to become active." Click Go to Server Controls to go to the Settings page where you can initiate the restart.
8. Repeat this process for all the cluster's peer nodes.
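Alternatively, each peer can be enabled from the CLI (a sketch; the master URI reuses the example address above, and 9887 is simply a commonly chosen replication port):
[root@indexer_01 splunk]# ./bin/splunk edit cluster-config -mode slave -master_uri https://10.152.31.202:8089 -replication_port 9887 -secret <security_key> -auth admin:<password>
[root@indexer_01 splunk]# ./bin/splunk restart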

Enable the search head
To enable a Splunk instance as a cluster search head:
1. Click Settings in the upper right corner of Splunk Web.
2. In the Distributed environment group, click Clustering.
3. Select Enable clustering.
4. Select Search head node and click Next.
5. There are a few fields to fill out:
  •  Master IP address or Hostname. Enter the master's IP address or hostname. For example: https://10.152.31.202.
  •  Master port. Enter the master's management port. For example: 8089.
  •  Security key. This is the key that authenticates communication between the master and the peers and search heads. The key must be the same across all cluster instances. If the master has a security key, you must enter it here.
6. Click Enable search head node.
7. The message appears, "You must restart Splunk for the search head node to become active. You can restart Splunk from Server Controls." Click Go to Server Controls to go to the Settings page where you can initiate the restart.
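Alternatively, the search head can be enabled from the CLI (a sketch, reusing the example master URI and security key):
[root@search_head splunk]# ./bin/splunk edit cluster-config -mode searchhead -master_uri https://10.152.31.202:8089 -secret <security_key> -auth admin:<password>
[root@search_head splunk]# ./bin/splunk restart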

Use forwarders to get your data into the indexer cluster
Configure the connection from forwarder to peer node
There are three steps to setting up connections between forwarders and peer nodes:
1. Configure the peer nodes to receive data from forwarders (a minimal example follows this list).
2. Configure the forwarders to send data to the peer nodes.
3. Enable indexer acknowledgment for each forwarder. This step is required to ensure end-to-end data fidelity. If that is not a requirement for your deployment, you can skip this step.
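For step 1, receiving can be enabled on each peer either from Splunk Web (Settings > Forwarding and receiving) or from the CLI; a minimal sketch, assuming the conventional receiving port 9997:
[root@indexer_01 splunk]# ./bin/splunk enable listen 9997 -auth admin:<password>
This adds a [splunktcp://9997] stanza to inputs.conf on the peer. Repeat the same on indexer_02.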

Example: A load-balancing forwarder with indexer acknowledgment
Here's a sample outputs.conf configuration for a forwarder that's using load balancing to send data in sequence to three peers in a cluster. It assumes that each of the peers has previously been configured to use 9997 for its receiving port:
[tcpout]
defaultGroup=my_LB_peers
[tcpout:my_LB_peers]
autoLBFrequency=40
server=10.10.10.1:9997,10.10.10.2:9997,10.10.10.3:9997
useACK=true
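After saving outputs.conf on the forwarder, restart it so the settings take effect, and optionally list the configured receivers (a sketch, assuming the universal forwarder install path used earlier):
[root@u_fwd_01 splunkforwarder]# ./bin/splunk restart
[root@u_fwd_01 splunkforwarder]# ./bin/splunk list forward-server -auth admin:<password>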






Tuesday, May 31, 2016

Effective Agile Test Framework for Big Data

Abstract –
As the saying goes, "Simplicity is the ultimate sophistication." A test framework should be simple, agile, collaborative and scalable. The framework we design and develop here addresses three major challenges faced while testing any big data application, especially Hadoop-based ones: First, how do we handle the processing time needed to crunch huge data sets? Second, is validation required for all phases of data transformation and transition? Third, how should we validate transformed/processed data?

The framework developed here addresses the above challenges: it generates a small, representative data set from any large original data set using input space partition testing. Using this data set for development and testing does not hinder continuous integration and delivery when using agile processes. The test framework also accesses and validates data at the various transition points where data is transferred and transformed.

Introduction—
Big data is a big topic these days, one that has made its way up to the executive level. Most organizations may not yet fully understand what big data is, exactly, but they know they need a plan for managing it. Gartner describes big data in terms of the 3Vs – data Volume, Velocity, and Variety – all of which are growing dramatically: we are being asked to process data ever more quickly so we can respond to events as they happen, and that data is coming from an ever wider array of channels, sensors and formats.



The velocity of data generation is high, and this data has to be analyzed in a timely manner. Valuable information can be derived by mining high volumes of data sets in various formats. Many programs that use big data techniques (e.g., Hadoop 1/2) to process big data are currently being developed, and testing such software effectively and efficiently is challenging. Previously presented work on big data testing has proposed solutions that generate large and realistic data for databases, such as YCSB benchmarking for NoSQL DBs and the SBK conductor toolkit for Splunk. This paper focuses on generating small, representative data sets from very large sets of data. This can save the cost of processing large amounts of data, which hinders continuous integration and delivery during agile development. This paper introduces an effective, agile, scalable big data test framework to test Extract, Transform, and Load (ETL/ELT) applications that use big data techniques.

In an ETL (Extract, Transform and Load) / ELT (Extract, Load and Transform) process, data is extracted from multiple data sources such as RDBMSs, NoSQL DBs, web server logs, third-party data sets, flat files, machine data, etc., and is then transformed into a structured format to support queries. Finally, the data is loaded into HDFS or a NoSQL DB for customers, data scientists and others to view.

The first technical problem is that processing petabytes of data takes days or weeks. Generating small, "representative" data sets for different data sources and clients quickly is challenging, and running against all or part of the historical data hinders the overall agile development process. We seek a smaller data set that represents the larger population, based on domain-specific constraints, business constraints, referential constraints, statistical distributions and other characteristics; this is called the representative data set hereafter. Second, data is transferred and transformed at different transition points during an ETL process. Should we validate the data at one point or at all of them? Third, how should we validate transferred and transformed data? Manually validating high volumes of data is prohibitively costly, so validation must be automated.

Test Framework—
To solve these three technical problems, we built a scalable big data test framework to generate representative data sets and validate data transfers and transformations. Figure 1 shows that data coming from different sources (on the left) is stored at various storage services and regular storage places (on the right). The purpose of the test framework is to generate a representative data set from a large original data set. Both data sets may be stored at the same or different services. The representative data set can be used for the validation of data transfer and transformation. The test framework validates the data transfer by comparing the data before and after the transfer. Likewise, the framework validates the data transformation against the requirements that specify how data is transformed.




Test Data Generation—
To control the size of the test set, input space partitioning is used. Input space partition testing starts with an input-domain model (IDM). The tester partitions the IDM, selects test values from partitioned blocks, and applies combinatorial coverage criteria to generate tests. Figure 2 shows the general process of test data generation.




Consider "audits" (structured/unstructured data outside a database) generated to reflect how data is changed. To create a representative data set from a large number of audits using IDMs, we need to extract test values for every attribute of the audits. We write a grammar to describe the structure of these audits. The test framework then parses the audits against the grammar to collect all test values, including nested attributes, and computes statistical distributions of the test values, resulting in a parse tree. After analyzing the test values and constraints, the IDMs for every attribute are generated and merged into the parse tree. The final representative data set is generated from the parse tree, which holds the audit attributes, their relationships, constraints, and IDMs. The representative data set is then evaluated and refined through repeated use. We can also leverage machine-learning techniques to improve the IDMs for accuracy and consistent representative data set generation, and use parallel computing and Hadoop to speed up data generation. Even if the process is slow, we only need to generate an initial representative data set once for a project; the data set is then refined incrementally to adjust to changing constraints and user values.


Data Validation—
During ETL processes, we need to validate data transfer and transformation to ensure that data integrity is not compromised. Validating data transfer is relatively simple because we know the expected values are equal to the original values. If the data is transferred from one database to another, we can validate the source and target data quickly by checking the number of columns and rows, and the columns’ names and data types. A good practice is to compare the row count and the sum of each numeric column of a table; this validation can be automated when the source and target data are provided. Validating data transformation is more complicated. For instance, we may aggregate data from ten source tables into one target table. Some of the columns of the target table use the same data types as the original, while other columns may use different data types. We have two plans at different validation granularity levels. First, we validate whether the target data has correct data types and value ranges at a high level. The test framework derives data types and value ranges from the requirements, then generates tests to validate the target data.

The second plan derives detailed specifications to validate every transformation rule. The test framework compares the source data with the target data to evaluate whether the target data was transformed correctly. Both plans require testers to write the transformation specification in a format that the test framework can read. Then the framework automatically analyzes the specification and generates tests to validate the transformation.            
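As a minimal illustration of the automated transfer check described above (a sketch only; the file paths and the summed column are placeholders), a row-count and column-sum comparison for a file copied into HDFS could be scripted as:
src_rows=$(wc -l < /data/source/orders.csv)
tgt_rows=$(hdfs dfs -cat /data/target/orders.csv | wc -l)
src_sum=$(awk -F',' '{s+=$3} END {print s}' /data/source/orders.csv)
tgt_sum=$(hdfs dfs -cat /data/target/orders.csv | awk -F',' '{s+=$3} END {print s}')
[ "$src_rows" = "$tgt_rows" ] && [ "$src_sum" = "$tgt_sum" ] && echo "transfer validated" || echo "transfer validation FAILED"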





Friday, May 27, 2016

Splunk Standalone node installations


Environment Setup
·         Download Ubuntu Linux 12.04.3 LTS or CentOS 6.7 (any Linux flavor)
·         Download the latest Splunk Enterprise tarball (for simplicity)

Pre-requisites

·        Configure the necessary network settings and assign a static IP if preferred.
·        Assign the hostname in the network and hosts files.
·        All hosts must run a recent version of Linux x86_64 (kernel 2.6+)
    - Python 2.7 must be installed and present in PATH. Python 3 is NOT supported.
        - Additionally, the following Python modules must be installed:
            - pycrypto (needed by paramiko)
            - simplejson
            - pyyaml
        * These modules can usually be installed using 'pip'
    - sar must be installed on all systems.
        - needed for metrics
    - All hosts must be reachable from each other.
        - DNS must be working or, alternatively, host files must be working
    - All hosts must be running SSH.
        - Preferably, SSH keys should be exchanged to all
    - Time is synchronized on all hosts.
        - This is *extremely* important as many measurements depend on time
          accuracy.
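A minimal sketch for the SSH and time-sync prerequisites above (assuming CentOS 6.x; the remote hostname is a placeholder, repeat for every host in the environment):
[root@splunk_standalone ~]# ssh-keygen -t rsa
[root@splunk_standalone ~]# ssh-copy-id root@<remote_host>
[root@splunk_standalone ~]# yum install -y ntp
[root@splunk_standalone ~]# service ntpd start && chkconfig ntpd on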

Install python 2.7
Redhat now has "Software Collections" which take care of this sort of thing, so you can do:
yum install centos-release-SCL
yum install python27
Then if you want to use it in your shell you would run something like:
scl enable python27 bash
Which sets up the correct environment variables (including PATH and LD_LIBRARY_PATH etc.) and dumps you into a new shell - pretty sure it wouldn't be too hard to make that the default.

Install pycrypto
yum install gmp-devel
pip install pycrypto

Install SimpleJson.
Install Pyyaml
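Both modules can usually be installed with pip (assuming pip is available on the host):
pip install simplejson
pip install pyyaml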
Install sar
yum install sysstat

The above installations are required only if you're building a solution to run benchmark tools.

Splunk Enterprise installation.
Untar the Splunk tarball to /opt (or any other preferred location)
[root@splunk_standalone sbk_files]# tar -zxvf splunk-6.3.1-f3e41e4b37b2-Linux-x86_64.tgz -C /opt

Go to /opt and check the splunk folder that was created
[root@splunk_standalone opt]# ll
total 11936
-rw-r--r--.  1 root root  1522812 Mar  5 12:15 get-pip.py
drwxr-xr-x. 18 1000 1000     4096 Apr 14 21:38 Python-2.7.6
-rw-r--r--.  1 root root 10431288 Nov 10  2013 Python-2.7.6.tar.xz
drwxr-xr-x.  8 root root     4096 Apr 14 22:04 PyYAML-3.11
-rw-r--r--.  1 root root   248685 Mar 26  2014 PyYAML-3.11.tar.gz
drwxr-xr-x.  2 root root     4096 Mar 26  2015 rh
drwxr-xr-x.  8  506  506     4096 Oct 30  2015 splunk
[root@splunk_standalone opt]#

Check the files installed under the splunk directory.
[root@splunk_standalone opt]# cd splunk/
[root@splunk_standalone splunk]# ll
total 1796
drwxr-xr-x.  4 506 506    4096 Oct 30  2015 bin
-r--r--r--.  1 506 506      57 Oct 30  2015 copyright.txt
drwxr-xr-x. 14 506 506    4096 Oct 30  2015 etc
-rw-r--r--.  1 506 506       0 Oct 30  2015 ftr
drwxr-xr-x.  3 506 506    4096 Oct 30  2015 include
drwxr-xr-x.  6 506 506    4096 Oct 30  2015 lib
-r--r--r--.  1 506 506   62027 Oct 30  2015 license-eula.txt
drwxr-xr-x.  3 506 506    4096 Oct 30  2015 openssl
-r--r--r--.  1 506 506     509 Oct 30  2015 README-splunk.txt
drwxr-xr-x.  3 506 506    4096 Oct 30  2015 share
-r--r--r--.  1 506 506 1737206 Oct 30  2015 splunk-6.3.1-f3e41e4b37b2-linux-2.6-x86_64-manifest
[root@splunk_standalone splunk]#



Now start Splunk for the first time and accept the license
[root@splunk_standalone splunk]# ./bin/splunk start --accept-license

This appears to be your first time running this version of Splunk.
Copying '/opt/splunk/etc/openldap/ldap.conf.default' to '/opt/splunk/etc/openldap/ldap.conf'.
Generating RSA private key, 1024 bit long modulus
..........................++++++
..++++++
e is 65537 (0x10001)
writing RSA key

Generating RSA private key, 1024 bit long modulus
...++++++
...........++++++
e is 65537 (0x10001)
writing RSA key

Moving '/opt/splunk/share/splunk/search_mrsparkle/modules.new' to '/opt/splunk/share/splunk/search_mrsparkle/modules'.

Splunk> CSI: Logfiles.

Checking prerequisites...
        Checking http port [8000]: open
        Checking mgmt port [8089]: open
        Checking appserver port [127.0.0.1:8065]: open
        Checking kvstore port [8191]: open
        Checking configuration...  Done.
                Creating: /opt/splunk/var/lib/splunk
                Creating: /opt/splunk/var/run/splunk
                Creating: /opt/splunk/var/run/splunk/appserver/i18n
                Creating: /opt/splunk/var/run/splunk/appserver/modules/static/css
                Creating: /opt/splunk/var/run/splunk/upload
                Creating: /opt/splunk/var/spool/splunk
                Creating: /opt/splunk/var/spool/dirmoncache
                Creating: /opt/splunk/var/lib/splunk/authDb
                Creating: /opt/splunk/var/lib/splunk/hashDb
        Checking critical directories...        Done
        Checking indexes...
                Validated: _audit _internal _introspection _thefishbucket history main summary
        Done
New certs have been generated in '/opt/splunk/etc/auth'.
        Checking filesystem compatibility...  Done
        Checking conf files for problems...
        Done
        Checking default conf files for edits...
        Validating installed files against hashes from '/opt/splunk/splunk-6.3.1-f3e41e4b37b2-linux-2.6-x86_64-manifest'
        All installed files intact.
        Done
All preliminary checks passed.

Starting splunk server daemon (splunkd)...
Generating a 1024 bit RSA private key
...............................................................................................++++++
..............++++++
writing new private key to 'privKeySecure.pem'
-----
Signature ok
subject=/CN=splunk_standalone/O=SplunkUser
Getting CA Private Key
writing RSA key
Done
                                                           [  OK  ]

Waiting for web server at http://127.0.0.1:8000 to be available...... Done


If you get stuck, we're here to help.
Look for answers here: http://docs.splunk.com

The Splunk web interface is at http://splunk_standalone:8000

Start the Splunk web interface at http://splunk_standalone:8000
The first time you open the page you’ll have to change the password; the default credentials are admin / changeme





The password can also be set from the backend via the CLI:
splunk edit user admin -password <New_Splunk_Admin_Password> -role admin -auth admin:changeme


  

Under Settings → System → Licensing, change the license group to the Free license and you’re all set to go




Finally.
Add SPLUNK_HOME and the Splunk bin directory to the PATH in the bash profile file.
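For example (a sketch assuming the /opt/splunk install path used above), append these lines to ~/.bash_profile:
export SPLUNK_HOME=/opt/splunk
export PATH=$SPLUNK_HOME/bin:$PATH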

splunk start
splunk restart
splunk stop

The above commands start, restart and stop both the splunkd and splunkweb daemons.