Tuesday, May 31, 2016

Effective Agile Test Framework for Big data

Abstract –
As the saying goes, "Simplicity is the ultimate sophistication." A testing framework should be simple, agile, collaborative, and scalable. The framework designed and developed here addresses the three major challenges faced while testing big data applications, especially Hadoop-based ones: first, the processing time needed to crunch huge data sets; second, whether validation is required for all phases of data transformation and transition; and third, how transformed/processed data should be validated.

The framework developed here addresses the above challenges. It generates a small, representative data set from any original large data set using input space partition testing; using this data set for development and testing does not hinder continuous integration and delivery in agile processes. The test framework also accesses and validates data at the various transition points where data is transferred and transformed.

Introduction—
Big data is a big topic these days, one that has made its way up to the executive level. Most organizations may not yet fully understand what big data is, exactly, but they know they need a plan for managing it. Gartner describes big data in terms of the 3Vs (data Volume, Velocity, and Variety), all of which are growing dramatically: we are being asked to process data ever more quickly so we can respond to events as they happen, and that data is coming from an ever wider array of channels, sensors, and formats.



The velocity of data generation is high, and this data has to be analyzed in a timely manner. Valuable information can be derived by mining high volumes of datasets in various formats, and many programs that use big data techniques (e.g., Hadoop 1/2) to process big data are currently being developed. How to test such software effectively and efficiently is challenging. Prior work on big data testing has presented the issues and proposed solutions that generate huge, realistic data sets for data stores, such as YCSB benchmarking for NoSQL DBs and the SBK conductor tool kit for Splunk. This paper focuses on generating small, representative datasets from very large sets of data, which can save the cost of processing large amounts of data, a cost that hinders continuous integration and delivery during agile development. This paper introduces an effective, agile, scalable big data test framework to test Extract, Transform, and Load (ETL/ELT) applications that use big data techniques.

In an ETL (Extract, Transform, and Load) / ELT (Extract, Load, and Transform) process, data is extracted from multiple data sources such as RDBMSs, NoSQL DBs, web server logs, third-party data sets, flat files, machine data, etc., and is then transformed into a structured format to support queries. Finally, the data is loaded into HDFS or a NoSQL DB for customers, data scientists, and others to view.

The first technical problem is that processing petabytes of data takes days or weeks. Generating small and "representative" data sets for different data sources and clients quickly is challenging, and running against the entire historical data set, or even part of it, hinders the overall agile development process. We seek a smaller data set that represents this larger population, built from domain-specific constraints, business constraints, referential constraints, statistical distributions, and other constraints; this is called the representative data set hereafter. Second, data is transferred and transformed at different transition points during an ETL process: should we validate the data at one point or at all of them? Third, how should we validate transferred and transformed data? Manually validating high volumes of data is prohibitively costly, so this must be automated.

Test Framework—
To solve the three technical problems above, we built a scalable big data test framework that generates representative data sets and validates data transfers and transformations. Figure 1 shows data coming from different sources (on the left) being stored at various storage services and regular storage locations (on the right). The purpose of the test framework is to generate a representative data set from a large original data set; both data sets may be stored at the same or different services. The representative data set can be used for the validation of data transfer and transformation. The test framework validates a data transfer by comparing the data before and after the transfer. Likewise, the framework validates a data transformation against the requirements that specify how the data is transformed.




Test Data Generation—
To control the size of the test set, input space partitioning is used. Input space partition testing starts with an input domain model (IDM). The tester partitions the IDM, selects test values from the partitioned blocks, and applies combinatorial coverage criteria to generate tests. Figure 2 shows the general process of test data generation.




Consider "audits" (structured or unstructured data that lives outside a database) generated to reflect how data changes. To create a representative data set from a large number of audits using IDMs, we need to extract test values for every attribute of the audits. We first write a grammar describing the structure of these audits. The test framework then parses the audits against the grammar to collect all test values, including nested attributes, and computes statistical distributions of the test values, resulting in a parse tree. After analyzing the test values and constraints, the IDMs for every attribute are generated and merged into the parse tree. The final representative data set is generated from the parse tree, which captures the audit attributes, their relationships, constraints, and IDMs; the representative data set is then used and evaluated through repeated use. Machine-learning techniques can be leveraged here to improve the IDMs for accuracy and consistent representative data set generation, and parallel computing with Hadoop can speed up data generation. Even if the process is slow, an initial representative data set only needs to be generated once per project; the data set is then refined incrementally to adjust to changing constraints and user values.
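As a minimal, hedged illustration of the "each choice" idea behind input space partitioning, the lines below partition a delimited audit export on a single attribute and keep one record per block. The file name audits.csv, the comma delimiter, and the use of column 3 as the partitioning attribute are assumptions for this example; a real IDM combines many attributes, constraints, and distributions.

# Keep the header, then keep the first record seen for each distinct
# value (block) of attribute 3, producing one representative row per block.
awk -F',' 'NR==1 { print; next } !seen[$3]++ { print }' audits.csv > representative.csv

# Quick check: how many blocks (distinct attribute values) are covered?
tail -n +2 representative.csv | cut -d',' -f3 | sort -u | wc -l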


Data Validation—
During ETL processes, we need to validate data transfer and transformation to ensure that data integrity is not compromised. Validating data transfer is relatively simple because the expected values are equal to the original values. If the data is transferred from one database to another, we can validate the source and target data quickly by checking the number of columns and rows and the columns' names and data types. A good practice is to compare the row count and the sum of each numeric column in a table; this validation can be automated when the source and target data are provided. Validating data transformation is more complicated. For instance, we may aggregate data from ten source tables into one target table, where some of the target columns keep the original data types while others use different ones. We have two plans at different validation granularity levels. First, we validate at a high level whether the target data has correct data types and value ranges: the test framework derives data types and value ranges from the requirements, then generates tests to validate the target data.
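As a minimal sketch of the transfer check described above (row count plus a per-column sum), assuming the source and target tables have been exported to comma-delimited files with a header row; source.csv, target.csv, and the choice of column 2 are assumptions for illustration:

# Row counts (excluding the header) and the sum of numeric column 2
src_rows=$(tail -n +2 source.csv | wc -l)
tgt_rows=$(tail -n +2 target.csv | wc -l)
src_sum=$(tail -n +2 source.csv | awk -F',' '{ s += $2 } END { printf "%.2f\n", s }')
tgt_sum=$(tail -n +2 target.csv | awk -F',' '{ s += $2 } END { printf "%.2f\n", s }')
[ "$src_rows" = "$tgt_rows" ] && [ "$src_sum" = "$tgt_sum" ] \
  && echo "PASS: $src_rows rows, column-2 sum $src_sum" \
  || echo "FAIL: rows $src_rows vs $tgt_rows, sums $src_sum vs $tgt_sum"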

The second plan derives detailed specifications to validate every transformation rule. The test framework compares the source data with the target data to evaluate whether the target data was transformed correctly. Both plans require testers to write the transformation specification in a format that the test framework can read; the framework then automatically analyzes the specification and generates tests to validate the transformation.
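As a hedged sketch of what checking a single transformation rule might look like outside the framework, suppose a hypothetical rule says the second column of the target is the upper-cased copy of the second column of the source, matched on the key in column 1 (src.csv and tgt.csv are made-up, comma-delimited, header-less files):

# Join source and target on the key column, then flag any row that
# violates the hypothetical "target = UPPER(source)" rule.
join -t',' <(sort -t',' -k1,1 src.csv) <(sort -t',' -k1,1 tgt.csv) \
  | awk -F',' 'toupper($2) != $3 { print "rule violated for key " $1; bad=1 } END { exit bad }' \
  && echo "transformation rule holds"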





Friday, May 27, 2016

Splunk Standalone node installations


Environment Setup
·         Download Ubuntu Linux 12.04.3 LTS or CentOS 6.7 (any Linux flavor)
·         Download the latest Splunk Enterprise tarball (for simplicity)

Pre-requisites

·        Do the necessary network settings and assign a static IP if preferred.
·        Assign the hostname in the network and hosts files.
·        All hosts must run a recent version of Linux x86_64 (kernel 2.6+).
    - Python 2.7 must be installed and present in PATH. Python 3 is NOT supported.
        - Additionally, the following Python modules must be installed:
            - pycrypto (needed by paramiko)
            - simplejson
            - pyyaml
        * These modules can usually be installed using 'pip'
    - sar must be installed on all systems.
        - needed for metrics
    - All hosts must be reachable from each other.
        - DNS must be working or, alternatively, hosts files must be working
    - All hosts must be running SSH.
        - Preferably, SSH keys should be exchanged between all hosts
    - Time must be synchronized on all hosts (see the quick check after this list).
        - This is *extremely* important, as many measurements depend on time
          accuracy.
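A quick, hedged way to sanity-check the reachability, SSH, and time-sync items above (the host names in HOSTS are placeholders; replace them with your own):

# Check that each host answers ping and SSH, and report its clock drift
HOSTS="splunk_sh01 splunk_idx01 splunk_idx02"
local_epoch=$(date +%s)
for h in $HOSTS; do
    echo "== $h =="
    ping -c 1 -W 2 "$h" > /dev/null && echo "  ping: ok" || echo "  ping: FAILED"
    remote_epoch=$(ssh -o ConnectTimeout=5 "$h" 'date +%s' 2>/dev/null) || { echo "  ssh: FAILED"; continue; }
    drift=$(( local_epoch - remote_epoch ))
    echo "  ssh: ok, clock drift ${drift#-}s"
done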

Install Python 2.7
Red Hat now has "Software Collections" which take care of this sort of thing, so you can do:
yum install centos-release-SCL
yum install python27
Then, if you want to use it in your shell, you would run something like:
scl enable python27 bash
This sets up the correct environment variables (including PATH, LD_LIBRARY_PATH, etc.) and drops you into a new shell; it isn't hard to make that the default (one way is sketched below).
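For example, assuming the collection lands under /opt/rh/python27 (the usual location for Software Collections), sourcing its enable script from the shell profile makes it the default for new shells:

echo 'source /opt/rh/python27/enable' >> ~/.bashrc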

Install pycrypto
yum install gmp-devel
pip install pycrypto

Install simplejson and PyYAML (for example via pip, as shown below).
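Assuming pip is available on the machine, both modules can be installed with:

pip install simplejson
pip install pyyaml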
Install sar:
yum install sysstat

The above installations are required only if you're building a setup to run benchmark tools.

Splunk Enterprise installation.
Untar the Splunk tarball to /opt (or any other preferred location):
[root@splunk_standalone sbk_files]# tar -zxvf splunk-6.3.1-f3e41e4b37b2-Linux-x86_64.tgz -C /opt

Go to /opt and check that the splunk folder has been created:
[root@splunk_standalone opt]# ll
total 11936
-rw-r--r--.  1 root root  1522812 Mar  5 12:15 get-pip.py
drwxr-xr-x. 18 1000 1000     4096 Apr 14 21:38 Python-2.7.6
-rw-r--r--.  1 root root 10431288 Nov 10  2013 Python-2.7.6.tar.xz
drwxr-xr-x.  8 root root     4096 Apr 14 22:04 PyYAML-3.11
-rw-r--r--.  1 root root   248685 Mar 26  2014 PyYAML-3.11.tar.gz
drwxr-xr-x.  2 root root     4096 Mar 26  2015 rh
drwxr-xr-x.  8  506  506     4096 Oct 30  2015 splunk
[root@splunk_standalone opt]#

Check the files installed under the splunk directory:
[root@splunk_standalone opt]# cd splunk/
[root@splunk_standalone splunk]# ll
total 1796
drwxr-xr-x.  4 506 506    4096 Oct 30  2015 bin
-r--r--r--.  1 506 506      57 Oct 30  2015 copyright.txt
drwxr-xr-x. 14 506 506    4096 Oct 30  2015 etc
-rw-r--r--.  1 506 506       0 Oct 30  2015 ftr
drwxr-xr-x.  3 506 506    4096 Oct 30  2015 include
drwxr-xr-x.  6 506 506    4096 Oct 30  2015 lib
-r--r--r--.  1 506 506   62027 Oct 30  2015 license-eula.txt
drwxr-xr-x.  3 506 506    4096 Oct 30  2015 openssl
-r--r--r--.  1 506 506     509 Oct 30  2015 README-splunk.txt
drwxr-xr-x.  3 506 506    4096 Oct 30  2015 share
-r--r--r--.  1 506 506 1737206 Oct 30  2015 splunk-6.3.1-f3e41e4b37b2-linux-2.6-x86_64-manifest
[root@splunk_standalone splunk]#



Now start Splunk for the first time and accept the license:
[root@splunk_standalone splunk]# ./bin/splunk start --accept-license

This appears to be your first time running this version of Splunk.
Copying '/opt/splunk/etc/openldap/ldap.conf.default' to '/opt/splunk/etc/openldap/ldap.conf'.
Generating RSA private key, 1024 bit long modulus
..........................++++++
..++++++
e is 65537 (0x10001)
writing RSA key

Generating RSA private key, 1024 bit long modulus
...++++++
...........++++++
e is 65537 (0x10001)
writing RSA key

Moving '/opt/splunk/share/splunk/search_mrsparkle/modules.new' to '/opt/splunk/share/splunk/search_mrsparkle/modules'.

Splunk> CSI: Logfiles.

Checking prerequisites...
        Checking http port [8000]: open
        Checking mgmt port [8089]: open
        Checking appserver port [127.0.0.1:8065]: open
        Checking kvstore port [8191]: open
        Checking configuration...  Done.
                Creating: /opt/splunk/var/lib/splunk
                Creating: /opt/splunk/var/run/splunk
                Creating: /opt/splunk/var/run/splunk/appserver/i18n
                Creating: /opt/splunk/var/run/splunk/appserver/modules/static/css
                Creating: /opt/splunk/var/run/splunk/upload
                Creating: /opt/splunk/var/spool/splunk
                Creating: /opt/splunk/var/spool/dirmoncache
                Creating: /opt/splunk/var/lib/splunk/authDb
                Creating: /opt/splunk/var/lib/splunk/hashDb
        Checking critical directories...        Done
        Checking indexes...
                Validated: _audit _internal _introspection _thefishbucket history main summary
        Done
New certs have been generated in '/opt/splunk/etc/auth'.
        Checking filesystem compatibility...  Done
        Checking conf files for problems...
        Done
        Checking default conf files for edits...
        Validating installed files against hashes from '/opt/splunk/splunk-6.3.1-f3e41e4b37b2-linux-2.6-x86_64-manifest'
        All installed files intact.
        Done
All preliminary checks passed.

Starting splunk server daemon (splunkd)...
Generating a 1024 bit RSA private key
...............................................................................................++++++
..............++++++
writing new private key to 'privKeySecure.pem'
-----
Signature ok
subject=/CN=splunk_standalone/O=SplunkUser
Getting CA Private Key
writing RSA key
Done
                                                           [  OK  ]

Waiting for web server at http://127.0.0.1:8000 to be available...... Done


If you get stuck, we're here to help.
Look for answers here: http://docs.splunk.com

The Splunk web interface is at http://splunk_standalone:8000

Open the Splunk web interface at http://splunk_standalone:8000
The first time you open the page you'll have to change the password; the default credentials are admin / changeme.





The password can also be set on the backend via the CLI:
splunk edit user admin -password <New_Splunk_Admin_Password> -role admin -auth admin:changeme


  

Under Settings > System > Licensing, change the license group to the Free license and you're all set to go.




Finally, add SPLUNK_HOME to the bash profile and put its bin directory on the PATH (a minimal example is sketched below) so the splunk commands can be run from any directory.
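Assuming Splunk is installed under /opt/splunk and ~/.bash_profile is in use:

# Append Splunk environment settings to the bash profile and reload it
echo 'export SPLUNK_HOME=/opt/splunk' >> ~/.bash_profile
echo 'export PATH=$SPLUNK_HOME/bin:$PATH' >> ~/.bash_profile
source ~/.bash_profile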

splunk start
splunk restart
splunk stop

splunk start brings up both the splunkd and splunkweb daemons.


Monday, May 23, 2016

Splunk HUNK installations


Environment Setup
·        Download Ubuntu Linux 12.04.3 LTS or CentOS 6.7 (any Linux flavor)
·        Download the latest Splunk Enterprise tarball if you have a valid HUNK license.
·        If you have no license, download an older HUNK version from here:
HUNK Download page


·        Download and install any Hadoop suite (CDH, HDP, etc.) or install a standalone Hadoop node. (I'm using CDH.)

 

Pre-requisites

·         Do the necessary network settings and assign a static IP if preferred.
·         Assign the hostname in the network and hosts files (a minimal example follows).
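For CentOS 6, a hedged example of persisting the hostname and mapping it to the static IP; the host name and IP below are placeholders chosen to match this walkthrough's prompts, so adjust them to your network:

# Persist the hostname, map it to the static IP, and apply it now
echo "HOSTNAME=quickstart.cloudera" >> /etc/sysconfig/network
echo "192.168.56.101   quickstart.cloudera quickstart" >> /etc/hosts
hostname quickstart.cloudera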

Step 1> Identify the nodes and their roles, and assign hostnames for simplicity and easy management. In the scenario below I'm setting up one standalone Hadoop node.





Step 2> Install the Splunk Enterprise binaries on the machines assigned as Search Head, Indexer_01, and Indexer_02. (Extract the Splunk tarball to /opt as explained in the single-node installation.)
[root@splunk_standalone sbk_files]# tar -zxvf splunk-6.3.1-f3e41e4b37b2-Linux-x86_64.tgz -C /opt

Hadoop Standalone Node
[root@quickstart splunk]# pwd
/opt/splunk
[root@quickstart splunk]# ll
total 1848
drwxr-xr-x  4  506  506    4096 Mar 25 20:45 bin
-r--r--r--  1  506  506      57 Mar 25 20:05 copyright.txt
drwxr-xr-x 16  506  506    4096 May 12 23:58 etc
drwxr-xr-x  3  506  506    4096 Mar 25 20:40 include
drwxr-xr-x  6  506  506    4096 Mar 25 20:45 lib
-r--r--r--  1  506  506   63969 Mar 25 20:05 license-eula.txt
drwxr-xr-x  3  506  506    4096 Mar 25 20:40 openssl
-r--r--r--  1  506  506     842 Mar 25 20:07 README-splunk.txt
drwxr-xr-x  3  506  506    4096 Mar 25 20:40 share
-r--r--r--  1  506  506 1786388 Mar 25 20:46 splunk-6.4.0-f2c836328108-linux-2.6-x86_64-manifest
drwx--x---  6 root root    4096 May 12 23:57 var
[root@quickstart splunk]#


[root@quickstart hadoop]# pwd
/usr/lib/hadoop
[root@quickstart hadoop]# ll
total 64
drwxr-xr-x 2 root root  4096 Apr  6 01:21 bin
drwxr-xr-x 2 root root 12288 Apr  6 01:00 client
drwxr-xr-x 2 root root  4096 Apr  6 01:00 client-0.20
drwxr-xr-x 2 root root  4096 Apr  6 01:14 cloudera
drwxr-xr-x 2 root root  4096 Apr  6 00:59 etc
lrwxrwxrwx 1 root root    48 Apr  6 01:23 hadoop-annotations-2.6.0-cdh5.7.0.jar -> ../../jars/hadoop-annotations-2.6.0-cdh5.7.0.jar
lrwxrwxrwx 1 root root    37 Apr  6 00:59 hadoop-annotations.jar -> hadoop-annotations-2.6.0-cdh5.7.0.jar
lrwxrwxrwx 1 root root    41 Apr  6 01:23 hadoop-auth-2.6.0-cdh5.7.0.jar -> ../../jars/hadoop-auth-2.6.0-cdh5.7.0.jar
lrwxrwxrwx 1 root root    30 Apr  6 00:59 hadoop-auth.jar -> hadoop-auth-2.6.0-cdh5.7.0.jar
lrwxrwxrwx 1 root root    40 Apr  6 01:23 hadoop-aws-2.6.0-cdh5.7.0.jar -> ../../jars/hadoop-aws-2.6.0-cdh5.7.0.jar
lrwxrwxrwx 1 root root    29 Apr  6 00:59 hadoop-aws.jar -> hadoop-aws-2.6.0-cdh5.7.0.jar
lrwxrwxrwx 1 root root    43 Apr  6 01:23 hadoop-common-2.6.0-cdh5.7.0.jar -> ../../jars/hadoop-common-2.6.0-cdh5.7.0.jar
lrwxrwxrwx 1 root root    49 Apr  6 01:23 hadoop-common-2.6.0-cdh5.7.0-tests.jar -> ../../jars/hadoop-common-2.6.0-cdh5.7.0-tests.jar
lrwxrwxrwx 1 root root    32 Apr  6 00:59 hadoop-common.jar -> hadoop-common-2.6.0-cdh5.7.0.jar
lrwxrwxrwx 1 root root    38 Apr  6 00:59 hadoop-common-tests.jar -> hadoop-common-2.6.0-cdh5.7.0-tests.jar
lrwxrwxrwx 1 root root    40 Apr  6 01:23 hadoop-nfs-2.6.0-cdh5.7.0.jar -> ../../jars/hadoop-nfs-2.6.0-cdh5.7.0.jar
lrwxrwxrwx 1 root root    29 Apr  6 00:59 hadoop-nfs.jar -> hadoop-nfs-2.6.0-cdh5.7.0.jar
drwxr-xr-x 3 root root  4096 Apr  6 01:23 lib
drwxr-xr-x 2 root root  4096 Apr  6 01:21 libexec
-rw-r--r-- 1 root root 17087 Mar 23 12:01 LICENSE.txt
-rw-r--r-- 1 root root   101 Mar 23 12:01 NOTICE.txt
lrwxrwxrwx 1 root root    27 Apr  6 00:59 parquet-avro.jar -> ../parquet/parquet-avro.jar
lrwxrwxrwx 1 root root    32 Apr  6 00:59 parquet-cascading.jar -> ../parquet/parquet-cascading.jar
lrwxrwxrwx 1 root root    29 Apr  6 00:59 parquet-column.jar -> ../parquet/parquet-column.jar
lrwxrwxrwx 1 root root    29 Apr  6 00:59 parquet-common.jar -> ../parquet/parquet-common.jar
lrwxrwxrwx 1 root root    31 Apr  6 00:59 parquet-encoding.jar -> ../parquet/parquet-encoding.jar
lrwxrwxrwx 1 root root    29 Apr  6 00:59 parquet-format.jar -> ../parquet/parquet-format.jar
lrwxrwxrwx 1 root root    37 Apr  6 00:59 parquet-format-javadoc.jar -> ../parquet/parquet-format-javadoc.jar
lrwxrwxrwx 1 root root    37 Apr  6 00:59 parquet-format-sources.jar -> ../parquet/parquet-format-sources.jar
lrwxrwxrwx 1 root root    32 Apr  6 00:59 parquet-generator.jar -> ../parquet/parquet-generator.jar
lrwxrwxrwx 1 root root    36 Apr  6 00:59 parquet-hadoop-bundle.jar -> ../parquet/parquet-hadoop-bundle.jar
lrwxrwxrwx 1 root root    29 Apr  6 00:59 parquet-hadoop.jar -> ../parquet/parquet-hadoop.jar
lrwxrwxrwx 1 root root    30 Apr  6 00:59 parquet-jackson.jar -> ../parquet/parquet-jackson.jar
lrwxrwxrwx 1 root root    33 Apr  6 00:59 parquet-pig-bundle.jar -> ../parquet/parquet-pig-bundle.jar
lrwxrwxrwx 1 root root    26 Apr  6 00:59 parquet-pig.jar -> ../parquet/parquet-pig.jar
lrwxrwxrwx 1 root root    31 Apr  6 00:59 parquet-protobuf.jar -> ../parquet/parquet-protobuf.jar
lrwxrwxrwx 1 root root    33 Apr  6 00:59 parquet-scala_2.10.jar -> ../parquet/parquet-scala_2.10.jar
lrwxrwxrwx 1 root root    35 Apr  6 00:59 parquet-scrooge_2.10.jar -> ../parquet/parquet-scrooge_2.10.jar
lrwxrwxrwx 1 root root    35 Apr  6 00:59 parquet-test-hadoop2.jar -> ../parquet/parquet-test-hadoop2.jar
lrwxrwxrwx 1 root root    29 Apr  6 00:59 parquet-thrift.jar -> ../parquet/parquet-thrift.jar
lrwxrwxrwx 1 root root    28 Apr  6 00:59 parquet-tools.jar -> ../parquet/parquet-tools.jar
drwxr-xr-x 2 root root  4096 Apr  6 00:59 sbin
[root@quickstart hadoop]#


Configuring the Hadoop Node
1.      Create a working directory in the Hadoop user path (here I'm using root as the user and creating a splunkmr directory under it; a command sketch follows after these listings).
[root@quickstart hadoop]# hadoop fs -ls -R /user/root
drwxr-xr-x   - root supergroup          0 2016-05-13 01:30 /user/root/splunkmr
drwxr-xr-x   - root supergroup          0 2016-05-13 01:30 /user/root/splunkmr/bundles
-rw-r--r--   1 root supergroup   26880000 2016-05-13 01:30 /user/root/splunkmr/bundles/quickstart.cloudera-1463128021.bundle
drwxr-xr-x   - root supergroup          0 2016-05-13 02:22 /user/root/splunkmr/dispatch
drwxr-xr-x   - root supergroup          0 2016-05-13 02:21 /user/root/splunkmr/dispatch/1463131309.36
-rw-r--r--   1 root supergroup          0 2016-05-13 02:21 /user/root/splunkmr/dispatch/1463131309.36/1.hb
drwxr-xr-x   - root supergroup          0 2016-05-13 02:22 /user/root/splunkmr/dispatch/1463131339.37
-rw-r--r--   1 root supergroup          0 2016-05-13 02:22 /user/root/splunkmr/dispatch/1463131339.37/1.hb
drwxr-xr-x   - root supergroup          0 2016-05-13 01:30 /user/root/splunkmr/jars
-rw-r--r--   1 root supergroup     303139 2016-05-13 01:30 /user/root/splunkmr/jars/avro-1.7.4.jar
-rw-r--r--   1 root supergroup     166557 2016-05-13 01:30 /user/root/splunkmr/jars/avro-mapred-1.7.4.jar
-rw-r--r--   1 root supergroup     256241 2016-05-13 01:30 /user/root/splunkmr/jars/commons-compress-1.5.jar
-rw-r--r--   1 root supergroup     163151 2016-05-13 01:30 /user/root/splunkmr/jars/commons-io-2.1.jar
-rw-r--r--   1 root supergroup    9111670 2016-05-13 01:30 /user/root/splunkmr/jars/hive-exec-0.12.0.jar
-rw-r--r--   1 root supergroup    3342729 2016-05-13 01:30 /user/root/splunkmr/jars/hive-metastore-0.12.0.jar
-rw-r--r--   1 root supergroup     709070 2016-05-13 01:30 /user/root/splunkmr/jars/hive-serde-0.12.0.jar
-rw-r--r--   1 root supergroup     275186 2016-05-13 01:30 /user/root/splunkmr/jars/libfb303-0.9.0.jar
-rw-r--r--   1 root supergroup    2664668 2016-05-13 01:30 /user/root/splunkmr/jars/parquet-hive-bundle-1.5.0.jar
-rw-r--r--   1 root supergroup    1251514 2016-05-13 01:30 /user/root/splunkmr/jars/snappy-java-1.0.5.jar
drwxr-xr-x   - root supergroup          0 2016-05-13 01:30 /user/root/splunkmr/packages
-rw-r--r--   1 root supergroup   59506554 2016-05-13 01:30 /user/root/splunkmr/packages/hunk-6.2.2-257696-linux-2.6-x86_64.tgz
[root@quickstart hadoop]#

2.      Create a new directory on HDFS to hold the data, or use any existing data path (I've created a /data folder and placed a sample data file in it).
[root@quickstart hadoop]# hadoop fs -ls -R /data
-rw-r--r--   1 root supergroup   59367278 2016-05-13 01:27 /data/Hunkdata.json.gz
[root@quickstart hadoop]#
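For reference, the working directory and the data location above could have been created with commands along these lines (the paths match this walkthrough; adjust them and the local file name to your environment):

hadoop fs -mkdir -p /user/root/splunkmr
hadoop fs -mkdir -p /data
hadoop fs -put Hunkdata.json.gz /data/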


Configuring the Hadoop Node and HUNK Web UI
3.      Log in to the web UI and select Virtual Indexes.
4.      Configure a new provider (this points HUNK to the Hadoop cluster).





5.      Provide the JobTracker URI, the HDFS URI, and a valid provider name.
Get these details from core-site.xml and mapred-site.xml under /usr/lib/hadoop/etc/conf; for example, they can be pulled out with something like the commands below.
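A hedged way to locate the values: the property names vary between Hadoop/MapReduce versions (e.g., fs.defaultFS vs fs.default.name, mapred.job.tracker vs yarn.resourcemanager.address), so a broad grep over the config directory used in this walkthrough is a reasonable start (adjust the path if yours differs):

grep -B1 -A1 -E 'defaultFS|default.name' /usr/lib/hadoop/etc/conf/core-site.xml
grep -B1 -A1 -E 'job.tracker|jobtracker|resourcemanager' /usr/lib/hadoop/etc/conf/*-site.xml 2>/dev/null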




6.      Configure a new virtual index (this points to where the data is stored on HDFS).




7.      Once set up, start searching directly from the search option provided on the virtual index page.