Tuesday, May 31, 2016

Effective Agile Test Framework for Big Data

Abstract –
As the saying goes, “Simplicity is the ultimate sophistication.” A testing framework should be simple, agile, collaborative, and scalable. The framework designed and developed here addresses the three major challenges faced while testing big data applications, especially Hadoop-based ones: First, how do we handle the processing time needed to crunch huge data sets? Second, is validation required for all phases of data transformation and transition? Third, how should we validate transformed/processed data?

The framework developed here addresses these challenges: it generates a small representative data set from any original large data set using input space partition testing. Using this data set for development and testing does not hinder continuous integration and delivery in agile processes. The test framework also accesses and validates data at the various transition points where data is transferred and transformed.

Introduction—
Big data is a big topic these days, one that has made its way up to the executive level. Most organizations may not yet fully understand what big data is, exactly, but they know they need a plan for managing it. Gartner describes big data in terms of the 3Vs – Volume, Velocity, and Variety – all of which are growing dramatically: we are being asked to process data ever more quickly so we can respond to events as they happen, and that data is coming from an ever wider array of channels, sensors, and formats.



The velocity of data generation is high, and these data have to be analyzed in a timely manner. People can derive valuable information by mining high volumes of datasets in various formats. Many programs that use big data techniques (e.g., Hadoop 1/2) and process big data are currently being developed, and how to test such software effectively and efficiently is challenging. Existing work on big data testing has proposed solutions that generate huge and realistic data for databases, such as YCSB benchmarking for NoSQL DBs and the SBK conductor toolkit for Splunk. This paper focuses instead on generating small and representative datasets from very large sets of data. This saves the cost of processing large amounts of data, which otherwise hinders continuous integration and delivery during agile development. This paper introduces an effective, agile, scalable big data test framework to test Extract, Transform, and Load (ETL/ELT) applications that use big data techniques.

In an ETL (Extract, Transform, and Load) / ELT (Extract, Load, and Transform) process, data is extracted from multiple data sources such as RDBMSs, NoSQL DBs, web server logs, third-party data sets, flat files, machine data, etc., and is then transformed into a structured format to support queries. Finally, the data is loaded into HDFS or a NoSQL DB for consumption by customers, data scientists, and other users.
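To make the flow concrete, here is a minimal ETL sketch in plain Python. The file names and the log field layout are invented for illustration; a production pipeline would of course use tools such as Spark, Hive, or Sqoop against HDFS rather than local files.

import json

def extract(log_path):
    # Extract: read raw web-server log lines from a flat file.
    with open(log_path) as f:
        for line in f:
            yield line.rstrip("\n")

def transform(lines):
    # Transform: parse each raw line into a structured record.
    for line in lines:
        parts = line.split(" ")
        if len(parts) >= 3:                      # skip malformed lines
            yield {"ip": parts[0], "timestamp": parts[1], "url": parts[2]}

def load(records, out_path):
    # Load: write structured records as JSON lines (stand-in for HDFS/NoSQL).
    with open(out_path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

load(transform(extract("access.log")), "structured_records.jsonl")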

The first technical problem is that processing petabytes of data takes days or weeks. Generating small and “representative” data sets quickly for different data sources and clients is challenging, and running against the entire historical data set, or even part of it, hinders the overall agile development process. We seek a smaller data set that represents the larger population based on domain-specific constraints, business constraints, referential constraints, statistical distributions, and other constraints; this is called the representative data set hereafter. Second, data is transferred and transformed at different transition points during an ETL process. Should we validate the data at one point or at all of them? Third, how should we validate transferred and transformed data? Manually validating high volumes of data is prohibitively costly, so this must be automated.

Test Framework—
To solve these three technical problems, we propose a scalable big data test framework that generates representative data sets and validates data transfer and transformation. Figure 1 shows that data coming from different sources (on the left) is stored in various storage services and regular storage locations (on the right). The purpose of the test framework is to generate a representative data set from a large original data set; both data sets may be stored in the same or different services. The representative data set can then be used to validate data transfer and transformation. The test framework validates data transfer by comparing the data before and after the transfer. Likewise, the framework validates data transformation against the requirements that specify how the data is transformed.
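A high-level skeleton of such a framework might look like the following. This is only a sketch: the class and method names are illustrative, and rows are represented as plain Python dicts for simplicity.

class BigDataTestFramework:
    def __init__(self, idm_blocks, transform_rules):
        self.idm_blocks = idm_blocks            # {block name: predicate(row)}
        self.transform_rules = transform_rules  # {target column: fn(source row)}

    def representative_set(self, rows):
        # Keep only rows that cover an IDM block not seen before.
        covered, sample = set(), []
        for row in rows:
            hit = {name for name, pred in self.idm_blocks.items() if pred(row)}
            if hit - covered:
                sample.append(row)
                covered |= hit
        return sample

    def validate_transfer(self, source, target):
        # Transfer check: the data must be identical before and after the move.
        key = lambda r: tuple(sorted(r.items()))
        return sorted(source, key=key) == sorted(target, key=key)

    def validate_transformation(self, source, target):
        # Transformation check: apply the rules to the source, compare to target.
        expected = [{c: fn(r) for c, fn in self.transform_rules.items()} for r in source]
        return expected == list(target)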




Test Data Generation—
To control the size of the test set, input space partitioning is used. Input space partition testing starts with an input-domain model (IDM). The tester partitions the IDM, selects test values from the partitioned blocks, and applies combinatorial coverage criteria to generate tests, as sketched below. Figure 2 shows the general process of test data generation.
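As a small illustration (the attributes, blocks, and values below are made up): each attribute’s domain is partitioned into blocks, one value is picked per block, and a combinatorial criterion decides how block choices are combined into tests.

from itertools import product

# Input-domain model: attribute -> {block name: representative value}
idm = {
    "age":     {"minor": 12, "adult": 35, "senior": 70},
    "country": {"domestic": "US", "foreign": "DE"},
    "status":  {"active": "A", "inactive": "I"},
}

# "All combinations" coverage: the cross product of every block choice.
all_combinations = [
    dict(zip(idm, values))
    for values in product(*(blocks.values() for blocks in idm.values()))
]

# "Each choice" coverage: far fewer tests, every block used at least once.
width = max(len(blocks) for blocks in idm.values())
each_choice = [
    {attr: list(blocks.values())[i % len(blocks)] for attr, blocks in idm.items()}
    for i in range(width)
]

print(len(all_combinations), "tests for all-combinations coverage")  # 12
print(len(each_choice), "tests for each-choice coverage")            # 3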




Consider “audits” (semi-structured or unstructured data outside the database) that are generated to reflect how data has changed. To create a representative data set from a large number of audits using IDMs, we need to extract test values for every attribute of the audits. We first write a grammar to describe the structure of these audits. The test framework then parses the audits against this grammar to collect all test values, including nested attributes, and computes the statistical distributions of those values, resulting in a parse tree. After analyzing the test values and constraints, the IDMs for every attribute are generated and merged into the parse tree. The final representative data set is generated from the parse tree, which holds the audit attributes, their relationships, constraints, and IDMs.

The representative data set is then evaluated and refined through repeated use. We can also leverage machine-learning techniques to improve the accuracy of the IDMs and keep representative data set generation consistent, and use parallel computing and Hadoop to speed up data generation. Even if the process is slow, the initial representative data set only needs to be generated once per project; it is then refined incrementally to adjust to changing constraints and user values.
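The sketch below is a simplified stand-in for the grammar-driven step, assuming a made-up key=value audit format: it parses the audits, records the distribution of observed values per attribute, and greedily keeps only the audits that cover new attribute values.

from collections import Counter, defaultdict

raw_audits = [
    "action=UPDATE table=orders user=alice rows=3",
    "action=DELETE table=orders user=bob rows=1",
    "action=UPDATE table=customers user=alice rows=7",
    "action=UPDATE table=orders user=bob rows=1",
]

def parse(line):
    # Crude "grammar": each audit is a space-separated list of key=value pairs.
    return dict(pair.split("=", 1) for pair in line.split())

# Statistical distribution of test values per attribute.
distributions = defaultdict(Counter)
parsed = [parse(line) for line in raw_audits]
for audit in parsed:
    for attr, value in audit.items():
        distributions[attr][value] += 1

# Representative set: greedily keep audits that cover a new (attribute, value) pair.
covered, representative = set(), []
for line, audit in zip(raw_audits, parsed):
    new = set(audit.items()) - covered
    if new:
        representative.append(line)
        covered |= new

print(distributions["action"])   # Counter({'UPDATE': 3, 'DELETE': 1})
print(representative)            # 3 of the 4 audits already cover every value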


Data Validation—
During ETL processes, we need to validate data transfer and transformation to ensure that data integrity is not compromised. Validating data transfer is relatively simple because the expected values are equal to the original values. If the data is transferred from one database to another, we can validate the source and target data quickly by checking the number of columns and rows and the columns’ names and data types. A good practice is also to compare the row count and the sum of each numeric column in a table; this validation can be automated when the source and target data are provided. Validating data transformation is more complicated. For instance, we may aggregate data from ten source tables into one target table, where some columns of the target table keep the original data types while other columns use different ones. We have two plans at different validation granularity levels. First, at a high level, we validate whether the target data has the correct data types and value ranges. The test framework derives the data types and value ranges from the requirements, then generates tests to validate the target data.
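A minimal transfer-validation sketch with pandas might look like this (the table contents and column names are hypothetical); it automates the checks described above: row counts, column names, data types, and per-column sums.

import pandas as pd

def validate_transfer(source: pd.DataFrame, target: pd.DataFrame):
    issues = []
    if len(source) != len(target):
        issues.append(f"row count {len(source)} != {len(target)}")
    if list(source.columns) != list(target.columns):
        issues.append("column names differ")
    elif not source.dtypes.equals(target.dtypes):
        issues.append("column data types differ")
    else:
        # Cheap integrity check: compare per-column sums of the numeric columns.
        src_sums = source.select_dtypes("number").sum()
        tgt_sums = target.select_dtypes("number").sum()
        if not src_sums.equals(tgt_sums):
            issues.append("numeric column sums differ")
    return issues  # an empty list means the transfer looks intact

source = pd.DataFrame({"order_id": [1, 2, 3], "amount": [9.5, 20.0, 7.25]})
target = pd.DataFrame({"order_id": [1, 2, 3], "amount": [9.5, 20.0, 7.25]})
print(validate_transfer(source, target))   # []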

The second plan derives detailed specifications to validate every transformation rule. The test framework compares the source data with the target data to evaluate whether the target data was transformed correctly. Both plans require testers to write the transformation specification in a format that the test framework can read; the framework then automatically analyzes the specification and generates tests to validate the transformation.
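A sketch of the second plan, assuming an invented specification format in which each rule maps a target column to a function of the source row; the framework then checks every rule against every source/target row pair.

source_rows = [
    {"first_name": "Ada", "last_name": "Lovelace", "amount_cents": 1250},
    {"first_name": "Alan", "last_name": "Turing", "amount_cents": 999},
]
target_rows = [
    {"full_name": "Ada Lovelace", "amount_usd": 12.50},
    {"full_name": "Alan Turing", "amount_usd": 9.99},
]

# Machine-readable transformation specification: target column -> rule.
spec = {
    "full_name":  lambda r: f"{r['first_name']} {r['last_name']}",
    "amount_usd": lambda r: r["amount_cents"] / 100,
}

def validate_transformation(source, target, spec):
    failures = []
    for i, (src, tgt) in enumerate(zip(source, target)):
        for column, rule in spec.items():
            expected = rule(src)
            if tgt.get(column) != expected:
                failures.append((i, column, expected, tgt.get(column)))
    return failures  # an empty list means every rule held on every row

print(validate_transformation(source_rows, target_rows, spec))  # []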