Abstract –
As the saying goes, "Simplicity is the ultimate sophistication": a testing
framework should be simple, agile, collaborative, and scalable. The framework
designed and developed here addresses three major challenges faced while
testing big data applications, especially those built on Hadoop. First, how do
we cope with the processing time needed to crunch huge data sets? Second, is
validation required at every phase of data transformation and transition?
Third, how should we validate transformed and processed data? The framework
developed here addresses these challenges: it generates a small, representative
data set from any original large data set using input space partition testing.
Using this data set for development and testing does not hinder continuous
integration and delivery in agile processes. The test framework also accesses
and validates data at the various transition points where data is transferred
and transformed.
Introduction—
Big data is a big topic these days, one that has made its way up to the
executive level. Most organizations may not yet fully understand what big data
is, exactly, but they know they need a plan for managing it. Gartner's 3Vs of
big data (Volume, Velocity, and Variety) are all growing dramatically: we are
being asked to process data ever more quickly so we can respond to events as
they happen, and that data is coming from an ever wider array of channels,
sensors, and formats.
The velocity of data generation is high, and these data have to be analyzed
in a timely manner. People can derive valuable information by mining high
volumes of data in various formats. Many programs that use big data techniques
(e.g., Hadoop 1/2) and process big data are currently being developed. How to
test such software effectively and efficiently is challenging. Prior work has
described issues in big data testing and proposed solutions that generate
large, realistic data for data stores, such as YCSB benchmarking for NoSQL
databases and the SBK conductor toolkit for Splunk. This paper focuses instead
on generating small, representative data sets from very large sets of data.
This saves the cost of processing large amounts of data, which hinders
continuous integration and delivery during agile development. This paper
introduces an effective, agile, and scalable big data test framework to test
Extract, Transform, and Load (ETL/ELT) applications that use big data
techniques.
In an ETL (Extract, Transform, and Load) or ELT (Extract, Load, and Transform)
process, data is extracted from multiple sources such as RDBMSs, NoSQL
databases, web server logs, third-party data sets, flat files, and machine
data, and then transformed into a structured format to support queries.
Finally, the data is loaded into HDFS or a NoSQL database for consumption by
customers, data scientists, and other users.
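To make this flow concrete, the following is a minimal PySpark sketch of one such ETL path; the paths, column names, and schema are hypothetical illustrations, not part of the framework described in this paper. Raw web server logs are extracted from HDFS, transformed into a structured, query-friendly shape, and loaded back into HDFS as Parquet.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl_sketch").getOrCreate()

# Extract: read raw web-server logs (hypothetical path and schema).
raw = spark.read.json("hdfs:///raw/webserver_logs/2024-01-01")

# Transform: project and clean into a structured, query-friendly shape.
structured = (
    raw.select("user_id", "url", "status", "ts")
       .withColumn("event_date", F.to_date("ts"))
       .filter(F.col("status") < 500)
)

# Load: write the structured data back to HDFS for downstream consumers.
structured.write.mode("overwrite").parquet("hdfs:///warehouse/web_events")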
The first technical problem is that processing petabytes of data takes days or
weeks. Quickly generating small and "representative" data sets for different
data sources and clients is challenging, and running all or even part of the
historical data hinders the overall agile development process. We seek a
smaller data set that represents the larger population based on
domain-specific constraints, business constraints, referential constraints,
statistical distributions, and other characteristics; we call this the
representative data set hereafter. Second, data is transferred and transformed
at different transition points during an ETL process. Should we validate the
data at one point or at all of them? Third, how should we validate transferred
and transformed data? Manually validating high volumes of data is
prohibitively costly, so this must be automated.
Test Framework—
To solve the three technical problems above, we built a scalable big data test
framework that generates representative data sets and validates data transfers
and transformations. Figure 1 shows that data coming from different sources (on
the left) is stored at various storage services and regular storage places (on
the right). The purpose of the test framework is to generate a representative
data set from a large original data set. Both data sets may be stored at the
same or different services. The representative data set can be used for the
validation of data transfer and transformation. The test framework validates
the data transfer by comparing the data before and after the transfer.
Likewise, the framework also validates the data transformation with the
requirements that specify how data is transformed.
Test Data Generation—
To control the size of the test set, input space partitioning is used.
Input space partition testing starts with an input-domain model (IDM). The
tester partitions the IDM, selects test values from the partitioned blocks,
and applies combinatorial coverage criteria to generate tests. Figure 2 shows
the general process of test data generation.
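As an illustration, the sketch below models a small IDM with hypothetical attributes and blocks, picks one representative value per block, and applies all-combinations coverage to generate test inputs. In the framework itself the IDMs are derived from real data and constraints rather than hand-written like this.

from itertools import product

# Hypothetical input-domain model (IDM): each attribute is partitioned into
# blocks, and candidate test values are listed for each block.
idm = {
    "transaction_amount": {
        "zero": [0],
        "small_positive": [1, 250],
        "large_positive": [10_000, 9_999_999],
    },
    "currency": {
        "common": ["USD", "EUR"],
        "rare": ["ISK"],
    },
}

def representative_values(attribute_blocks):
    """Select one representative value per block (here: the first value)."""
    return [values[0] for values in attribute_blocks.values()]

# All-combinations coverage over the selected block representatives.
attributes = list(idm)
tests = [
    dict(zip(attributes, combo))
    for combo in product(*(representative_values(idm[a]) for a in attributes))
]

for t in tests:
    print(t)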
Consider "audits" (structured or unstructured data outside a database) that
are generated to record how data changes. To create a representative data set
from a large number of audits using IDMs, the framework needs to extract test
values for every attribute of the audits. We first write a grammar describing
the structure of these audits. The test framework then parses the audits
against the grammar to collect all test values, including nested attributes,
and computes statistical distributions of the test values, resulting in a
parse tree. After analyzing the test values and constraints, IDMs for every
attribute are generated and merged into the parse tree. The final
representative data set is generated from the parse tree, which holds the
audit attributes, their relationships, constraints, and IDMs. The
representative data set is used and evaluated through repeated use.
Machine-learning techniques can also be leveraged to improve the accuracy of
the IDMs and the consistency of representative data set generation, and
parallel computing on Hadoop can speed up data generation. Even if the process
is slow, an initial representative data set only needs to be generated once
for a project; the data set is then refined incrementally to adjust to
changing constraints and user values.
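A minimal sketch of this idea follows, assuming audits arrive as one JSON object per line; this hypothetical format stands in for the grammar-driven parsing described above. Test values are collected per attribute, their frequency distribution is computed, and simple IDM blocks are derived, keeping the most frequent value of each block as its representative.

import json
from collections import Counter, defaultdict

def collect_values(audit_lines):
    """Collect observed values and their frequencies per audit attribute."""
    values = defaultdict(Counter)
    for line in audit_lines:
        record = json.loads(line)
        for attribute, value in record.items():
            values[attribute][value] += 1
    return values

def build_idm(values, max_blocks=3):
    """Partition each attribute's observed values into up to max_blocks blocks,
    keeping the most frequent value of each block as its representative."""
    idm = {}
    for attribute, counter in values.items():
        ranked = [v for v, _ in counter.most_common()]
        size = max(1, len(ranked) // max_blocks)
        blocks = [ranked[i:i + size] for i in range(0, len(ranked), size)]
        idm[attribute] = [block[0] for block in blocks[:max_blocks]]
    return idm

audits = [
    '{"action": "UPDATE", "table": "orders", "rows": 12}',
    '{"action": "DELETE", "table": "orders", "rows": 1}',
    '{"action": "UPDATE", "table": "users",  "rows": 3}',
]
print(build_idm(collect_values(audits)))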
Data Validation—
During ETL processes, we need to validate data transfer and transformation to
ensure that data integrity is not compromised. Validating data transfer is
relatively simple because the expected values are equal to the original
values. If data is transferred from one database to another, we can quickly
validate the source and target data by checking the number of columns and
rows and the columns' names and data types. A best practice is to compare the
row counts and the sums of each numeric column in a table; this validation can
be automated when the source and target data are provided. Validating data
transformation is more complicated. For instance, we may aggregate data from
ten source tables into one target table, where some target columns keep the
original data types while others use different ones. We have two plans at
different validation granularity levels. First, at a high level, we validate
whether the target data has the correct data types and value ranges. The test
framework derives data types and value ranges from the requirements, then
generates tests to validate the target data.
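The sketch below illustrates both checks under simplifying assumptions: tables are represented as lists of Python dictionaries, and a hand-written type/range spec stands in for one derived from requirements. validate_transfer compares row counts, column names, and numeric column sums, while validate_types_and_ranges flags target values whose type or range violates the spec.

def validate_transfer(source_rows, target_rows):
    """Transfer checks: row count, column names, and per-column sums."""
    checks = {
        "row_count": len(source_rows) == len(target_rows),
        "columns": set(source_rows[0]) == set(target_rows[0]),
    }
    for column in source_rows[0]:
        if isinstance(source_rows[0][column], (int, float)):
            checks[f"sum:{column}"] = (
                sum(r[column] for r in source_rows)
                == sum(r[column] for r in target_rows)
            )
    return checks

def validate_types_and_ranges(target_rows, spec):
    """High-level transformation check against expected types and value ranges."""
    failures = []
    for i, row in enumerate(target_rows):
        for column, (expected_type, low, high) in spec.items():
            value = row[column]
            if not isinstance(value, expected_type) or not (low <= value <= high):
                failures.append((i, column, value))
    return failures

spec = {"amount": (float, 0.0, 1e9)}   # hypothetical requirement-derived spec
print(validate_types_and_ranges([{"amount": 12.5}, {"amount": -3.0}], spec))
print(validate_transfer([{"id": 1, "amount": 12.5}], [{"id": 1, "amount": 12.5}]))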
The second plan derives
detailed specifications to validate every transformation rule. The test
framework compares the source data with the target data to evaluate whether the
target data was transformed correctly. Both plans require testers to write the
transformation specification in a format that the test framework can read. Then
the framework automatically analyzes the specification and generates tests to
validate the transformation.
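As a sketch of the second plan, assume a hypothetical rule format in which each rule names a target column and a Python function that computes its expected value from a source row; the check below replays every rule against paired source and target rows and reports mismatches.

def validate_rules(source_rows, target_rows, rules):
    """Rule-by-rule transformation validation: compare expected vs. actual values."""
    failures = []
    for rule_name, (target_column, transform) in rules.items():
        for src, tgt in zip(source_rows, target_rows):
            expected = transform(src)
            if tgt[target_column] != expected:
                failures.append((rule_name, src, tgt[target_column], expected))
    return failures

# Example rule (hypothetical transformation spec): the target's full_name column
# should be the concatenation of the source's first_name and last_name.
rules = {
    "full_name_concat": ("full_name", lambda s: f"{s['first_name']} {s['last_name']}"),
}
source = [{"first_name": "Ada", "last_name": "Lovelace"}]
target = [{"full_name": "Ada Lovelace"}]
print(validate_rules(source, target, rules))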