An Introduction to Big Data
What is Big Data?
Big data is a buzzword, or one could say a catch-phrase, used to describe huge volumes of structured and unstructured data (text, images, audio, video, log files, emails, simulations, 3D models, military surveillance, e-commerce records and so on) that are so massive they are difficult to process using traditional database and software techniques. In most enterprise scenarios the data is too big, moves too fast, or exceeds current processing capacity. In other words, big data is simply data so huge and complex that it becomes tiresome, difficult or slow to collect, store, sort, process, retrieve and analyze with existing relational database management tools or traditional data processing techniques. Big data usually involves data sets whose size is beyond the ability of commonly used software tools to capture, curate, manage and process within a tolerable elapsed time.
Some examples of Big Data:
· An airline jet collects 10 terabytes of sensor data for every 30 minutes of flying time.
· Twitter has over 500 million registered users.
1. The USA leads with 141.8 million accounts, representing 27.4 percent of all Twitter users, well ahead of Brazil, Japan, the UK and Indonesia.
2. 79% of US Twitter users are more likely to recommend brands they follow.
3. 67% of US Twitter users are more likely to buy from brands they follow.
4. 57% of all companies that use social media for business use Twitter.
How fast data is increasing:
Look carefully at the picture, which shows what happens every 60 seconds on the internet. It gives a sense of how much data is generated every second, minute, day and year, and how exponentially that rate is growing. According to an analysis by TechNewsDaily, we might generate more than 8 zettabytes of data by 2015.
Characteristics of Big Data:
Data scientists break big data into four dimensions: volume,
variety, velocity and veracity.
· Volume: How large the data is. Big data can amount to hundreds of terabytes or even petabytes of information.
· Velocity: The increasing rate at which data flows into an organization.
· Variety: A common theme in big data systems is that the source data is diverse and doesn't fall into neat relational structures.
· Veracity: The biases, noise and abnormality in data. Is the data that is being stored and mined meaningful to the problem being analyzed?
Big Data Problems:
Traditional systems built within a company to handle relational databases may not be able to support or scale the workload as data is generated with high volume, high velocity and great variety.
· Volume: As an example, the terabytes of posts generated on Facebook or the roughly 400 billion tweets posted on Twitter each year could mean big data. This enormous amount of data has to be stored somewhere so that it can be analyzed and turned into data science reports for different solutions and problem-solving approaches.
· Velocity: Big data requires fast processing, and the time factor plays a crucial role in many organizations. For instance, millions of records are generated in the stock market, and they need to be stored and processed at the same speed at which they come into the system.
· Variety: There is no specific format for big data. It can take any form: structured, unstructured, text, images, audio, video, log files, emails, simulations, 3D models, etc. Until now we have mostly worked with structured data, and it can be difficult to handle the quality and quantity of the unstructured or semi-structured data we generate on a daily basis.
How Big Data frameworks handle the above problems:
· Distributed File System (DFS): In a DFS, we can divide a large set of data files into smaller blocks and load these blocks onto multiple machines, which are then ready for parallel processing. For example, if we have 1 terabyte of data to read with 1 machine and 4 input/output channels, each channel reading at 100 MB/sec, the whole 1 TB will be read in roughly 45 minutes. If we instead divide the 1 TB of data across 10 such machines, it can be read in parallel, which reduces the total time to about 4.5 minutes (see the first sketch after this list).
· Parallel Processing: When data resides on N servers, the combined power of those N servers can be used to process it in parallel for analysis, which reduces the time the user has to wait for the final report or analyzed data (see the second sketch after this list).
· Fault Tolerance: The fault tolerance of big data frameworks such as Hadoop is one of the main reasons for using them to run jobs. Even on a large cluster where individual nodes or network components may experience high rates of failure, these frameworks can guide jobs to successful completion because the data is replicated across multiple nodes/slaves (see the third sketch after this list).
·
Use of Commodity Hardware: Most
of the Big Data tools and frameworks need commodity hardware for its working
which reduces the cost of the total infrastructure and very easy to add more
clusters as data size increase.
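To make the DFS read-time comparison concrete, here is a minimal Python sketch of the same back-of-the-envelope arithmetic. The machine count, channel count and per-channel speed are the assumed numbers from the example above, not measurements, and 1 TB is treated as 10^6 MB for simplicity.

```python
# Back-of-the-envelope read-time estimate for the DFS example above.
# All numbers are illustrative assumptions, not benchmarks.

def read_time_minutes(data_mb, machines, channels_per_machine, mb_per_sec_per_channel):
    """Minutes to read data_mb megabytes when every machine reads its share in parallel."""
    total_throughput = machines * channels_per_machine * mb_per_sec_per_channel  # MB/sec
    return data_mb / total_throughput / 60

ONE_TB_MB = 1_000_000  # treating 1 TB as 10^6 MB

single = read_time_minutes(ONE_TB_MB, machines=1, channels_per_machine=4, mb_per_sec_per_channel=100)
cluster = read_time_minutes(ONE_TB_MB, machines=10, channels_per_machine=4, mb_per_sec_per_channel=100)

print(f"1 machine  : {single:.1f} minutes")   # ~41.7 minutes, rounded to ~45 in the text
print(f"10 machines: {cluster:.1f} minutes")  # ~4.2 minutes, rounded to ~4.5 in the text
```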
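The parallel-processing idea can be illustrated with a small Python sketch using the standard multiprocessing module. The "blocks" and the per-block record count below are hypothetical stand-ins for whatever data and analysis a real job would run; the point is only that independent blocks are processed at the same time and the partial results are combined afterwards.

```python
# A minimal sketch of parallel processing: each worker analyzes one block of data
# independently, and the partial results are combined at the end (a map/reduce-style pattern).
from multiprocessing import Pool

def analyze_block(block):
    """Hypothetical per-block analysis: count the records in one block."""
    return len(block)

if __name__ == "__main__":
    # Pretend these lists are blocks of a large file spread across machines.
    blocks = [["record"] * 1000 for _ in range(8)]

    with Pool(processes=4) as pool:            # 4 workers stand in for 4 cluster nodes
        partial_counts = pool.map(analyze_block, blocks)

    print("total records:", sum(partial_counts))  # combine the partial results
```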
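Replication-based fault tolerance can be sketched in the same spirit. The block-to-replica mapping and the simulated node failure below are invented purely for illustration; real frameworks such as Hadoop HDFS manage replica placement and retries internally.

```python
# Sketch of fault tolerance through replication: every block lives on several nodes,
# so losing one node does not lose the data.
replicas = {
    "block-1": ["node-a", "node-b", "node-c"],
    "block-2": ["node-b", "node-d", "node-e"],
}
failed_nodes = {"node-b"}  # simulate a node failure

def read_block(block_id):
    """Try each replica in turn; a single node failure does not make the block unreadable."""
    for node in replicas[block_id]:
        if node not in failed_nodes:
            return f"{block_id} read from {node}"
    raise IOError(f"all replicas of {block_id} are unavailable")

for block in replicas:
    print(read_block(block))
```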